Backlink URL Classification Implementation Plan

The Problem

The current system classifies referring domains (e.g., techcrunch.com) directly, but a domain alone rarely carries enough context. Result: roughly 70% of classifications fall through to the expensive LLM stage.

What we have:

  • referring_domains table with domain-level aggregates
  • Classification pipeline (Rules → Vectorize → Content → LLM)
  • urls table with classification columns (but not populated)
  • backlinks table schema (empty, never used)
  • ensureUrl() function with hash-based deduplication

What's missing:

  • Code to fetch actual backlink URLs from DataForSEO backlinks/backlinks/live
  • Code to store backlinks in backlinks table
  • Code to classify individual URLs and store in urls table
  • Code to bubble up URL classifications to domain level

The Solution

Data Flow (New)

1. Fetch backlinks for target domain
└─> DataForSEO backlinks/backlinks/live ($0.02 + $0.0001/row)

2. Store each backlink
└─> Insert source URL into `urls` table (dedupe via hash)
└─> Insert backlink record into `backlinks` table

3. Classify unclassified URLs
└─> Queue job for each URL where page_type IS NULL
└─> Run 4-stage pipeline with FULL URL context (not just domain)
└─> Store classification on `urls` record

4. Bubble up to domain
└─> When a domain has N+ classified URLs, aggregate
└─> Majority vote / weighted average for domain_type, channel_bucket, etc.
└─> Update `domains` table with aggregated classification

Implementation Steps

Phase 1: Backlink Fetching

1.1 Add getBacklinks() function

Calls the backlinks/backlinks/live API; a fetch sketch follows the field list below. Returns individual backlink records with:

  • url_from - The page linking to the target (this is the gold: the page we actually classify)
  • url_to - Target page receiving the link
  • anchor - Anchor text (classification signal)
  • domain_from - Source domain
  • domain_from_rank - Domain authority
  • is_new, is_lost - Link status
  • page_from_rank - Page-level authority
  • dofollow - Link type
  • text_pre, text_post - Surrounding context (classification signal!)

Cost: $0.02 base + $0.0001 per backlink row
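
A minimal sketch of getBacklinks(), assuming credentials live in env.DATAFORSEO_LOGIN / env.DATAFORSEO_PASSWORD (illustrative binding names) and following DataForSEO's standard request envelope (an array of task objects, with results under tasks[0].result[0].items):

async function getBacklinks(target, options = {}, env) {
  // Illustrative secret names; use whatever bindings the worker already has
  const auth = btoa(`${env.DATAFORSEO_LOGIN}:${env.DATAFORSEO_PASSWORD}`);
  const payload = [{
    target,                                      // e.g. "spotify.com"
    limit: options.limit ?? 1000,
    order_by: options.order_by ?? ['rank,desc'],
    ...(options.filters && { filters: options.filters }),
    ...(options.offset && { offset: options.offset }),
  }];

  const res = await fetch('https://api.dataforseo.com/v3/backlinks/backlinks/live', {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${auth}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`DataForSEO request failed: ${res.status}`);

  const data = await res.json();
  // DataForSEO wraps results as tasks[].result[].items[]
  return data.tasks?.[0]?.result?.[0]?.items ?? [];
}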

1.2 Add storeBacklinks() function

For each backlink from the API (a storage sketch follows this list):

  1. Call ensureUrl(url_from) - stores in urls table with hash deduplication
  2. Call ensureUrl(url_to) - stores target URL
  3. Insert into backlinks table:
    • source_url_id → urls.id for url_from
    • target_url_id → urls.id for url_to
    • anchor_text → anchor
    • is_dofollow → dofollow
    • ref_domain_rank → domain_from_rank
    • discovered_at, last_seen_at timestamps
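
A sketch of storeBacklinks(), assuming ensureUrl(url, env) resolves to { id, created } (the real return shape may differ) and a unique index on (source_url_id, target_url_id):

async function storeBacklinks(items, targetDomain, env) {
  const now = Date.now();
  let urlsCreated = 0;

  for (const item of items) {
    const src = await ensureUrl(item.url_from, env);
    const dst = await ensureUrl(item.url_to, env);
    if (src.created) urlsCreated++;
    if (dst.created) urlsCreated++;

    // Assumes a unique index on (source_url_id, target_url_id): re-fetches
    // bump last_seen_at but preserve the original discovered_at (see Phase 4)
    await env.DB.prepare(`
      INSERT INTO backlinks
        (source_url_id, target_url_id, anchor_text, is_dofollow,
         ref_domain_rank, discovered_at, last_seen_at)
      VALUES (?, ?, ?, ?, ?, ?, ?)
      ON CONFLICT (source_url_id, target_url_id)
      DO UPDATE SET last_seen_at = excluded.last_seen_at
    `).bind(
      src.id, dst.id,
      item.anchor ?? null,
      item.dofollow ? 1 : 0,
      item.domain_from_rank ?? null,
      now, now
    ).run();
  }
  return { urls_created: urlsCreated, backlinks_created: items.length };
}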

1.3 Add admin endpoint

POST /api/admin/backlinks/fetch

{
  "target": "spotify.com",
  "limit": 1000,
  "order_by": ["rank,desc"]
}

Returns: { fetched: 1000, urls_created: 847, backlinks_created: 1000 }
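
A hypothetical handler shape for this endpoint, composing the two sketches above (actual routing should follow the existing admin endpoints):

async function handleFetchBacklinks(request, env) {
  const { target, limit = 1000, order_by = ['rank,desc'] } = await request.json();
  const items = await getBacklinks(target, { limit, order_by }, env);
  const { urls_created, backlinks_created } = await storeBacklinks(items, target, env);
  return Response.json({ fetched: items.length, urls_created, backlinks_created });
}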


Phase 2: URL Classification Queue

2.1 Add queueUnclassifiedUrls() function

Query:

SELECT u.id, u.url, u.domain, u.domain_id, d.domain_rank
FROM urls u
LEFT JOIN domains d ON u.domain_id = d.id
WHERE u.page_type IS NULL
  AND u.url_type = 'backlink_source' -- Only classify backlink sources
ORDER BY d.domain_rank DESC NULLS LAST
LIMIT ?

Queue message (a producer sketch follows the example):

{
  "type": "classify_url",
  "url_id": 12345,
  "url": "https://techcrunch.com/2024/01/15/startup-raises-10m/",
  "domain": "techcrunch.com",
  "domain_id": 678,
  "domain_rank": 92
}

domain_id is carried in the message so the Phase 3 bubble-up can run without an extra lookup.
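
A sketch of queueUnclassifiedUrls(), assuming a Queues producer binding named env.CLASSIFY_QUEUE (illustrative); sendBatch() accepts at most 100 messages per call, so the rows are chunked:

async function queueUnclassifiedUrls(options = {}, env) {
  const { results } = await env.DB.prepare(`
    SELECT u.id, u.url, u.domain, u.domain_id, d.domain_rank
    FROM urls u
    LEFT JOIN domains d ON u.domain_id = d.id
    WHERE u.page_type IS NULL
      AND u.url_type = 'backlink_source'
    ORDER BY d.domain_rank DESC NULLS LAST
    LIMIT ?
  `).bind(options.limit ?? 1000).all();

  // Queues caps batches at 100 messages, so chunk before sending
  for (let i = 0; i < results.length; i += 100) {
    const batch = results.slice(i, i + 100).map((row) => ({
      body: {
        type: 'classify_url',
        url_id: row.id,
        url: row.url,
        domain: row.domain,
        domain_id: row.domain_id,
        domain_rank: row.domain_rank,
      },
    }));
    await env.CLASSIFY_QUEUE.sendBatch(batch);
  }
  return { queued: results.length };
}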

2.2 Add handler for classify_url message type:

async function processClassifyUrl(body, env) {
  const { url_id, url, domain, domain_id, domain_rank } = body;

  // Run 4-stage classification (existing pipeline)
  const result = await classifyUrl({ url, domain, domain_rank }, {}, env);

  // Store classification on urls table
  await env.DB.prepare(`
    UPDATE urls SET
      page_type = ?,
      tactic_type = ?,
      channel_bucket = ?,
      media_type = ?,
      ownership_type = ?,
      quality_tier = ?,
      domain_rank = ?,
      modifiers = ?,
      classification_source = ?,
      classification_confidence = ?,
      classifier_version = ?,
      llm_reasoning = ?,
      updated_at = ?
    WHERE id = ?
  `).bind(
    result.classification.page_type,
    result.classification.tactic_type,
    result.classification.channel_bucket,
    result.classification.media_type,
    result.classification.ownership_type,
    result.classification.quality_tier,
    domain_rank,
    JSON.stringify(result.classification.modifiers || []),
    result.classification.llm_used ? 'llm' : 'rules_vectorize',
    result.final_confidence,
    2,
    result.classification.llm_reasoning || null,
    Date.now(),
    url_id
  ).run();
}

2.3 Add admin endpoint to trigger classification

POST /api/admin/backlinks/classify-urls

{
  "target": "spotify.com",   // Optional: only URLs linking to this target
  "limit": 1000,
  "min_domain_rank": 20      // Optional: prioritize high-authority domains
}

Phase 3: Domain Bubble-Up

3.1 Add aggregateDomainClassification() function

When a domain has sufficient classified URLs (e.g., 5+), compute aggregate:

async function aggregateDomainClassification(domainId, env) {
  // Get classification distribution for URLs from this domain
  const stats = await env.DB.prepare(`
    SELECT
      domain_type,
      channel_bucket,
      media_type,
      ownership_type,
      COUNT(*) as count,
      AVG(classification_confidence) as avg_confidence,
      AVG(domain_rank) as avg_rank
    FROM urls
    WHERE domain_id = ?
      AND page_type IS NOT NULL
    GROUP BY domain_type, channel_bucket, media_type, ownership_type
    ORDER BY count DESC
  `).bind(domainId).all();

  if (stats.results.length === 0) return null;

  // Majority vote with confidence weighting
  const topResult = stats.results[0];
  const totalUrls = stats.results.reduce((sum, r) => sum + r.count, 0);
  const dominance = topResult.count / totalUrls;

  // Only update domain if we have strong signal
  if (totalUrls >= 5 && dominance >= 0.5) {
    await env.DB.prepare(`
      UPDATE domains SET
        domain_type = ?,
        channel_bucket = ?,
        media_type = ?,
        ownership_type = ?,
        classification_source = 'aggregated',
        classification_confidence = ?,
        classifier_version = 2,
        updated_at = ?
      WHERE id = ?
    `).bind(
      topResult.domain_type,
      topResult.channel_bucket,
      topResult.media_type,
      topResult.ownership_type,
      dominance * topResult.avg_confidence,
      Date.now(),
      domainId
    ).run();

    return { domain_id: domainId, classification: topResult, url_count: totalUrls };
  }

  return null;
}

3.2 Trigger bubble-up after URL classification

In processClassifyUrl(), after storing URL classification:

// Check if we should update domain classification
const urlCount = await env.DB.prepare(`
  SELECT COUNT(*) as count FROM urls
  WHERE domain_id = ? AND page_type IS NOT NULL
`).bind(domain_id).first();

if (urlCount.count >= 5 && urlCount.count % 5 === 0) {
  // Every 5 new URL classifications, re-aggregate domain
  await aggregateDomainClassification(domain_id, env);
}

3.3 Add admin endpoint for manual bubble-up

POST /api/admin/backlinks/aggregate-domains

{
  "min_urls": 5,         // Minimum classified URLs required
  "min_dominance": 0.5,  // Minimum majority threshold
  "limit": 100           // Domains to process
}

Phase 4: Re-classification Handling

4.1 Skip already-classified URLs

In queue logic, skip URLs that already have classification:

WHERE u.page_type IS NULL
   OR (u.classification_confidence < 70 AND u.classifier_version < 2)

4.2 Handle URL updates

When we fetch backlinks again and see a URL we already have:

  • ensureUrl() already handles deduplication via hash
  • ON CONFLICT updates last_seen but preserves classification (illustrated below)
  • Only re-classify if classification_confidence < threshold
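
For illustration only, the upsert shape that gives this behavior (ensureUrl() already exists; column names here are assumptions about the urls schema, not a proposed change):

async function ensureUrlSketch(url, env) {
  const urlHash = await sha256Hex(url);  // hypothetical hash helper
  const now = Date.now();
  await env.DB.prepare(`
    INSERT INTO urls (url_hash, url, domain, url_type, last_seen_at)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (url_hash) DO UPDATE SET
      last_seen_at = excluded.last_seen_at
      -- no page_type / classification_* columns touched here, so an
      -- existing classification survives re-fetches untouched
  `).bind(urlHash, url, new URL(url).hostname, 'backlink_source', now).run();
}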

Track what we've fetched per target:

-- In referring_domains or new tracking table
last_backlinks_fetch_ts INTEGER,
backlinks_fetch_offset INTEGER

On subsequent fetches, use the stored offset or filter by first_seen date (sketched below).
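
A sketch of the incremental path, assuming referring_domains is keyed by domain and that getBacklinks() forwards filters/offset. DataForSEO's live endpoints accept an offset and a filters array of field/operator/value triples; the exact first_seen date format should be checked against the API docs:

const row = await env.DB.prepare(`
  SELECT last_backlinks_fetch_ts, backlinks_fetch_offset
  FROM referring_domains
  WHERE domain = ?
`).bind(target).first();

const options = row?.last_backlinks_fetch_ts
  // Incremental: only links first seen since the last fetch
  ? { filters: ['first_seen', '>', new Date(row.last_backlinks_fetch_ts).toISOString()] }
  // First pass (or no timestamp yet): page through by offset
  : { offset: row?.backlinks_fetch_offset ?? 0 };

const items = await getBacklinks(target, options, env);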


Database Changes Required

Migration: Add tracking columns

-- Track URL source type
ALTER TABLE urls ADD COLUMN url_type TEXT; -- 'backlink_source', 'backlink_target', 'serp_result', etc.

-- Already exists from migration 0079:
-- page_type, tactic_type, channel_bucket, media_type, ownership_type
-- quality_tier, domain_rank, classification_source, classification_confidence
-- classifier_version, llm_reasoning, modifiers

-- Add index for unclassified URL queries
CREATE INDEX IF NOT EXISTS idx_urls_unclassified
ON urls(page_type, url_type, domain_id)
WHERE page_type IS NULL;

-- Add index for domain aggregation
CREATE INDEX IF NOT EXISTS idx_urls_domain_classified
ON urls(domain_id, page_type, channel_bucket)
WHERE page_type IS NOT NULL;

Verify backlinks table schema

-- From migration 020, verify these columns exist:
-- id, source_url_id, target_url_id, anchor_text, discovered_at, last_seen_at

-- From migration 0079, should have:
-- referring_domain_id, target_domain_id, is_dofollow, link_strength
-- ref_domain_type, ref_channel_bucket, ref_media_type, ref_ownership_type
-- ref_page_type, ref_tactic_type, ref_quality_tier, ref_domain_rank
-- classification_source, classification_confidence, classifier_version

Cost Analysis

Current (Domain-only classification)

  • Referring domains API: $0.02 + $0.00003/domain
  • LLM fallback rate: ~70%
  • LLM cost per domain: ~$0.0001
  • Total for 1000 domains: $0.02 (base) + $0.03 (rows) + $0.07 (LLM) = $0.12

New (URL-based classification)

  • Backlinks API: $0.02 + $0.0001/backlink
  • For 1000 backlinks → 847 unique URLs (estimated)
  • URL classification:
    • Rules only (est. 40%): 339 URLs × $0 = $0
    • Rules + Vectorize (est. 35%): 296 URLs × $0.00001 = $0.003
    • Rules + Vectorize + Content (est. 20%): 169 URLs × $0.000125 = $0.02
    • Full pipeline with LLM (est. 5%): 42 URLs × $0.0002 = $0.008
  • Total for 1000 backlinks: $0.02 (base) + $0.10 (rows) + $0.031 (classification) ≈ $0.15

But: we also get 847 classified URLs that improve future classification and cut LLM calls for domain classification. Net ROI turns positive after roughly two fetch-and-classify rounds.


Files to Create/Modify

New Functions in src/lib/dataforseo-backlinks.js

  • getBacklinks(target, options, env) - Fetch from API
  • storeBacklinks(items, targetDomain, env) - Store in DB
  • fetchAndStoreBacklinks(target, options, env) - Combined flow

New Functions in src/lib/url-classification.js (NEW FILE)

  • queueUnclassifiedUrls(options, env) - Queue URLs for classification
  • aggregateDomainClassification(domainId, env) - Bubble up to domain
  • getUrlClassificationStats(domainId, env) - Stats for a domain's URLs

Modify queue consumer

  • Add classify_url message handler (processClassifyUrl)
  • Trigger domain aggregation after URL classification

Modify src/endpoints/admin-referring-domains.js

  • Add POST /fetch-backlinks endpoint
  • Add POST /classify-urls endpoint
  • Add POST /aggregate-domains endpoint
  • Add GET /url-classification-stats endpoint

New Migration

  • migrations/0095_url_classification_indexes.sql

Testing Plan

Unit Tests

  1. getBacklinks() returns correct structure from API
  2. storeBacklinks() deduplicates URLs correctly
  3. aggregateDomainClassification() computes majority correctly (see the test sketch after this list)
  4. Classification preserves existing data on re-fetch
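
A sketch of the majority-vote test (vitest assumed as the runner; import path illustrative), using a stubbed D1 binding instead of a live database:

import { describe, it, expect } from 'vitest';
import { aggregateDomainClassification } from '../src/lib/url-classification.js';

// Minimal D1 stub: every prepared statement returns the canned rows
function fakeDb(selectRows) {
  return {
    prepare: () => ({
      bind: () => ({
        all: async () => ({ results: selectRows }),
        run: async () => ({ success: true }),
      }),
    }),
  };
}

describe('aggregateDomainClassification', () => {
  it('picks the dominant classification when dominance >= 0.5', async () => {
    // Illustrative distribution: 6 of 8 URLs agree (dominance 0.75)
    const rows = [
      { domain_type: 'news', channel_bucket: 'editorial', media_type: 'article',
        ownership_type: 'independent', count: 6, avg_confidence: 80, avg_rank: 90 },
      { domain_type: 'blog', channel_bucket: 'owned', media_type: 'article',
        ownership_type: 'brand', count: 2, avg_confidence: 60, avg_rank: 40 },
    ];
    const result = await aggregateDomainClassification(1, { DB: fakeDb(rows) });
    expect(result.classification.domain_type).toBe('news');
    expect(result.url_count).toBe(8);
  });
});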

Integration Tests

  1. Fetch 100 backlinks for test domain
  2. Verify URLs created in urls table
  3. Verify backlinks created in backlinks table
  4. Trigger classification queue
  5. Verify URL classifications stored
  6. Verify domain bubble-up works

Manual Testing

# 1. Fetch backlinks
curl -X POST https://worker/api/admin/backlinks/fetch \
-d '{"target": "spotify.com", "limit": 100}'

# 2. Queue URL classification
curl -X POST https://worker/api/admin/backlinks/classify-urls \
-d '{"limit": 100}'

# 3. Check classification stats
curl "https://worker/api/admin/backlinks/url-stats?domain=techcrunch.com"

# 4. Trigger domain aggregation
curl -X POST https://worker/api/admin/backlinks/aggregate-domains \
-d '{"min_urls": 5}'

Rollout Plan

Step 1: Deploy infrastructure (no behavior change)

  • Add new functions to dataforseo-backlinks.js
  • Add new url-classification.js file
  • Add migration for indexes
  • Deploy

Step 2: Test with single target

  • Fetch 1000 backlinks for one test domain
  • Run URL classification
  • Verify results
  • Check LLM fallback rate (should be less than 30%)

Step 3: Backfill existing referring domains

  • For each domain in referring_domains table
  • Fetch top 100 backlinks by rank
  • Queue for URL classification
  • Aggregate to domain

Step 4: Integrate into regular flow

  • When fetching new referring domains, also fetch sample backlinks
  • Auto-queue URL classification
  • Auto-aggregate after threshold reached

Success Metrics

  1. LLM fallback rate: Should drop from ~70% to under 30%
  2. Classification confidence: Should increase from avg 55% to avg 75%
  3. Domain classification accuracy: Spot-check 100 domains, target 90%+ accuracy
  4. Cost per classification: Should stay under $0.0002 average