Backlink URL Classification Implementation Plan

The Problem

The current system classifies referring domains (e.g., techcrunch.com) directly, but a domain alone rarely carries enough context. Result: roughly 70% of classifications fall through to the expensive LLM stage.

What we have:

  • referring_domains table with domain-level aggregates
  • Classification pipeline (Rules → Vectorize → Content → LLM)
  • urls table with classification columns (but not populated)
  • backlinks table schema (empty, never used)
  • ensureUrl() function with hash-based deduplication

What's missing:

  • Code to fetch actual backlink URLs from DataForSEO backlinks/backlinks/live
  • Code to store backlinks in backlinks table
  • Code to classify individual URLs and store in urls table
  • Code to bubble up URL classifications to domain level

The Solution

Data Flow (New)

1. Fetch backlinks for target domain
└─> DataForSEO backlinks/backlinks/live ($0.02 + $0.0001/row)

2. Store each backlink
└─> Insert source URL into `urls` table (dedupe via hash)
└─> Insert backlink record into `backlinks` table

3. Classify unclassified URLs
└─> Queue job for each URL where page_type IS NULL
└─> Run 4-stage pipeline with FULL URL context (not just domain)
└─> Store classification on `urls` record

4. Bubble up to domain
└─> When a domain has N+ classified URLs, aggregate
└─> Majority vote / weighted average for domain_type, channel_bucket, etc.
└─> Update `domains` table with aggregated classification

Implementation Steps

Phase 1: Backlink Fetching

1.1 Add getBacklinks() function

Calls the backlinks/backlinks/live API; a fetch sketch follows the field list below. Returns individual backlink records with:

  • url_from - The page linking to the target (this is the gold: the page we actually classify)
  • url_to - Target page receiving the link
  • anchor - Anchor text (classification signal)
  • domain_from - Source domain
  • domain_from_rank - Domain authority
  • is_new, is_lost - Link status
  • page_from_rank - Page-level authority
  • dofollow - Link type
  • text_pre, text_post - Surrounding context (classification signal!)

Cost: $0.02 base + $0.0001 per backlink row
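
A minimal sketch of getBacklinks(), assuming credentials live in env.DATAFORSEO_LOGIN / env.DATAFORSEO_PASSWORD (illustrative binding names) and following DataForSEO's standard request envelope (an array of task objects, with results under tasks[0].result[0].items):

async function getBacklinks(target, options = {}, env) {
  // Illustrative secret names; use whatever bindings the worker already has
  const auth = btoa(`${env.DATAFORSEO_LOGIN}:${env.DATAFORSEO_PASSWORD}`);
  const payload = [{
    target,                                      // e.g. "spotify.com"
    limit: options.limit ?? 1000,
    order_by: options.order_by ?? ['rank,desc'],
    ...(options.filters && { filters: options.filters }),
    ...(options.offset && { offset: options.offset }),
  }];

  const res = await fetch('https://api.dataforseo.com/v3/backlinks/backlinks/live', {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${auth}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`DataForSEO request failed: ${res.status}`);

  const data = await res.json();
  // DataForSEO wraps results as tasks[].result[].items[]
  return data.tasks?.[0]?.result?.[0]?.items ?? [];
}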

1.2 Add storeBacklinks() function

For each backlink from the API (a storage sketch follows this list):

  1. Call ensureUrl(url_from) - stores in urls table with hash deduplication
  2. Call ensureUrl(url_to) - stores target URL
  3. Insert into backlinks table:
    • source_url_id → urls.id for url_from
    • target_url_id → urls.id for url_to
    • anchor_text → anchor
    • is_dofollow → dofollow
    • ref_domain_rank → domain_from_rank
    • discovered_at, last_seen_at timestamps
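
A sketch of storeBacklinks(), assuming ensureUrl(url, env) resolves to { id, created } (the real return shape may differ) and a unique index on (source_url_id, target_url_id):

async function storeBacklinks(items, targetDomain, env) {
  const now = Date.now();
  let urlsCreated = 0;

  for (const item of items) {
    const src = await ensureUrl(item.url_from, env);
    const dst = await ensureUrl(item.url_to, env);
    if (src.created) urlsCreated++;
    if (dst.created) urlsCreated++;

    // Assumes a unique index on (source_url_id, target_url_id): re-fetches
    // bump last_seen_at but preserve the original discovered_at (see Phase 4)
    await env.DB.prepare(`
      INSERT INTO backlinks
        (source_url_id, target_url_id, anchor_text, is_dofollow,
         ref_domain_rank, discovered_at, last_seen_at)
      VALUES (?, ?, ?, ?, ?, ?, ?)
      ON CONFLICT (source_url_id, target_url_id)
      DO UPDATE SET last_seen_at = excluded.last_seen_at
    `).bind(
      src.id, dst.id,
      item.anchor ?? null,
      item.dofollow ? 1 : 0,
      item.domain_from_rank ?? null,
      now, now
    ).run();
  }
  return { urls_created: urlsCreated, backlinks_created: items.length };
}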

1.3 Add admin endpoint

POST /api/admin/backlinks/fetch

{
  "target": "spotify.com",
  "limit": 1000,
  "order_by": ["rank,desc"]
}

Returns: { fetched: 1000, urls_created: 847, backlinks_created: 1000 }
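
A hypothetical handler shape for this endpoint, composing the two sketches above (actual routing should follow the existing admin endpoints):

async function handleFetchBacklinks(request, env) {
  const { target, limit = 1000, order_by = ['rank,desc'] } = await request.json();
  const items = await getBacklinks(target, { limit, order_by }, env);
  const { urls_created, backlinks_created } = await storeBacklinks(items, target, env);
  return Response.json({ fetched: items.length, urls_created, backlinks_created });
}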


Phase 2: URL Classification Queue

2.1 Add queueUnclassifiedUrls() function

Query:

SELECT u.id, u.url, u.domain, u.domain_id, d.domain_rank
FROM urls u
LEFT JOIN domains d ON u.domain_id = d.id
WHERE u.page_type IS NULL
  AND u.url_type = 'backlink_source' -- Only classify backlink sources
ORDER BY d.domain_rank DESC NULLS LAST
LIMIT ?

Queue message (a producer sketch follows the example):

{
  "type": "classify_url",
  "url_id": 12345,
  "url": "https://techcrunch.com/2024/01/15/startup-raises-10m/",
  "domain": "techcrunch.com",
  "domain_id": 678,
  "domain_rank": 92
}

domain_id is carried in the message so the Phase 3 bubble-up can run without an extra lookup.
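
A sketch of queueUnclassifiedUrls(), assuming a Queues producer binding named env.CLASSIFY_QUEUE (illustrative); sendBatch() accepts at most 100 messages per call, so the rows are chunked:

async function queueUnclassifiedUrls(options = {}, env) {
  const { results } = await env.DB.prepare(`
    SELECT u.id, u.url, u.domain, u.domain_id, d.domain_rank
    FROM urls u
    LEFT JOIN domains d ON u.domain_id = d.id
    WHERE u.page_type IS NULL
      AND u.url_type = 'backlink_source'
    ORDER BY d.domain_rank DESC NULLS LAST
    LIMIT ?
  `).bind(options.limit ?? 1000).all();

  // Queues caps batches at 100 messages, so chunk before sending
  for (let i = 0; i < results.length; i += 100) {
    const batch = results.slice(i, i + 100).map((row) => ({
      body: {
        type: 'classify_url',
        url_id: row.id,
        url: row.url,
        domain: row.domain,
        domain_id: row.domain_id,
        domain_rank: row.domain_rank,
      },
    }));
    await env.CLASSIFY_QUEUE.sendBatch(batch);
  }
  return { queued: results.length };
}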

2.2 Add handler for classify_url message type:

async function processClassifyUrl(body, env) {
  const { url_id, url, domain, domain_id, domain_rank } = body;

  // Run 4-stage classification (existing pipeline)
  const result = await classifyUrl({ url, domain, domain_rank }, {}, env);

  // Store classification on urls table
  await env.DB.prepare(`
    UPDATE urls SET
      page_type = ?,
      tactic_type = ?,
      channel_bucket = ?,
      media_type = ?,
      ownership_type = ?,
      quality_tier = ?,
      domain_rank = ?,
      modifiers = ?,
      classification_source = ?,
      classification_confidence = ?,
      classifier_version = ?,
      llm_reasoning = ?,
      updated_at = ?
    WHERE id = ?
  `).bind(
    result.classification.page_type,
    result.classification.tactic_type,
    result.classification.channel_bucket,
    result.classification.media_type,
    result.classification.ownership_type,
    result.classification.quality_tier,
    domain_rank,
    JSON.stringify(result.classification.modifiers || []),
    result.classification.llm_used ? 'llm' : 'rules_vectorize',
    result.final_confidence,
    2,
    result.classification.llm_reasoning || null,
    Date.now(),
    url_id
  ).run();
}

2.3 Add admin endpoint to trigger classification

POST /api/admin/backlinks/classify-urls

{
  "target": "spotify.com",   // Optional: only URLs linking to this target
  "limit": 1000,
  "min_domain_rank": 20      // Optional: prioritize high-authority domains
}

Phase 3: Domain Bubble-Up

3.1 Add aggregateDomainClassification() function

When a domain has sufficient classified URLs (e.g., 5+), compute aggregate:

async function aggregateDomainClassification(domainId, env) {
  // Get classification distribution for URLs from this domain
  const stats = await env.DB.prepare(`
    SELECT
      domain_type,
      channel_bucket,
      media_type,
      ownership_type,
      COUNT(*) as count,
      AVG(classification_confidence) as avg_confidence,
      AVG(domain_rank) as avg_rank
    FROM urls
    WHERE domain_id = ?
      AND page_type IS NOT NULL
    GROUP BY domain_type, channel_bucket, media_type, ownership_type
    ORDER BY count DESC
  `).bind(domainId).all();

  if (stats.results.length === 0) return null;

  // Majority vote with confidence weighting
  const topResult = stats.results[0];
  const totalUrls = stats.results.reduce((sum, r) => sum + r.count, 0);
  const dominance = topResult.count / totalUrls;

  // Only update domain if we have strong signal
  if (totalUrls >= 5 && dominance >= 0.5) {
    await env.DB.prepare(`
      UPDATE domains SET
        domain_type = ?,
        channel_bucket = ?,
        media_type = ?,
        ownership_type = ?,
        classification_source = 'aggregated',
        classification_confidence = ?,
        classifier_version = 2,
        updated_at = ?
      WHERE id = ?
    `).bind(
      topResult.domain_type,
      topResult.channel_bucket,
      topResult.media_type,
      topResult.ownership_type,
      dominance * topResult.avg_confidence,
      Date.now(),
      domainId
    ).run();

    return { domain_id: domainId, classification: topResult, url_count: totalUrls };
  }

  return null;
}

3.2 Trigger bubble-up after URL classification

In processClassifyUrl(), after storing URL classification:

// Check if we should update domain classification
const urlCount = await env.DB.prepare(`
  SELECT COUNT(*) as count FROM urls
  WHERE domain_id = ? AND page_type IS NOT NULL
`).bind(domain_id).first();

if (urlCount.count >= 5 && urlCount.count % 5 === 0) {
  // Every 5 new URL classifications, re-aggregate domain
  await aggregateDomainClassification(domain_id, env);
}

3.3 Add admin endpoint for manual bubble-up

POST /api/admin/backlinks/aggregate-domains

{
  "min_urls": 5,         // Minimum classified URLs required
  "min_dominance": 0.5,  // Minimum majority threshold
  "limit": 100           // Domains to process
}

Phase 4: Re-classification Handling

4.1 Skip already-classified URLs

In queue logic, skip URLs that already have classification:

WHERE u.page_type IS NULL
   OR (u.classification_confidence < 70 AND u.classifier_version < 2)

4.2 Handle URL updates

When we fetch backlinks again and see a URL we already have:

  • ensureUrl() already handles deduplication via hash
  • ON CONFLICT updates last_seen but preserves classification (illustrated below)
  • Only re-classify if classification_confidence < threshold
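
For illustration only, the upsert shape that gives this behavior (ensureUrl() already exists; column names here are assumptions about the urls schema, not a proposed change):

async function ensureUrlSketch(url, env) {
  const urlHash = await sha256Hex(url);  // hypothetical hash helper
  const now = Date.now();
  await env.DB.prepare(`
    INSERT INTO urls (url_hash, url, domain, url_type, last_seen_at)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (url_hash) DO UPDATE SET
      last_seen_at = excluded.last_seen_at
      -- no page_type / classification_* columns touched here, so an
      -- existing classification survives re-fetches untouched
  `).bind(urlHash, url, new URL(url).hostname, 'backlink_source', now).run();
}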

Track what we've fetched per target:

-- In referring_domains or new tracking table
last_backlinks_fetch_ts INTEGER,
backlinks_fetch_offset INTEGER

On subsequent fetches, use the stored offset or filter by first_seen date (sketched below).
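
A sketch of the incremental path, assuming referring_domains is keyed by domain and that getBacklinks() forwards filters/offset. DataForSEO's live endpoints accept an offset and a filters array of field/operator/value triples; the exact first_seen date format should be checked against the API docs:

const row = await env.DB.prepare(`
  SELECT last_backlinks_fetch_ts, backlinks_fetch_offset
  FROM referring_domains
  WHERE domain = ?
`).bind(target).first();

const options = row?.last_backlinks_fetch_ts
  // Incremental: only links first seen since the last fetch
  ? { filters: ['first_seen', '>', new Date(row.last_backlinks_fetch_ts).toISOString()] }
  // First pass (or no timestamp yet): page through by offset
  : { offset: row?.backlinks_fetch_offset ?? 0 };

const items = await getBacklinks(target, options, env);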


Database Changes Required

Migration: Add tracking columns

-- Track URL source type
ALTER TABLE urls ADD COLUMN url_type TEXT; -- 'backlink_source', 'backlink_target', 'serp_result', etc.

-- Already exists from migration 0079:
-- page_type, tactic_type, channel_bucket, media_type, ownership_type
-- quality_tier, domain_rank, classification_source, classification_confidence
-- classifier_version, llm_reasoning, modifiers

-- Add index for unclassified URL queries
CREATE INDEX IF NOT EXISTS idx_urls_unclassified
ON urls(page_type, url_type, domain_id)
WHERE page_type IS NULL;

-- Add index for domain aggregation
CREATE INDEX IF NOT EXISTS idx_urls_domain_classified
ON urls(domain_id, page_type, channel_bucket)
WHERE page_type IS NOT NULL;

Verify backlinks table schema

-- From migration 020, verify these columns exist:
-- id, source_url_id, target_url_id, anchor_text, discovered_at, last_seen_at

-- From migration 0079, should have:
-- referring_domain_id, target_domain_id, is_dofollow, link_strength
-- ref_domain_type, ref_channel_bucket, ref_media_type, ref_ownership_type
-- ref_page_type, ref_tactic_type, ref_quality_tier, ref_domain_rank
-- classification_source, classification_confidence, classifier_version

Cost Analysis

Current (Domain-only classification)

  • Referring domains API: $0.02 + $0.00003/domain
  • LLM fallback rate: ~70%
  • LLM cost per domain: ~$0.0001
  • Total for 1000 domains: $0.02 (base) + $0.03 (rows) + $0.07 (LLM) = $0.12

New (URL-based classification)

  • Backlinks API: $0.02 + $0.0001/backlink
  • For 1000 backlinks → 847 unique URLs (estimated)
  • URL classification:
    • Rules only (est. 40%): 339 URLs × $0 = $0
    • Rules + Vectorize (est. 35%): 296 URLs × $0.00001 = $0.003
    • Rules + Vectorize + Content (est. 20%): 169 URLs × $0.000125 = $0.02
    • Full pipeline with LLM (est. 5%): 42 URLs × $0.0002 = $0.008
  • Total for 1000 backlinks: $0.02 (base) + $0.10 (rows) + $0.031 (classification) ≈ $0.15

But: we also get 847 classified URLs that improve future classification and cut LLM calls for domain classification. Net ROI turns positive after roughly two fetch-and-classify rounds.


Files to Create/Modify

New Functions in src/lib/dataforseo-backlinks.js

  • getBacklinks(target, options, env) - Fetch from API
  • storeBacklinks(items, targetDomain, env) - Store in DB
  • fetchAndStoreBacklinks(target, options, env) - Combined flow

New Functions in src/lib/url-classification.js (NEW FILE)

  • queueUnclassifiedUrls(options, env) - Queue URLs for classification
  • aggregateDomainClassification(domainId, env) - Bubble up to domain
  • getUrlClassificationStats(domainId, env) - Stats for a domain's URLs

Modify queue consumer

  • Add classify_url message handler (processClassifyUrl)
  • Trigger domain aggregation after URL classification

Modify src/endpoints/admin-referring-domains.js

  • Add POST /fetch-backlinks endpoint
  • Add POST /classify-urls endpoint
  • Add POST /aggregate-domains endpoint
  • Add GET /url-classification-stats endpoint

New Migration

  • migrations/0095_url_classification_indexes.sql

Testing Plan

Unit Tests

  1. getBacklinks() returns correct structure from API
  2. storeBacklinks() deduplicates URLs correctly
  3. aggregateDomainClassification() computes majority correctly (see the test sketch after this list)
  4. Classification preserves existing data on re-fetch
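
A sketch of the majority-vote test (vitest assumed as the runner; import path illustrative), using a stubbed D1 binding instead of a live database:

import { describe, it, expect } from 'vitest';
import { aggregateDomainClassification } from '../src/lib/url-classification.js';

// Minimal D1 stub: every prepared statement returns the canned rows
function fakeDb(selectRows) {
  return {
    prepare: () => ({
      bind: () => ({
        all: async () => ({ results: selectRows }),
        run: async () => ({ success: true }),
      }),
    }),
  };
}

describe('aggregateDomainClassification', () => {
  it('picks the dominant classification when dominance >= 0.5', async () => {
    // Illustrative distribution: 6 of 8 URLs agree (dominance 0.75)
    const rows = [
      { domain_type: 'news', channel_bucket: 'editorial', media_type: 'article',
        ownership_type: 'independent', count: 6, avg_confidence: 80, avg_rank: 90 },
      { domain_type: 'blog', channel_bucket: 'owned', media_type: 'article',
        ownership_type: 'brand', count: 2, avg_confidence: 60, avg_rank: 40 },
    ];
    const result = await aggregateDomainClassification(1, { DB: fakeDb(rows) });
    expect(result.classification.domain_type).toBe('news');
    expect(result.url_count).toBe(8);
  });
});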

Integration Tests

  1. Fetch 100 backlinks for test domain
  2. Verify URLs created in urls table
  3. Verify backlinks created in backlinks table
  4. Trigger classification queue
  5. Verify URL classifications stored
  6. Verify domain bubble-up works

Manual Testing

# 1. Fetch backlinks
curl -X POST https://worker/api/admin/backlinks/fetch \
-d '{"target": "spotify.com", "limit": 100}'

# 2. Queue URL classification
curl -X POST https://worker/api/admin/backlinks/classify-urls \
-d '{"limit": 100}'

# 3. Check classification stats
curl "https://worker/api/admin/backlinks/url-stats?domain=techcrunch.com"

# 4. Trigger domain aggregation
curl -X POST https://worker/api/admin/backlinks/aggregate-domains \
-d '{"min_urls": 5}'

Rollout Plan

Step 1: Deploy infrastructure (no behavior change)

  • Add new functions to dataforseo-backlinks.js
  • Add new url-classification.js file
  • Add migration for indexes
  • Deploy

Step 2: Test with single target

  • Fetch 1000 backlinks for one test domain
  • Run URL classification
  • Verify results
  • Check LLM fallback rate (should be less than 30%)

Step 3: Backfill existing referring domains

  • For each domain in referring_domains table
  • Fetch top 100 backlinks by rank
  • Queue for URL classification
  • Aggregate to domain

Step 4: Integrate into regular flow

  • When fetching new referring domains, also fetch sample backlinks
  • Auto-queue URL classification
  • Auto-aggregate after threshold reached

Success Metrics

  1. LLM fallback rate: Should drop from ~70% to under 30%
  2. Classification confidence: Should increase from avg 55% to avg 75%
  3. Domain classification accuracy: Spot-check 100 domains, target 90%+ accuracy
  4. Cost per classification: Should stay under $0.0002 average