Backlink URL Classification Implementation Plan
The Problem
The current system classifies referring domains (e.g., techcrunch.com) directly, but a domain alone doesn't provide enough context. Result: roughly 70% of classifications fall through to the expensive LLM stage.
What we have:
- `referring_domains` table with domain-level aggregates
- Classification pipeline (Rules → Vectorize → Content → LLM)
- `urls` table with classification columns (but not populated)
- `backlinks` table schema (empty, never used)
- `ensureUrl()` function with hash-based deduplication
What's missing:
- Code to fetch actual backlink URLs from DataForSEO `backlinks/backlinks/live`
- Code to store backlinks in the `backlinks` table
- Code to classify individual URLs and store results in the `urls` table
- Code to bubble up URL classifications to the domain level
The Solution
Data Flow (New)
1. Fetch backlinks for target domain
└─> DataForSEO backlinks/backlinks/live ($0.02 + $0.0001/row)
2. Store each backlink
└─> Insert source URL into `urls` table (dedupe via hash)
└─> Insert backlink record into `backlinks` table
3. Classify unclassified URLs
└─> Queue job for each URL where page_type IS NULL
└─> Run 4-stage pipeline with FULL URL context (not just domain)
└─> Store classification on `urls` record
4. Bubble up to domain
└─> When domain has N+ classified URLs, aggregate
└─> Majority vote / weighted average for domain_type, channel_bucket, etc.
└─> Update `domains` table with aggregated classification
Implementation Steps
Phase 1: Fetch & Store Backlinks
1.1 Add getBacklinks() function to dataforseo-backlinks.js
Calls the `backlinks/backlinks/live` API. Returns individual backlink records with:
- `url_from` - the page linking to the target (THIS IS THE GOLD)
- `url_to` - target page receiving the link
- `anchor` - anchor text (classification signal)
- `domain_from` - source domain
- `domain_from_rank` - domain authority
- `is_new`, `is_lost` - link status
- `page_from_rank` - page-level authority
- `dofollow` - link type
- `text_pre`, `text_post` - surrounding context (classification signal!)
Cost: $0.02 base + $0.0001 per backlink row
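A minimal sketch of `getBacklinks()`, assuming credentials live in a `DATAFORSEO_AUTH` secret holding a base64-encoded `login:password` pair (the endpoint path and payload fields follow DataForSEO's documented API; the `mode` and `offset` defaults are our choices):

async function getBacklinks(target, options = {}, env) {
  const payload = [{
    target,
    limit: options.limit || 1000,
    offset: options.offset || 0,
    order_by: options.order_by || ['rank,desc'],
    mode: 'as_is', // one row per individual backlink
  }];

  const res = await fetch('https://api.dataforseo.com/v3/backlinks/backlinks/live', {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${env.DATAFORSEO_AUTH}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`DataForSEO request failed: ${res.status}`);

  const data = await res.json();
  // Items carry url_from, url_to, anchor, domain_from, domain_from_rank,
  // page_from_rank, dofollow, is_new, is_lost, text_pre, text_post, ...
  return data.tasks?.[0]?.result?.[0]?.items || [];
}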
1.2 Add storeBacklinks() function
For each backlink from the API (a sketch follows this list):
- Call `ensureUrl(url_from)` - stores in `urls` table with hash deduplication
- Call `ensureUrl(url_to)` - stores the target URL
- Insert into `backlinks` table:
  - `source_url_id` → `urls.id` for url_from
  - `target_url_id` → `urls.id` for url_to
  - `anchor_text` → anchor
  - `is_dofollow` → dofollow
  - `ref_domain_rank` → domain_from_rank
  - `discovered_at`, `last_seen_at` timestamps
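A sketch of `storeBacklinks()` under two assumptions worth verifying against the actual schema: `ensureUrl()` returns `{ id, created }` for the deduplicated row, and `backlinks` has a unique constraint on `(source_url_id, target_url_id)`:

async function storeBacklinks(items, targetDomain, env) {
  let urlsCreated = 0;
  const now = Date.now();

  for (const item of items) {
    // ensureUrl() is assumed to return { id, created } after hash dedupe
    const source = await ensureUrl(item.url_from, env);
    const target = await ensureUrl(item.url_to, env);
    urlsCreated += (source.created ? 1 : 0) + (target.created ? 1 : 0);

    // targetDomain is available here if target_domain_id needs filling
    await env.DB.prepare(`
      INSERT INTO backlinks (
        source_url_id, target_url_id, anchor_text,
        is_dofollow, ref_domain_rank, discovered_at, last_seen_at
      ) VALUES (?, ?, ?, ?, ?, ?, ?)
      ON CONFLICT (source_url_id, target_url_id)
      DO UPDATE SET last_seen_at = excluded.last_seen_at
    `).bind(
      source.id, target.id, item.anchor,
      item.dofollow ? 1 : 0, item.domain_from_rank, now, now
    ).run();
  }

  return { urls_created: urlsCreated, backlinks_created: items.length };
}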
1.3 Add admin endpoint
POST /api/admin/backlinks/fetch
{
"target": "spotify.com",
"limit": 1000,
"order_by": ["rank,desc"]
}
Returns: { fetched: 1000, urls_created: 847, backlinks_created: 1000 }
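A hedged sketch of the handler (the `fetchAndStoreBacklinks()` signature matches the Files to Create/Modify list below; route wiring and auth are elided):

export async function handleFetchBacklinks(request, env) {
  const { target, limit = 1000, order_by } = await request.json();
  if (!target) {
    return Response.json({ error: 'target is required' }, { status: 400 });
  }

  // Combined fetch + store flow from 1.1 and 1.2
  const result = await fetchAndStoreBacklinks(target, { limit, order_by }, env);
  return Response.json(result); // { fetched, urls_created, backlinks_created }
}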
Phase 2: URL Classification Queue
2.1 Add queueUnclassifiedUrls() function
Query:
SELECT u.id, u.url, u.domain, u.domain_id, d.domain_rank
FROM urls u
LEFT JOIN domains d ON u.domain_id = d.id
WHERE u.page_type IS NULL
AND u.url_type = 'backlink_source' -- Only classify backlink sources
ORDER BY d.domain_rank DESC NULLS LAST
LIMIT ?
Queue message:
{
  "type": "classify_url",
  "url_id": 12345,
  "url": "https://techcrunch.com/2024/01/15/startup-raises-10m/",
  "domain": "techcrunch.com",
  "domain_id": 678,
  "domain_rank": 92
}
`domain_id` rides along so the bubble-up in 3.2 can re-aggregate without an extra lookup (the value shown is illustrative).
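A sketch tying the query and message together (the `CLASSIFY_QUEUE` binding name is an assumption; Cloudflare Queues caps `sendBatch()` at 100 messages per call):

async function queueUnclassifiedUrls(options = {}, env) {
  const { results } = await env.DB.prepare(`
    SELECT u.id, u.url, u.domain, u.domain_id, d.domain_rank
    FROM urls u
    LEFT JOIN domains d ON u.domain_id = d.id
    WHERE u.page_type IS NULL
      AND u.url_type = 'backlink_source'
    ORDER BY d.domain_rank DESC NULLS LAST
    LIMIT ?
  `).bind(options.limit || 1000).all();

  // Batch in chunks of 100, the sendBatch() ceiling
  for (let i = 0; i < results.length; i += 100) {
    const batch = results.slice(i, i + 100).map((row) => ({
      body: {
        type: 'classify_url',
        url_id: row.id,
        url: row.url,
        domain: row.domain,
        domain_id: row.domain_id,
        domain_rank: row.domain_rank,
      },
    }));
    await env.CLASSIFY_QUEUE.sendBatch(batch);
  }
  return { queued: results.length };
}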
2.2 Modify backlink-classify-consumer.js
Add handler for classify_url message type:
async function processClassifyUrl(body, env) {
const { url_id, url, domain, domain_rank, domain_id } = body;
// Run 4-stage classification (existing pipeline)
const result = await classifyUrl({ url, domain, domain_rank }, {}, env);
// Store classification on urls table
await env.DB.prepare(`
UPDATE urls SET
page_type = ?,
tactic_type = ?,
channel_bucket = ?,
media_type = ?,
ownership_type = ?,
quality_tier = ?,
domain_rank = ?,
modifiers = ?,
classification_source = ?,
classification_confidence = ?,
classifier_version = ?,
llm_reasoning = ?,
updated_at = ?
WHERE id = ?
`).bind(
result.classification.page_type,
result.classification.tactic_type,
result.classification.channel_bucket,
result.classification.media_type,
result.classification.ownership_type,
result.classification.quality_tier,
domain_rank,
JSON.stringify(result.classification.modifiers || []),
result.classification.llm_used ? 'llm' : 'rules_vectorize',
result.final_confidence,
2,
result.classification.llm_reasoning || null,
Date.now(),
url_id
).run();
}
2.3 Add admin endpoint to trigger classification
POST /api/admin/backlinks/classify-urls
{
"target": "spotify.com", // Optional: only URLs linking to this target
"limit": 1000,
"min_domain_rank": 20 // Optional: prioritize high-authority domains
}
Phase 3: Domain Bubble-Up
3.1 Add aggregateDomainClassification() function
When a domain has sufficient classified URLs (e.g., 5+), compute aggregate:
async function aggregateDomainClassification(domainId, env) {
// Get classification distribution for URLs from this domain
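  // NOTE: assumes urls carries a domain_type column; migration 0079 only
  // lists page_type, so map page_type → domain_type here if that's the case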
const stats = await env.DB.prepare(`
SELECT
domain_type,
channel_bucket,
media_type,
ownership_type,
COUNT(*) as count,
AVG(classification_confidence) as avg_confidence,
AVG(domain_rank) as avg_rank
FROM urls
WHERE domain_id = ?
AND page_type IS NOT NULL
GROUP BY domain_type, channel_bucket, media_type, ownership_type
ORDER BY count DESC
`).bind(domainId).all();
if (stats.results.length === 0) return null;
// Majority vote with confidence weighting
const topResult = stats.results[0];
const totalUrls = stats.results.reduce((sum, r) => sum + r.count, 0);
const dominance = topResult.count / totalUrls;
// Only update domain if we have strong signal
if (totalUrls >= 5 && dominance >= 0.5) {
await env.DB.prepare(`
UPDATE domains SET
domain_type = ?,
channel_bucket = ?,
media_type = ?,
ownership_type = ?,
classification_source = 'aggregated',
classification_confidence = ?,
classifier_version = 2,
updated_at = ?
WHERE id = ?
`).bind(
topResult.domain_type,
topResult.channel_bucket,
topResult.media_type,
topResult.ownership_type,
dominance * topResult.avg_confidence,
Date.now(),
domainId
).run();
return { domain_id: domainId, classification: topResult, url_count: totalUrls };
}
return null;
}
3.2 Trigger bubble-up after URL classification
In processClassifyUrl(), after storing URL classification:
// Check if we should update domain classification
const urlCount = await env.DB.prepare(`
SELECT COUNT(*) as count FROM urls
WHERE domain_id = ? AND page_type IS NOT NULL
`).bind(domain_id).first();
if (urlCount.count >= 5 && urlCount.count % 5 === 0) {
// Every 5 new URL classifications, re-aggregate domain
await aggregateDomainClassification(domain_id, env);
}
3.3 Add admin endpoint for manual bubble-up
POST /api/admin/backlinks/aggregate-domains
{
"min_urls": 5, // Minimum classified URLs required
"min_dominance": 0.5, // Minimum majority threshold
"limit": 100 // Domains to process
}
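A sketch of the handler (selecting candidates via `HAVING` is an assumption; note that `min_dominance` would need to be threaded into `aggregateDomainClassification()`, which hardcodes 0.5 above):

export async function handleAggregateDomains(request, env) {
  const { min_urls = 5, min_dominance = 0.5, limit = 100 } = await request.json();

  // Domains with enough classified URLs to be worth aggregating
  const { results } = await env.DB.prepare(`
    SELECT domain_id, COUNT(*) as url_count
    FROM urls
    WHERE page_type IS NOT NULL AND domain_id IS NOT NULL
    GROUP BY domain_id
    HAVING COUNT(*) >= ?
    LIMIT ?
  `).bind(min_urls, limit).all();

  const updated = [];
  for (const row of results) {
    // min_dominance is hardcoded (0.5) inside aggregateDomainClassification();
    // pass it through if the threshold should be tunable per request
    const agg = await aggregateDomainClassification(row.domain_id, env);
    if (agg) updated.push(agg);
  }
  return Response.json({ candidates: results.length, updated: updated.length });
}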
Phase 4: Re-classification Handling
4.1 Skip already-classified URLs
In the queue logic, skip URLs that already have a classification:
WHERE u.page_type IS NULL
OR (u.classification_confidence < 70 AND u.classifier_version < 2)
4.2 Handle URL updates
When we fetch backlinks again and see a URL we already have:
- `ensureUrl()` already handles deduplication via hash
- `ON CONFLICT` updates `last_seen_at` but preserves classification (see the sketch below)
- Only re-classify if `classification_confidence < threshold`
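A sketch of the upsert inside `ensureUrl()` that yields this behavior (the exact `urls` column set and the unique `url_hash` constraint are assumptions):

-- Refresh last_seen_at on re-fetch; classification columns are untouched
INSERT INTO urls (url, url_hash, domain, url_type, discovered_at, last_seen_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (url_hash)
DO UPDATE SET last_seen_at = excluded.last_seen_at;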
4.3 Incremental backlink fetching
Track what we've fetched per target:
-- In referring_domains or new tracking table
last_backlinks_fetch_ts INTEGER,
backlinks_fetch_offset INTEGER
On subsequent fetches, use offset or filter by first_seen date.
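A sketch of the incremental flow, assuming `referring_domains` is keyed by a `domain` column and that `getBacklinks()` forwards `offset` to the API:

async function fetchBacklinksIncremental(target, env) {
  const row = await env.DB.prepare(`
    SELECT backlinks_fetch_offset FROM referring_domains WHERE domain = ?
  `).bind(target).first();
  const offset = row?.backlinks_fetch_offset || 0;

  const items = await getBacklinks(target, { limit: 1000, offset }, env);
  await storeBacklinks(items, target, env);

  // Advance the cursor so the next run picks up where this one stopped
  await env.DB.prepare(`
    UPDATE referring_domains
    SET backlinks_fetch_offset = ?, last_backlinks_fetch_ts = ?
    WHERE domain = ?
  `).bind(offset + items.length, Date.now(), target).run();

  return { fetched: items.length, next_offset: offset + items.length };
}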
Database Changes Required
Migration: Add tracking columns
-- Track URL source type
ALTER TABLE urls ADD COLUMN url_type TEXT; -- 'backlink_source', 'backlink_target', 'serp_result', etc.
-- Already exists from migration 0079:
-- page_type, tactic_type, channel_bucket, media_type, ownership_type
-- quality_tier, domain_rank, classification_source, classification_confidence
-- classifier_version, llm_reasoning, modifiers
-- Add index for unclassified URL queries
CREATE INDEX IF NOT EXISTS idx_urls_unclassified
ON urls(page_type, url_type, domain_id)
WHERE page_type IS NULL;
-- Add index for domain aggregation
CREATE INDEX IF NOT EXISTS idx_urls_domain_classified
ON urls(domain_id, page_type, channel_bucket)
WHERE page_type IS NOT NULL;
Verify backlinks table schema
-- From migration 020, verify these columns exist:
-- id, source_url_id, target_url_id, anchor_text, discovered_at, last_seen_at
-- From migration 0079, should have:
-- referring_domain_id, target_domain_id, is_dofollow, link_strength
-- ref_domain_type, ref_channel_bucket, ref_media_type, ref_ownership_type
-- ref_page_type, ref_tactic_type, ref_quality_tier, ref_domain_rank
-- classification_source, classification_confidence, classifier_version
Cost Analysis
Current (Domain-only classification)
- Referring domains API: $0.02 + $0.00003/domain
- LLM fallback rate: ~70%
- LLM cost per domain: ~$0.0001
- Total for 1000 domains: $0.02 + $0.03 + $0.07 = $0.12
New (URL-based classification)
- Backlinks API: $0.02 + $0.0001/backlink
- For 1000 backlinks → 847 unique URLs (estimated)
- URL classification:
- Rules only (est. 40%): 339 URLs × $0 = $0
- Rules + Vectorize (est. 35%): 296 URLs × $0.00001 = $0.003
- Rules + Vectorize + Content (est. 20%): 169 URLs × $0.000125 = $0.02
- Full pipeline with LLM (est. 5%): 42 URLs × $0.0002 = $0.008
- Total for 1000 backlinks: $0.02 + $0.10 + $0.031 = $0.15
BUT: We get 847 classified URLs that improve future classification AND reduce LLM calls for domain classification. Net ROI is positive after ~2 rounds.
Files to Create/Modify
New Functions in src/lib/dataforseo-backlinks.js
- `getBacklinks(target, options, env)` - fetch from API
- `storeBacklinks(items, targetDomain, env)` - store in DB
- `fetchAndStoreBacklinks(target, options, env)` - combined flow
New Functions in src/lib/url-classification.js (NEW FILE)
- `queueUnclassifiedUrls(options, env)` - queue URLs for classification
- `aggregateDomainClassification(domainId, env)` - bubble up to domain
- `getUrlClassificationStats(domainId, env)` - stats for a domain's URLs
Modify src/queue/backlink-classify-consumer.js
- Add `classify_url` message handler
- Trigger domain aggregation after URL classification
Modify src/endpoints/admin-referring-domains.js
- Add `POST /fetch-backlinks` endpoint
- Add `POST /classify-urls` endpoint
- Add `POST /aggregate-domains` endpoint
- Add `GET /url-classification-stats` endpoint
New Migration
migrations/0095_url_classification_indexes.sql
Testing Plan
Unit Tests
- `getBacklinks()` returns correct structure from API
- `storeBacklinks()` deduplicates URLs correctly
- `aggregateDomainClassification()` computes majority correctly (sample test below)
- Classification preserves existing data on re-fetch
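For example, the majority-vote math in `aggregateDomainClassification()` can be exercised without a real database by stubbing D1's `prepare().bind().all()/run()` chain (vitest and the import path are assumptions):

import { describe, it, expect } from 'vitest';
import { aggregateDomainClassification } from '../src/lib/url-classification.js';

// Minimal D1 stub: every query returns the canned rows
function stubDb(rows) {
  return {
    prepare: () => ({
      bind: () => ({
        all: async () => ({ results: rows }),
        run: async () => ({}),
      }),
    }),
  };
}

describe('aggregateDomainClassification', () => {
  it('updates the domain when one bucket dominates', async () => {
    const rows = [
      { domain_type: 'news', channel_bucket: 'editorial', media_type: 'article',
        ownership_type: 'earned', count: 4, avg_confidence: 80, avg_rank: 90 },
      { domain_type: 'blog', channel_bucket: 'ugc', media_type: 'article',
        ownership_type: 'earned', count: 1, avg_confidence: 60, avg_rank: 40 },
    ];
    const result = await aggregateDomainClassification(1, { DB: stubDb(rows) });
    expect(result.url_count).toBe(5); // 4 + 1 classified URLs
    expect(result.classification.domain_type).toBe('news'); // 80% dominance
  });
});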
Integration Tests
- Fetch 100 backlinks for test domain
- Verify URLs created in `urls` table
- Verify backlinks created in `backlinks` table
- Trigger classification queue
- Verify URL classifications stored
- Verify domain bubble-up works
Manual Testing
# 1. Fetch backlinks
curl -X POST https://worker/api/admin/backlinks/fetch \
-d '{"target": "spotify.com", "limit": 100}'
# 2. Queue URL classification
curl -X POST https://worker/api/admin/backlinks/classify-urls \
-d '{"limit": 100}'
# 3. Check classification stats
curl "https://worker/api/admin/backlinks/url-stats?domain=techcrunch.com"
# 4. Trigger domain aggregation
curl -X POST https://worker/api/admin/backlinks/aggregate-domains \
-d '{"min_urls": 5}'
Rollout Plan
Step 1: Deploy infrastructure (no behavior change)
- Add new functions to `dataforseo-backlinks.js`
- Add new `url-classification.js` file
- Add migration for indexes
- Deploy
Step 2: Test with single target
- Fetch 1000 backlinks for one test domain
- Run URL classification
- Verify results
- Check LLM fallback rate (should be less than 30%)
Step 3: Backfill existing referring domains
- For each domain in `referring_domains` table
- Fetch top 100 backlinks by rank
- Queue for URL classification
- Aggregate to domain
Step 4: Integrate into regular flow
- When fetching new referring domains, also fetch sample backlinks
- Auto-queue URL classification
- Auto-aggregate after threshold reached
Success Metrics
- LLM fallback rate: Should drop from ~70% to under 30%
- Classification confidence: Should increase from avg 55% to avg 75%
- Domain classification accuracy: Spot-check 100 domains, target 90%+ accuracy
- Cost per classification: Should stay under $0.0002 average