
Backlink Intelligence System

Marketing DNA and backlink profile analysis using the DataForSEO Backlinks API.


Overview

The backlink intelligence system provides:

  • Domain Summary Tracking - Weekly snapshots of backlink metrics ($0.02/request)
  • Referring Domains - Detailed per-domain backlink data ($0.02 + $0.00003/row)
  • Classification Pipeline - Domain/URL type classification (news, blog, affiliate, etc.)
  • OEPS Classification - Owned/Earned/Paid/Shared media type determination
  • Time-Series Analytics - Track backlink growth over time (like app rankings)

Architecture

┌─────────────────────────────────────────────────────────────────┐
│ Backlink Intelligence │
├─────────────────────────────────────────────────────────────────┤
│ Weekly Cron │
│ └─> For each tracked domain: │
│ └─> DataForSEO Summary ($0.02) │
│ └─> Store domain_summaries snapshot │
│ │
│ On-Demand Deep Pull │
│ └─> DataForSEO Referring Domains ($0.02 + $0.00003/row) │
│ └─> Store referring_domains (individual domain details) │
│ └─> Queue classification jobs │
└─────────────────────────────────────────────────────────────────┘

Data Model

Weekly Snapshots: domain_summaries

Denormalized weekly snapshots (like app_category_rankings).

CREATE TABLE domain_summaries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  target_domain TEXT NOT NULL,
  year_week INTEGER NOT NULL, -- YYYYWW format (e.g., 202450)

  -- Core metrics from DataForSEO
  rank INTEGER,
  backlinks_count INTEGER,
  referring_domains_count INTEGER,
  spam_score REAL,

  -- Platform type counts (from DataForSEO signals)
  count_news INTEGER,
  count_blogs INTEGER,
  count_ecommerce INTEGER,
  count_forums INTEGER,
  count_social INTEGER,

  -- OEPS media type counts (our classification)
  count_owned INTEGER,
  count_earned INTEGER,
  count_paid INTEGER,
  count_shared INTEGER,

  -- Previous period tracking
  prev_period_backlinks INTEGER,
  prev_period_referring_domains INTEGER,
  prev_period_year_week INTEGER,

  UNIQUE(target_domain, year_week)
);

Key Points:

  • One row per domain per week
  • year_week in YYYYWW format (e.g., 202450 = week 50 of 2024)
  • Previous period columns enable delta calculations
  • Platform type counts from DataForSEO's referring_links_platform_types
  • OEPS counts populated by our classification pipeline
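The year_week key can be derived from a date with a small helper. The sketch below assumes ISO 8601 week numbering computed in UTC, which the schema does not confirm; `toYearWeek` is an illustrative name, not a function from the codebase.

```javascript
// Hypothetical helper: convert a Date to the YYYYWW integer used by year_week.
// Assumption: weeks follow ISO 8601 (Monday start, week 1 contains the first Thursday).
function toYearWeek(date) {
  // Work on a UTC copy so the local time zone cannot shift the day.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayOfWeek = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  // Move to the Thursday of the same ISO week; its year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
  return d.getUTCFullYear() * 100 + week;
}

console.log(toYearWeek(new Date(Date.UTC(2024, 11, 9)))); // 202450
```

Under this numbering, 30 December 2024 maps to 202501, so subtracting 1 from a year_week value is not a reliable way to find the previous week. Storing prev_period_year_week explicitly avoids that arithmetic.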

Detailed Data: referring_domains

Per-referring-domain details for deep analysis.

CREATE TABLE referring_domains (
  target_domain TEXT NOT NULL,
  referring_domain TEXT NOT NULL,

  -- DataForSEO metrics
  rank INTEGER,
  backlinks_count INTEGER,
  spam_score REAL,

  -- Our classification (populated by classifier)
  domain_type TEXT,    -- news, blog, ecommerce, forum, etc.
  media_type TEXT,     -- owned, earned, paid, shared
  channel_bucket TEXT, -- pr, affiliate, community, etc.

  -- Timestamps
  dfs_first_seen INTEGER,
  dfs_lost_date INTEGER,
  our_first_seen INTEGER,
  our_last_seen INTEGER,

  UNIQUE(target_domain, referring_domain)
);

API Endpoints

Fetch Domain Summary (Weekly Snapshot)

Endpoint: POST /api/admin/referring-domains/summary

Cost: $0.02 per request

curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/summary \
-H "Content-Type: application/json" \
-d '{"target": "spotify.com"}'

Response:

{
  "success": true,
  "target": "spotify.com",
  "year_week": 202450,
  "inserted": true,
  "cost": 0.02,
  "summary": {
    "backlinks": 45678901,
    "referring_domains": 234567,
    "rank": 89,
    "spam_score": 12.5,
    "platform_types": {
      "blogs": 45000,
      "news": 12000,
      "ecommerce": 5000,
      "message-boards": 3000,
      "organization": 150000,
      "unknown": 19567
    }
  }
}
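The platform_types object in this response has to be folded into the count_* columns of domain_summaries. A minimal sketch of that mapping follows. The key-to-column table is an assumption (in particular that "message-boards" feeds count_forums), and `platformTypesToColumns` is a hypothetical name.

```javascript
// Hypothetical mapping from DataForSEO platform-type keys to count_* columns.
// Assumption: "message-boards" feeds count_forums; the source does not state this.
const PLATFORM_TO_COLUMN = {
  news: "count_news",
  blogs: "count_blogs",
  ecommerce: "count_ecommerce",
  "message-boards": "count_forums",
  social: "count_social",
};

function platformTypesToColumns(platformTypes) {
  const row = {};
  // Start every column at 0 so missing platform types are explicit (an assumption;
  // NULL may be preferable if "not reported" must be distinguished from zero).
  for (const column of Object.values(PLATFORM_TO_COLUMN)) row[column] = 0;
  for (const [type, count] of Object.entries(platformTypes || {})) {
    const column = PLATFORM_TO_COLUMN[type];
    if (column) row[column] += count; // unmapped keys (organization, unknown) are skipped
  }
  return row;
}
```

Keys without a column, such as organization and unknown, are dropped here because they are the catch-all buckets the classification pipeline is meant to resolve.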

Get Latest Summary

Endpoint: GET /api/admin/referring-domains/summary?target=spotify.com

{
  "target": "spotify.com",
  "current_week": 202450,
  "summary": {
    "year_week": 202450,
    "backlinks_count": 45678901,
    "referring_domains_count": 234567,
    "rank": 89,
    "spam_score": 12.5,
    "links_platform_types": {...},
    "prev_period_backlinks": 45000000,
    "prev_period_referring_domains": 230000
  },
  "deltas": {
    "backlinks_change": 678901,
    "backlinks_change_pct": "1.51",
    "referring_domains_change": 4567,
    "rank_change": -2
  }
}
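The deltas follow directly from the current and previous-period columns. The sketch below reproduces the numbers in the response above; `computeDeltas` is a hypothetical name, and rank_change is left out because the schema shown has no previous-rank column, so how it is derived is not documented here.

```javascript
// Hypothetical helper mirroring the "deltas" object in the response above.
function computeDeltas(summary) {
  const backlinksChange = summary.backlinks_count - summary.prev_period_backlinks;
  const domainsChange = summary.referring_domains_count - summary.prev_period_referring_domains;
  // Guard against division by zero when there is no previous period.
  const pct = summary.prev_period_backlinks
    ? ((backlinksChange / summary.prev_period_backlinks) * 100).toFixed(2)
    : null;
  return {
    backlinks_change: backlinksChange,
    backlinks_change_pct: pct, // string with two decimals, as in the sample response
    referring_domains_change: domainsChange,
  };
}
```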

Get Summary History

Endpoint: GET /api/admin/referring-domains/summary-history?target=spotify.com&limit=52

Returns up to 52 weeks of historical snapshots for trend analysis.

Fetch Referring Domains (Deep Pull)

Endpoint: POST /api/admin/referring-domains/fetch

Cost: $0.02 base + $0.00003 per row

curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/fetch \
-H "Content-Type: application/json" \
-d '{
"target": "spotify.com",
"limit": 1000,
"offset": 0
}'
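For domains with more referring domains than one request returns, the same body can be repeated with an increasing offset. The planner below is a sketch under two assumptions: offset is a zero-based row offset, and the total comes from a source such as referring_domains_count in the latest summary. `planDeepPull` is a hypothetical name.

```javascript
// Hypothetical planner: build the request bodies needed to page through
// `total` referring domains, `limit` rows at a time.
function planDeepPull(target, total, limit = 1000) {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError("limit must be a positive integer");
  }
  const pages = [];
  for (let offset = 0; offset < total; offset += limit) {
    // The last page only asks for the rows that remain.
    pages.push({ target, limit: Math.min(limit, total - offset), offset });
  }
  return pages;
}

console.log(planDeepPull("spotify.com", 2500).length); // 3
```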

List Referring Domains

Endpoint: GET /api/admin/referring-domains/list?target=spotify.com&limit=100

Debug Raw API Response

Endpoint: POST /api/admin/referring-domains/debug

curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/debug \
-H "Content-Type: application/json" \
-d '{"target": "spotify.com", "endpoint": "summary"}'

Classification Pipeline

DataForSEO Platform Types (Input Signals)

DataForSEO provides referring_links_platform_types:

| Type | Quality | Notes |
|------|---------|-------|
| news | Good | News/media sites |
| blogs | Good | Blog platforms |
| ecommerce | Good | E-commerce sites |
| message-boards | Good | Forums |
| social | Good | Social platforms |
| wikis | Good | Wiki/reference |
| educational | Good | .edu sites |
| governmental | Good | .gov sites |
| directory | Medium | Web directories |
| organization | Poor | Catch-all bucket |
| unknown | Poor | Unclassified |
| cms | Poor | Generic CMS sites |

Our Classification Layer

We use DataForSEO signals as INPUT, then apply our own classification:

DataForSEO platform_types + Google Ads Categories


┌─────────────────────────────────────────────────────────────┐
│ FREE STAGES │
├─────────────────────────────────────────────────────────────┤
│ Rules Engine ← TLD checks, URL patterns, known domains │
│ Google Ads ← Cached category data → tier1_type hint │
│ Vectorize ← Semantic similarity to known examples │
│ Low-Noise Crawl ← HEAD + 8KB GET, CMS/og:type detection │
└────────┬────────────────────────────────────────────────────┘
│ (handles ~85%)

┌─────────────────────────────────────────────────────────────┐
│ PAID STAGES (if needed) │
├─────────────────────────────────────────────────────────────┤
│ Instant Pages ← $0.000125 - Full page fetch │
│ LLM Fallback ← ~$0.0001 - Workers AI for ambiguous │
└────────┬────────────────────────────────────────────────────┘
│ (handles ~15%)

Final Classification (property_type + tier1_type)

Domain Types (Our Taxonomy)

| Type | Description |
|------|-------------|
| news | News/media publishers |
| blog | Personal/company blogs |
| ecommerce | Online stores |
| forum | Discussion forums |
| social | Social platforms |
| wiki | Reference/wiki sites |
| edu | Educational institutions |
| gov | Government sites |
| affiliate | Affiliate/coupon sites |
| directory | Web directories |
| saas | SaaS products |
| agency | Marketing/PR agencies |
| other | Unclassified |

OEPS Media Types

| Type | Description | Examples |
|------|-------------|----------|
| owned | Customer's own properties | Company blog, product pages |
| earned | Editorial coverage | News articles, reviews |
| paid | Sponsored content | Paid placements, ads |
| shared | User-generated | Forum posts, social mentions |

Cost Optimization

Summary vs Referring Domains

| Endpoint | Cost | Use Case |
|----------|------|----------|
| Summary | $0.02 fixed | Weekly tracking, totals |
| Referring Domains | $0.02 + $0.00003/row | Deep analysis, classification |

Strategy:

  1. Use Summary for weekly snapshots (cheap, gives totals + platform_types distribution)
  2. Use Referring Domains only when you need individual domain details
  3. For large domains (100k+ referring domains), paginate with offset
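To make the trade-off concrete, the cost of a full deep pull can be estimated from the prices above. This is a rough sketch that assumes the $0.02 base fee is charged once per paginated request; `estimateDeepPullCost` is a hypothetical name.

```javascript
const BASE_COST_PER_REQUEST = 0.02; // from the pricing table above
const COST_PER_ROW = 0.00003;

// Assumption: the $0.02 base fee is charged once per paginated request.
function estimateDeepPullCost(totalRows, pageSize = 1000) {
  const requests = Math.ceil(totalRows / pageSize);
  const cost = requests * BASE_COST_PER_REQUEST + totalRows * COST_PER_ROW;
  // Round to 5 decimals to hide floating-point noise.
  return { requests, cost: Number(cost.toFixed(5)) };
}

console.log(estimateDeepPullCost(100000)); // { requests: 100, cost: 5 }
```

By comparison, 52 weekly Summary calls for one domain cost 52 × $0.02 = $1.04 per year, which is why the deep pull is reserved for cases that need per-domain rows.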

Bulk Endpoints (Coming Soon)

DataForSEO offers bulk endpoints for cost optimization:

  • bulk_referring_domains - Up to 1000 targets per request
  • bulk_ranks - Quick rank checks

Weekly Tracking Flow

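// Cron trigger (weekly)
async function weeklyBacklinkTracking(env) {
  // Get every domain that already has at least one snapshot
  const domains = await env.DB.prepare(
    "SELECT DISTINCT target_domain FROM domain_summaries"
  ).all();

  for (const { target_domain } of domains.results) {
    try {
      // Fetch and store this week's snapshot
      await fetchAndStoreDomainSummary(target_domain, {}, env);
    } catch (err) {
      // Log and continue so one failed domain does not abort the whole run
      console.error(`Weekly snapshot failed for ${target_domain}:`, err);
    }
  }
}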

Integration with Top Domains

The existing top-domains endpoint pulls from category_domain_metrics (organic SEO data).

Future Enhancement: Add backlink metrics to domain profiles:

-- Join category_domain_metrics with each domain's latest domain_summaries row
SELECT
  cdm.domain,
  cdm.organic_etv,
  ds.backlinks_count,
  ds.referring_domains_count,
  ds.spam_score
FROM category_domain_metrics cdm
LEFT JOIN domain_summaries ds
  ON ds.target_domain = cdm.domain
  -- Keep the week filter in ON: in WHERE it would drop domains with no snapshot
  AND ds.year_week = (
    SELECT MAX(year_week) FROM domain_summaries WHERE target_domain = cdm.domain
  );

Classification System (Implemented)

The backlink classification system uses a two-tier approach:

  1. Domain Classification - Classify the domain once, cache it, reuse for all URLs
  2. URL Classification - Classify individual URLs, inheriting domain-level attributes

Domain Classification Pipeline

File: src/lib/domain-classifier.js

7-stage cost-optimized pipeline for classifying domains (FREE stages first, PAID only when needed):

flowchart LR
subgraph FREE["FREE Stages"]
S0[Cache] --> S1[Rules]
S1 --> S1_5[Google Ads<br/>Categories]
S1_5 --> S2[Vectorize]
S2 --> S3[Low-Noise<br/>Crawl]
end

subgraph PAID["PAID (only if needed)"]
S3 --> S4[Instant Pages]
S4 --> S4_5[Domain Patterns]
S4_5 --> S5[LLM]
end

S3 -->|"≥70%"| DONE[Done]
S4 -->|"≥70%"| DONE
S5 --> DONE

style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2

| Stage | Name | Cost | Description |
|-------|------|------|-------------|
| 0 | Cache | FREE | Check if domain already classified in D1 |
| 1 | Rules | FREE | Known domains, TLDs (.gov, .edu), subdomain services, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data to derive tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to known classified domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), extract <head> metadata, CMS detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch (only if low-noise insufficient) |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder/blocked pages |
| 5 | LLM | ~$0.0001 | Workers AI fallback for uncertain cases |
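The ordering above amounts to running stages from cheapest to most expensive and stopping once a result is confident enough. The following is a minimal sketch of that control flow, not the code in src/lib/domain-classifier.js: the stage shape, the `classifyWithStages` name, and applying the 70 threshold from the flowchart to every stage are all assumptions.

```javascript
// Minimal sketch of a staged classifier with early exit.
// Each stage is a hypothetical object { name, run }, where run(domain) resolves
// to { property_type, confidence } or null when the stage has no answer.
async function classifyWithStages(domain, stages, threshold = 70) {
  let best = null;
  for (const { name, run } of stages) {
    const result = await run(domain);
    if (!result) continue; // stage had nothing to say
    if (!best || result.confidence > best.confidence) {
      best = { ...result, source: name };
    }
    if (result.confidence >= threshold) break; // confident enough: skip later (paid) stages
  }
  return best; // null if no stage produced a result
}
```

Because the loop breaks on the first confident result, paid stages placed at the end of the array only run for domains that the free stages could not resolve.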

Low-Noise Crawl (Stage 3) - The key cost optimization:

  • Uses HEAD request + partial GET with Range header (first 8KB only)
  • Never executes JavaScript (avoids bot detection)
  • Extracts: title, description, canonical, robots, og:*, generator
  • Detects CMS from generator meta tag (WordPress, Shopify, Ghost, etc.)
  • Handles ~70% of domains without needing Instant Pages
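A rough sketch of the partial GET and metadata extraction described above follows. It omits the HEAD request and the canonical link, uses regular expressions rather than a real HTML parser, and all function names are hypothetical; the actual logic in src/lib/low-noise-crawler.js is not shown in this document.

```javascript
// Hypothetical sketch of Stage 3: fetch only the first 8 KB and read <head> metadata.
// `fetchImpl` defaults to the global fetch available in Workers and Node 18+.
async function fetchHeadChunk(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, { headers: { Range: "bytes=0-8191" } });
  // Servers may ignore Range and return the full body, so truncate defensively.
  return (await res.text()).slice(0, 8192);
}

function extractHeadMetadata(html) {
  const meta = {};
  const title = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  if (title) meta.title = title[1].trim();
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    // Collect quoted attributes regardless of their order inside the tag.
    const attrs = {};
    for (const [, key, , dq, sq] of tag.matchAll(/([\w:-]+)\s*=\s*("([^"]*)"|'([^']*)')/g)) {
      attrs[key.toLowerCase()] = dq ?? sq;
    }
    const name = (attrs.name || attrs.property || "").toLowerCase();
    if (name && attrs.content !== undefined) meta[name] = attrs.content;
  }
  return meta;
}

// Map the generator meta tag to a CMS name, or null when none is recognized.
function detectCms(meta) {
  const generator = (meta.generator || "").toLowerCase();
  return ["wordpress", "shopify", "ghost"].find((cms) => generator.includes(cms)) || null;
}
```

Nothing here executes page scripts: the response is treated as text, which matches the "never executes JavaScript" constraint.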

Self-Learning: High-confidence LLM results are stored back to Vectorize for future queries.

URL Classification Pipeline

File: src/lib/backlink-classifier.js

5-stage pipeline for classifying URLs:

| Stage | Name | Cost | Description |
|-------|------|------|-------------|
| 0 | Domain Cache | Free | Check cached domain classification |
| 1 | Rules | Free | URL patterns, TLDs, known sites |
| 2 | Vectorize | ~$0.00001 | Similarity to labeled examples |
| 3 | Content Parse | $0.000125 | DataForSEO Instant Pages API |
| 4 | LLM | ~$0.0001 | Workers AI fallback |

Key Optimization: If domain is already classified, URL classifier skips LLM for domain-level attributes (~90% cost reduction).

What Gets Classified

When processing backlinks:

| Message Type | What's Classified | Description |
|--------------|-------------------|-------------|
| classify_referring_domain | Source domain | The domain that has the backlink TO you |
| classify_url | Source URL | The specific page with the backlink |

Note: The target_domain parameter is YOUR domain (customer's site) - used only for "owned" detection.
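One plausible form of that "owned" check is a hostname comparison. This sketch treats the target domain and its subdomains as owned; `isOwnedDomain` is a hypothetical name, and the real detection may use additional signals.

```javascript
// Hypothetical check: a referring domain counts as "owned" when it is the
// target domain itself or one of its subdomains.
function isOwnedDomain(referringDomain, targetDomain) {
  const ref = referringDomain.toLowerCase().replace(/^www\./, "");
  const target = targetDomain.toLowerCase().replace(/^www\./, "");
  // The leading dot prevents "notspotify.com" from matching "spotify.com".
  return ref === target || ref.endsWith("." + target);
}

console.log(isOwnedDomain("blog.spotify.com", "spotify.com")); // true
console.log(isOwnedDomain("notspotify.com", "spotify.com")); // false
```

A comparison like this cannot recognize properties the customer owns on unrelated domains; those would need an explicit allowlist.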

New Taxonomy

Property Types (replaces domain_type) - ~45 types:

  • saas_product, ecommerce_store, news_publisher, blog_content_site
  • forum_community_board, ugc_platform, service_business
  • government, education, nonprofit_organization
  • And more...

Channels (8 high-level buckets):

  • search, social_networks, ugc_communities, news_media
  • pr_distribution, directories_listings, affiliate_partner, risky_gray

Tactic Categories (10 parent buckets):

  • pr, haro, link_building, affiliate, ugc
  • owned, programmatic, influencer, marketplace, blackhat

Page Type Categories (7 parent buckets):

  • editorial, commercial, ugc, programmatic
  • utility, asset, risky

Media Types (the OEPS buckets above, also known as the PESO model):

  • paid, earned, shared, owned

Domain Classification API

Classify Single Domain (sync):

curl -X POST https://your-worker.workers.dev/api/admin/classifier/domain \
-H "Content-Type: application/json" \
-d '{"domain": "atlassian.com"}'

Classify Single Domain (async via queue):

curl -X POST https://your-worker.workers.dev/api/admin/classifier/domain \
-H "Content-Type: application/json" \
-d '{"domain": "atlassian.com", "async": true}'

Classify Multiple Domains (async):

curl -X POST https://your-worker.workers.dev/api/admin/classifier/domains \
-H "Content-Type: application/json" \
-d '{"domains": ["atlassian.com", "hubspot.com", "salesforce.com"]}'

Get Cached Domain Classification:

curl https://your-worker.workers.dev/api/admin/classifier/domain/atlassian.com

Get Classification Stats:

curl https://your-worker.workers.dev/api/admin/classifier/domain-stats

Queue Configuration

| Queue | Binding | Purpose |
|-------|---------|---------|
| backlink-classify | BACKLINK_CLASSIFY_QUEUE | URL classification |
| domain-classify | DOMAIN_CLASSIFY_QUEUE | Domain classification |
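A producer can push jobs onto these bindings in chunks. The sketch below makes several assumptions: `queue` is the DOMAIN_CLASSIFY_QUEUE binding, the message body shape is illustrative, classify_referring_domain messages go to this queue, and sendBatch accepts at most 100 messages per call (worth checking against the current Cloudflare Queues limits).

```javascript
// Sketch: enqueue domain classification jobs in chunks.
// `enqueueDomainJobs` and the body fields are hypothetical, not taken from the codebase.
async function enqueueDomainJobs(queue, domains, targetDomain, batchSize = 100) {
  let sent = 0;
  for (let i = 0; i < domains.length; i += batchSize) {
    const batch = domains.slice(i, i + batchSize).map((domain) => ({
      body: { type: "classify_referring_domain", domain, target_domain: targetDomain },
    }));
    await queue.sendBatch(batch); // one call per chunk
    sent += batch.length;
  }
  return sent;
}
```

Inside a Worker this would be called as `enqueueDomainJobs(env.DOMAIN_CLASSIFY_QUEUE, domains, "spotify.com")`.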

Database Tables

domains table (new columns):

property_type TEXT,
channel TEXT,
subchannel TEXT,
media_type TEXT,
domain_tech_type TEXT,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
last_classified_at INTEGER

domain_classifications (audit table):

id INTEGER PRIMARY KEY,
domain TEXT,
property_type TEXT,
channel TEXT,
subchannel TEXT,
media_type TEXT,
quality_tier TEXT,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
llm_reasoning TEXT,
created_at INTEGER

Files

| File | Purpose |
|------|---------|
| src/lib/classification-taxonomy.js | Taxonomy constants and helpers |
| src/lib/domain-classifier.js | Domain classification pipeline (7 stages) |
| src/lib/low-noise-crawler.js | FREE crawler using HEAD + partial GET (Stage 3) |
| src/lib/backlink-classifier.js | URL classification pipeline |
| src/lib/classifier-rules-engine.js | Rules-based URL classification |
| src/queue/domain-classify-consumer.js | Domain classification queue consumer |
| src/queue/backlink-classify-consumer.js | URL classification queue consumer |
| src/endpoints/admin-classifier.js | API endpoints |

Future Phases

Phase 3: Target URL Classification (Implemented)

Classifies the customer's pages that receive backlinks (the "target" URLs).

File: src/lib/target-url-classifier.js

Key difference: Target URL classification is 100% rule-based and FREE (no LLM/API costs). Since these are the customer's own pages, we don't need expensive external classification - URL patterns are sufficient.

Target Page Types (~40 types)

| Type | Description |
|------|-------------|
| homepage | Main site homepage |
| product_page | Product detail page |
| pricing_page | Pricing/plans page |
| blog_post | Blog article |
| case_study | Customer case study |
| documentation_page | Docs/guides |
| landing_page | Marketing landing page |
| signup_page | Registration page |
| app_page | Mobile app landing |
| integrations_page | Integrations directory |

And more...

Target Page Categories

| Category | Description | Examples |
|----------|-------------|----------|
| commercial | Revenue-driving pages | Homepage, pricing, product |
| editorial | Content pages | Blog, news, case studies |
| resource | Support/help content | Docs, FAQ, guides |
| documentation | Technical docs | API docs, tutorials |
| utility | Functional pages | Login, signup, legal |

Money Pages

High-value pages that drive conversions are flagged as "money pages":

  • Homepage
  • Pricing page
  • Product pages
  • Demo/trial pages
  • Signup/registration pages
  • Landing pages
  • Enterprise pages

API & Message Types

Message Types (backlink-classify queue):

// Classify both source AND target URLs for a backlink
{ type: "classify_backlink", backlink_id, source_url, source_domain, target_url, target_domain, domain_rank }

// Classify just the target URL
{ type: "classify_target_url", backlink_id, target_url, target_domain }

Database Columns (backlinks table):

tgt_page_type TEXT,                    -- homepage, product_page, blog_post, etc.
tgt_page_category TEXT,                -- commercial, editorial, resource, etc.
tgt_url_pattern TEXT,                  -- The pattern that matched
tgt_is_money_page INTEGER,             -- 1 if high-value conversion page
tgt_classification_source TEXT,        -- Always 'rules' (rule-based)
tgt_classification_confidence INTEGER

Database Columns (urls table):

is_money_page INTEGER DEFAULT 0,
page_category TEXT,
url_pattern TEXT

Usage

import { classifyTargetUrl } from '../lib/target-url-classifier.js';

const result = classifyTargetUrl('https://spotify.com/premium');
// Returns:
// {
// page_type: 'product_page',
// page_category: 'commercial',
// is_money_page: true,
// url_pattern: '/premium',
// classification_source: 'rules',
// classification_confidence: 90
// }

Phase 4: Brand-Level Aggregation (Not Yet Implemented)

  • Roll up backlink data by brand
  • Cross-domain brand profiles
  • Marketing DNA reports

Troubleshooting

No Data Returned

Check:

  1. Domain format (use root domain, e.g., "spotify.com" not "www.spotify.com")
  2. DataForSEO credentials configured
  3. Domain has backlinks in DataForSEO index
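For the first check, input can be normalized before it is sent to the API. This is a minimal sketch with a hypothetical `normalizeTarget` helper: it strips the scheme, path, port, and a leading "www." but does not compute the registrable domain, so other subdomains are returned unchanged.

```javascript
// Hypothetical normalizer for the "target" parameter.
// No public-suffix lookup is done, so "blog.spotify.com" stays as-is.
function normalizeTarget(input) {
  const trimmed = input.trim().toLowerCase();
  // Add a scheme when missing so the URL parser accepts bare hostnames.
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//.test(trimmed) ? trimmed : "https://" + trimmed;
  const host = new URL(withScheme).hostname;
  return host.replace(/^www\./, "");
}

console.log(normalizeTarget("https://www.spotify.com/premium")); // spotify.com
```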

High Spam Score

Investigate with referring domains list:

curl "https://your-worker.workers.dev/api/admin/referring-domains/list?target=example.com&order_by=backlinks"

Missing Platform Types

DataForSEO doesn't classify all referring domains. The organization, unknown, and cms buckets are catch-alls. Our classification pipeline handles these.


Summary

| Feature | Status | Notes |
|---------|--------|-------|
| Domain Summaries Table | Complete | Weekly snapshots |
| Summary API Client | Complete | $0.02/request |
| Referring Domains API | Complete | Individual domain details |
| Admin Endpoints | Complete | Fetch, list, history |
| Domain Classification Pipeline | Complete | Phase 1-2, 7-stage pipeline |
| URL Classification Pipeline | Complete | Phase 1-2, 5-stage pipeline |
| Target URL Classification | Complete | Phase 3, rule-based (free) |
| Time-Series Charts | Planned | Frontend integration |
| OEPS Classification | Complete | Part of classification taxonomy |
| Brand-Level Aggregation | Planned | Phase 4 |