Backlink Intelligence System
Marketing DNA and backlink profile analysis using the DataForSEO Backlinks API.
Overview
The backlink intelligence system provides:
- Domain Summary Tracking - Weekly snapshots of backlink metrics ($0.02/request)
- Referring Domains - Detailed per-domain backlink data ($0.02 + $0.00003/row)
- Classification Pipeline - Domain/URL type classification (news, blog, affiliate, etc.)
- OEPS Classification - Owned/Earned/Paid/Shared media type determination
- Time-Series Analytics - Track backlink growth over time (like app rankings)
Architecture
┌───────────────────────────────────────────────────────────────┐
│                     Backlink Intelligence                     │
├───────────────────────────────────────────────────────────────┤
│  Weekly Cron                                                  │
│  └─> For each tracked domain:                                 │
│      └─> DataForSEO Summary ($0.02)                           │
│          └─> Store domain_summaries snapshot                  │
├───────────────────────────────────────────────────────────────┤
│  On-Demand Deep Pull                                          │
│  └─> DataForSEO Referring Domains ($0.02 + $0.00003/row)      │
│      └─> Store referring_domains (individual domain details)  │
│          └─> Queue classification jobs                        │
└───────────────────────────────────────────────────────────────┘
Data Model
Weekly Snapshots: domain_summaries
Denormalized weekly snapshots (like app_category_rankings).
CREATE TABLE domain_summaries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  target_domain TEXT NOT NULL,
  year_week INTEGER NOT NULL,  -- YYYYWW format (e.g., 202450)
  -- Core metrics from DataForSEO
  rank INTEGER,
  backlinks_count INTEGER,
  referring_domains_count INTEGER,
  spam_score REAL,
  -- Platform type counts (from DataForSEO signals)
  count_news INTEGER,
  count_blogs INTEGER,
  count_ecommerce INTEGER,
  count_forums INTEGER,
  count_social INTEGER,
  -- OEPS media type counts (our classification)
  count_owned INTEGER,
  count_earned INTEGER,
  count_paid INTEGER,
  count_shared INTEGER,
  -- Previous period tracking
  prev_period_backlinks INTEGER,
  prev_period_referring_domains INTEGER,
  prev_period_year_week INTEGER,
  UNIQUE(target_domain, year_week)
);
Key Points:
- One row per domain per week
- year_week is in YYYYWW format (e.g., 202450 = week 50 of 2024); see the sketch below
- Previous period columns enable delta calculations
- Platform type counts come from DataForSEO's referring_links_platform_types
- OEPS counts are populated by our classification pipeline
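A minimal sketch of deriving that year_week value, assuming ISO-8601 week numbering; yearWeekFor is a hypothetical helper shown for illustration, not necessarily how the production code computes it:

// Hypothetical helper: derive YYYYWW (e.g., 202450) from a Date.
// Assumes ISO-8601 week numbering; the production implementation may differ.
function yearWeekFor(date = new Date()) {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // Shift to the Thursday of the current ISO week (ISO weeks belong to
  // the year that contains their Thursday).
  const day = d.getUTCDay() || 7;           // Sunday -> 7
  d.setUTCDate(d.getUTCDate() + 4 - day);
  const yearStart = new Date(Date.UTC(d.getUTCFullYear(), 0, 1));
  const week = Math.ceil(((d - yearStart) / 86400000 + 1) / 7);
  return d.getUTCFullYear() * 100 + week;   // e.g., 2024 * 100 + 50 = 202450
}

// yearWeekFor(new Date("2024-12-11")) -> 202450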
Detailed Data: referring_domains
Per-referring-domain details for deep analysis.
CREATE TABLE referring_domains (
  target_domain TEXT NOT NULL,
  referring_domain TEXT NOT NULL,
  -- DataForSEO metrics
  rank INTEGER,
  backlinks_count INTEGER,
  spam_score REAL,
  -- Our classification (populated by classifier)
  domain_type TEXT,     -- news, blog, ecommerce, forum, etc.
  media_type TEXT,      -- owned, earned, paid, shared
  channel_bucket TEXT,  -- pr, affiliate, community, etc.
  -- Timestamps
  dfs_first_seen INTEGER,
  dfs_lost_date INTEGER,
  our_first_seen INTEGER,
  our_last_seen INTEGER,
  UNIQUE(target_domain, referring_domain)
);
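To illustrate how this table is meant to be queried, here is a sketch of an OEPS breakdown per target domain using the same env.DB (D1) binding that appears in the weekly tracking example later in this doc; the query is illustrative, not an existing endpoint:

// Illustrative D1 query: OEPS media-type breakdown for one target domain.
// Assumes the env.DB binding used elsewhere in this project.
async function mediaTypeBreakdown(targetDomain, env) {
  const { results } = await env.DB.prepare(
    `SELECT media_type, COUNT(*) AS domains, SUM(backlinks_count) AS backlinks
       FROM referring_domains
      WHERE target_domain = ?1 AND media_type IS NOT NULL
      GROUP BY media_type
      ORDER BY domains DESC`
  ).bind(targetDomain).all();
  return results; // e.g., [{ media_type: "earned", domains: 1234, backlinks: 56789 }, ...]
}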
API Endpoints
Fetch Domain Summary (Weekly Snapshot)
Endpoint: POST /api/admin/referring-domains/summary
Cost: $0.02 per request
curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/summary \
-H "Content-Type: application/json" \
-d '{"target": "spotify.com"}'
Response:
{
  "success": true,
  "target": "spotify.com",
  "year_week": 202450,
  "inserted": true,
  "cost": 0.02,
  "summary": {
    "backlinks": 45678901,
    "referring_domains": 234567,
    "rank": 89,
    "spam_score": 12.5,
    "platform_types": {
      "blogs": 45000,
      "news": 12000,
      "ecommerce": 5000,
      "message-boards": 3000,
      "organization": 150000,
      "unknown": 19567
    }
  }
}
Get Latest Summary
Endpoint: GET /api/admin/referring-domains/summary?target=spotify.com
{
  "target": "spotify.com",
  "current_week": 202450,
  "summary": {
    "year_week": 202450,
    "backlinks_count": 45678901,
    "referring_domains_count": 234567,
    "rank": 89,
    "spam_score": 12.5,
    "links_platform_types": {...},
    "prev_period_backlinks": 45000000,
    "prev_period_referring_domains": 230000
  },
  "deltas": {
    "backlinks_change": 678901,
    "backlinks_change_pct": "1.51",
    "referring_domains_change": 4567,
    "rank_change": -2
  }
}
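The deltas follow directly from the prev_period_* columns; a small sketch of the implied math (computeDeltas is a hypothetical helper, and rank_change additionally needs the prior snapshot's rank):

// Sketch of the delta math implied by the response above; computeDeltas is
// a hypothetical helper, not necessarily the endpoint's implementation.
// (rank_change would also need the previous snapshot's rank, omitted here.)
function computeDeltas(summary) {
  const backlinksChange = summary.backlinks_count - summary.prev_period_backlinks;
  return {
    backlinks_change: backlinksChange,
    backlinks_change_pct: ((backlinksChange / summary.prev_period_backlinks) * 100).toFixed(2),
    referring_domains_change:
      summary.referring_domains_count - summary.prev_period_referring_domains,
  };
}

// computeDeltas({ backlinks_count: 45678901, prev_period_backlinks: 45000000,
//                 referring_domains_count: 234567, prev_period_referring_domains: 230000 })
// -> { backlinks_change: 678901, backlinks_change_pct: "1.51", referring_domains_change: 4567 }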
Get Summary History
Endpoint: GET /api/admin/referring-domains/summary-history?target=spotify.com&limit=52
Returns up to 52 weeks of historical snapshots for trend analysis.
Fetch Referring Domains (Deep Pull)
Endpoint: POST /api/admin/referring-domains/fetch
Cost: $0.02 base + $0.00003 per row
curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/fetch \
-H "Content-Type: application/json" \
-d '{
"target": "spotify.com",
"limit": 1000,
"offset": 0
}'
List Referring Domains
Endpoint: GET /api/admin/referring-domains/list?target=spotify.com&limit=100
Debug Raw API Response
Endpoint: POST /api/admin/referring-domains/debug
curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/debug \
-H "Content-Type: application/json" \
-d '{"target": "spotify.com", "endpoint": "summary"}'
Classification Pipeline
DataForSEO Platform Types (Input Signals)
DataForSEO provides referring_links_platform_types:
| Type | Quality | Notes |
|---|---|---|
| news | Good | News/media sites |
| blogs | Good | Blog platforms |
| ecommerce | Good | E-commerce sites |
| message-boards | Good | Forums |
| social | Good | Social platforms |
| wikis | Good | Wiki/reference |
| educational | Good | .edu sites |
| governmental | Good | .gov sites |
| directory | Medium | Web directories |
| organization | Poor | Catch-all bucket |
| unknown | Poor | Unclassified |
| cms | Poor | Generic CMS sites |
Our Classification Layer
We use DataForSEO signals as INPUT, then apply our own classification:
      DataForSEO platform_types + Google Ads Categories
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                         FREE STAGES                          │
├─────────────────────────────────────────────────────────────┤
│  Rules Engine ← TLD checks, URL patterns, known domains      │
│  Google Ads ← Cached category data → tier1_type hint         │
│  Vectorize ← Semantic similarity to known examples           │
│  Low-Noise Crawl ← HEAD + 8KB GET, CMS/og:type detection     │
└────────┬────────────────────────────────────────────────────┘
         │ (handles ~85%)
         ▼
┌─────────────────────────────────────────────────────────────┐
│                   PAID STAGES (if needed)                    │
├─────────────────────────────────────────────────────────────┤
│  Instant Pages ← $0.000125 - Full page fetch                 │
│  LLM Fallback ← ~$0.0001 - Workers AI for ambiguous          │
└────────┬────────────────────────────────────────────────────┘
         │ (handles ~15%)
         ▼
  Final Classification (property_type + tier1_type)
Domain Types (Our Taxonomy)
| Type | Description |
|---|---|
| news | News/media publishers |
| blog | Personal/company blogs |
| ecommerce | Online stores |
| forum | Discussion forums |
| social | Social platforms |
| wiki | Reference/wiki sites |
| edu | Educational institutions |
| gov | Government sites |
| affiliate | Affiliate/coupon sites |
| directory | Web directories |
| saas | SaaS products |
| agency | Marketing/PR agencies |
| other | Unclassified |
OEPS Media Types
| Type | Description | Examples |
|---|---|---|
| owned | Customer's own properties | Company blog, product pages |
| earned | Editorial coverage | News articles, reviews |
| paid | Sponsored content | Paid placements, ads |
| shared | User-generated | Forum posts, social mentions |
Cost Optimization
Summary vs Referring Domains
| Endpoint | Cost | Use Case |
|---|---|---|
| Summary | $0.02 fixed | Weekly tracking, totals |
| Referring Domains | $0.02 + $0.00003/row | Deep analysis, classification |
Strategy:
- Use Summary for weekly snapshots (cheap, gives totals + platform_types distribution)
- Use Referring Domains only when you need individual domain details
- For large domains (100k+ referring domains), paginate with offset
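For the pagination case, a rough client-side sketch against the fetch endpoint documented above; workerUrl, the page-size defaults, and the rows_fetched stop condition are placeholders/assumptions rather than guaranteed response fields:

// Illustrative pagination loop over the deep-pull endpoint documented above.
// workerUrl is a placeholder; limit/offset semantics are as described in this doc.
async function fetchAllReferringDomains(workerUrl, target, { pageSize = 1000, maxPages = 10 } = {}) {
  for (let page = 0; page < maxPages; page++) {
    const res = await fetch(`${workerUrl}/api/admin/referring-domains/fetch`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ target, limit: pageSize, offset: page * pageSize }),
    });
    const data = await res.json();
    // Stop once a page comes back short (assumed convention; adjust to the real response shape).
    if (!data.success || (data.rows_fetched ?? 0) < pageSize) break;
  }
}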
Bulk Endpoints (Coming Soon)
DataForSEO offers bulk endpoints for cost optimization:
- bulk_referring_domains - Up to 1000 targets per request
- bulk_ranks - Quick rank checks
Weekly Tracking Flow
// Cron trigger (weekly)
async function weeklyBacklinkTracking(env) {
  // Get tracked domains
  const domains = await env.DB.prepare(
    "SELECT DISTINCT target_domain FROM domain_summaries"
  ).all();

  for (const { target_domain } of domains.results) {
    // Fetch and store weekly snapshot
    await fetchAndStoreDomainSummary(target_domain, {}, env);
  }
}
Integration with Top Domains
The existing top-domains endpoint pulls from category_domain_metrics (organic SEO data).
Future Enhancement: Add backlink metrics to domain profiles:
-- Join domain_summaries with category_domain_metrics
SELECT
  cdm.domain,
  cdm.organic_etv,
  ds.backlinks_count,
  ds.referring_domains_count,
  ds.spam_score
FROM category_domain_metrics cdm
LEFT JOIN domain_summaries ds ON cdm.domain = ds.target_domain
WHERE ds.year_week = (SELECT MAX(year_week) FROM domain_summaries WHERE target_domain = cdm.domain)
Classification System (Implemented)
The backlink classification system uses a two-tier approach:
- Domain Classification - Classify the domain once, cache it, reuse for all URLs
- URL Classification - Classify individual URLs, inheriting domain-level attributes
Domain Classification Pipeline
File: src/lib/domain-classifier.js
7-stage cost-optimized pipeline for classifying domains (FREE stages first, PAID only when needed):
flowchart LR
    subgraph FREE["FREE Stages"]
        S0[Cache] --> S1[Rules]
        S1 --> S1_5[Google Ads<br/>Categories]
        S1_5 --> S2[Vectorize]
        S2 --> S3[Low-Noise<br/>Crawl]
    end
    subgraph PAID["PAID (only if needed)"]
        S3 --> S4[Instant Pages]
        S4 --> S4_5[Domain Patterns]
        S4_5 --> S5[LLM]
    end
    S3 -->|"≥70%"| DONE[Done]
    S4 -->|"≥70%"| DONE
    S5 --> DONE

    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2
| Stage | Name | Cost | Description |
|---|---|---|---|
| 0 | Cache | FREE | Check if domain already classified in D1 |
| 1 | Rules | FREE | Known domains, TLDs (.gov, .edu), subdomain services, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data to derive tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to known classified domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), extract <head> metadata, CMS detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch (only if low-noise insufficient) |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder/blocked pages |
| 5 | LLM | ~$0.0001 | Workers AI fallback for uncertain cases |
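Conceptually, the pipeline is an ordered list of stages with an early exit at the ≥70% confidence threshold shown in the flowchart. A simplified sketch (the stage functions are supplied by the caller; this is not the actual domain-classifier.js code):

// Conceptual sketch of the stage ordering and early exit; `stages` is an
// array of { name, run } objects supplied by the caller, standing in for
// the real stage implementations in domain-classifier.js.
const CONFIDENCE_THRESHOLD = 70; // matches the ">=70%" exit in the flowchart

async function runClassificationPipeline(domain, env, stages) {
  for (const { name, run } of stages) {
    const result = await run(domain, env); // each stage returns null or { ...fields, confidence }
    if (result && result.confidence >= CONFIDENCE_THRESHOLD) {
      return { ...result, classification_source: name };
    }
  }
  // Nothing confident enough: fall back to a low-confidence default.
  return { property_type: "other", classification_source: "none", confidence: 0 };
}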
Low-Noise Crawl (Stage 3) - The key cost optimization:
- Uses HEAD request + partial GET with Range header (first 8KB only)
- Never executes JavaScript (avoids bot detection)
- Extracts: title, description, canonical, robots, og:*, generator
- Detects CMS from generator meta tag (WordPress, Shopify, Ghost, etc.)
- Handles ~70% of domains without needing Instant Pages
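A rough approximation of that fetch pattern (not the actual low-noise-crawler.js code), using only a HEAD request plus a ranged GET with standard fetch:

// Approximation of the low-noise fetch described above: HEAD first, then a
// ranged GET for the first 8KB. Servers that ignore Range simply return the
// full body; no JavaScript is ever executed.
async function lowNoiseFetch(domain) {
  const url = `https://${domain}/`;
  const head = await fetch(url, { method: "HEAD", redirect: "follow" });

  const partial = await fetch(url, {
    headers: { Range: "bytes=0-8191" }, // first 8KB only
    redirect: "follow",
  });
  const html = await partial.text();

  // Pull a few <head> signals with simple regexes (title, generator meta).
  const title = (html.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1] || null;
  const generator = (html.match(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']+)/i) || [])[1] || null;

  return {
    status: head.status,
    contentType: head.headers.get("content-type"),
    title,
    generator, // e.g., "WordPress 6.x", "Shopify", "Ghost 5.x"
  };
}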
Self-Learning: High-confidence LLM results are stored back to Vectorize for future queries.
URL Classification Pipeline
File: src/lib/backlink-classifier.js
5-stage pipeline for classifying URLs:
| Stage | Name | Cost | Description |
|---|---|---|---|
| 0 | Domain Cache | Free | Check cached domain classification |
| 1 | Rules | Free | URL patterns, TLDs, known sites |
| 2 | Vectorize | ~$0.00001 | Similarity to labeled examples |
| 3 | Content Parse | $0.000125 | DataForSEO Instant Pages API |
| 4 | LLM | ~$0.0001 | Workers AI fallback |
Key Optimization: If domain is already classified, URL classifier skips LLM for domain-level attributes (~90% cost reduction).
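A condensed, hypothetical view of that short-circuit (the real logic lives in backlink-classifier.js; the column names come from the domains table described below):

// Hypothetical condensation of the domain-cache short-circuit: if the source
// domain is already classified, reuse its domain-level attributes and only
// run the cheap URL-level stages for page-specific fields.
async function classifyUrlSketch(url, env, { classifyDomain, classifyPageOnly }) {
  const domain = new URL(url).hostname;

  // Reuse the cached domain-level classification when we have one.
  const cached = await env.DB.prepare(
    "SELECT property_type, channel, media_type FROM domains WHERE domain = ?1"
  ).bind(domain).first();

  const domainAttrs = cached ?? await classifyDomain(domain, env); // full pipeline only on a miss
  const pageAttrs = await classifyPageOnly(url, env);              // cheap URL-level stages (rules/vectorize)
  return { ...domainAttrs, ...pageAttrs };
}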
What Gets Classified
When processing backlinks:
| Message Type | What's Classified | Description |
|---|---|---|
| classify_referring_domain | Source domain | The domain that has the backlink TO you |
| classify_url | Source URL | The specific page with the backlink |
Note: The target_domain parameter is YOUR domain (customer's site) - used only for "owned" detection.
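A minimal sketch of the owned check implied by that note; the real classifier may also consult a list of known owned properties:

// Illustrative "owned" check: a backlink is treated as owned media when the
// source domain is the customer's own domain (or a subdomain of it).
function isOwned(sourceDomain, targetDomain) {
  const src = sourceDomain.toLowerCase().replace(/^www\./, "");
  const tgt = targetDomain.toLowerCase().replace(/^www\./, "");
  return src === tgt || src.endsWith(`.${tgt}`);
}

// isOwned("blog.spotify.com", "spotify.com") -> true   (owned)
// isOwned("techcrunch.com", "spotify.com")   -> false  (earned/paid/shared per later stages)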
New Taxonomy
Property Types (replaces domain_type) - ~45 types:
saas_product, ecommerce_store, news_publisher, blog_content_site, forum_community_board, ugc_platform, service_business, government, education, nonprofit_organization, and more...
Channels (8 high-level buckets):
search, social_networks, ugc_communities, news_media, pr_distribution, directories_listings, affiliate_partner, risky_gray
Tactic Categories (10 parent buckets):
pr, haro, link_building, affiliate, ugc, owned, programmatic, influencer, marketplace, blackhat
Page Type Categories (7 parent buckets):
editorial, commercial, ugc, programmatic, utility, asset, risky
Media Types (PESO model):
paid,earned,shared,owned
Domain Classification API
Classify Single Domain (sync):
curl -X POST https://your-worker.workers.dev/api/admin/classifier/domain \
-H "Content-Type: application/json" \
-d '{"domain": "atlassian.com"}'
Classify Single Domain (async via queue):
curl -X POST https://your-worker.workers.dev/api/admin/classifier/domain \
-H "Content-Type: application/json" \
-d '{"domain": "atlassian.com", "async": true}'
Classify Multiple Domains (async):
curl -X POST https://your-worker.workers.dev/api/admin/classifier/domains \
-H "Content-Type: application/json" \
-d '{"domains": ["atlassian.com", "hubspot.com", "salesforce.com"]}'
Get Cached Domain Classification:
curl https://your-worker.workers.dev/api/admin/classifier/domain/atlassian.com
Get Classification Stats:
curl https://your-worker.workers.dev/api/admin/classifier/domain-stats
Queue Configuration
| Queue | Binding | Purpose |
|---|---|---|
| backlink-classify | BACKLINK_CLASSIFY_QUEUE | URL classification |
| domain-classify | DOMAIN_CLASSIFY_QUEUE | Domain classification |
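A short sketch of producing work onto these queues from a Worker. Queue send()/sendBatch() are the standard Cloudflare Queues producer APIs, but the exact message shape used for domain-classify here is an assumption; the documented shapes for the backlink-classify queue appear under Phase 3 below.

// Sketch of enqueueing domain-classification work with the bindings above.
// The message body shape is assumed for illustration.
async function enqueueDomainClassification(domains, env) {
  await env.DOMAIN_CLASSIFY_QUEUE.sendBatch(
    domains.map((domain) => ({ body: { type: "classify_referring_domain", domain } }))
  );
}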
Database Tables
domains table (new columns):
property_type TEXT,
channel TEXT,
subchannel TEXT,
media_type TEXT,
domain_tech_type TEXT,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
last_classified_at INTEGER
domain_classifications (audit table):
id INTEGER PRIMARY KEY,
domain TEXT,
property_type TEXT,
channel TEXT,
subchannel TEXT,
media_type TEXT,
quality_tier TEXT,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
llm_reasoning TEXT,
created_at INTEGER
Files
| File | Purpose |
|---|---|
| src/lib/classification-taxonomy.js | Taxonomy constants and helpers |
| src/lib/domain-classifier.js | Domain classification pipeline (7 stages) |
| src/lib/low-noise-crawler.js | FREE crawler using HEAD + partial GET (Stage 3) |
| src/lib/backlink-classifier.js | URL classification pipeline |
| src/lib/classifier-rules-engine.js | Rules-based URL classification |
| src/queue/domain-classify-consumer.js | Domain classification queue consumer |
| src/queue/backlink-classify-consumer.js | URL classification queue consumer |
| src/endpoints/admin-classifier.js | API endpoints |
Phases
Phase 3: Target URL Classification (Implemented)
Classifies the customer's pages that receive backlinks (the "target" URLs).
File: src/lib/target-url-classifier.js
Key difference: Target URL classification is 100% rule-based and FREE (no LLM/API costs). Since these are the customer's own pages, we don't need expensive external classification - URL patterns are sufficient.
Target Page Types (~40 types)
| Type | Description |
|---|---|
| homepage | Main site homepage |
| product_page | Product detail page |
| pricing_page | Pricing/plans page |
| blog_post | Blog article |
| case_study | Customer case study |
| documentation_page | Docs/guides |
| landing_page | Marketing landing page |
| signup_page | Registration page |
| app_page | Mobile app landing |
| integrations_page | Integrations directory |
| ... | And more... |
Target Page Categories
| Category | Description | Examples |
|---|---|---|
| commercial | Revenue-driving pages | Homepage, pricing, product |
| editorial | Content pages | Blog, news, case studies |
| resource | Support/help content | Docs, FAQ, guides |
| documentation | Technical docs | API docs, tutorials |
| utility | Functional pages | Login, signup, legal |
Money Pages
High-value pages that drive conversions are flagged as "money pages":
- Homepage
- Pricing page
- Product pages
- Demo/trial pages
- Signup/registration pages
- Landing pages
- Enterprise pages
API & Message Types
Message Types (backlink-classify queue):
// Classify both source AND target URLs for a backlink
{ type: "classify_backlink", backlink_id, source_url, source_domain, target_url, target_domain, domain_rank }
// Classify just the target URL
{ type: "classify_target_url", backlink_id, target_url, target_domain }
Database Columns (backlinks table):
tgt_page_type TEXT, -- homepage, product_page, blog_post, etc.
tgt_page_category TEXT, -- commercial, editorial, resource, etc.
tgt_url_pattern TEXT, -- The pattern that matched
tgt_is_money_page INTEGER, -- 1 if high-value conversion page
tgt_classification_source TEXT, -- Always 'rules' (rule-based)
tgt_classification_confidence INTEGER
Database Columns (urls table):
is_money_page INTEGER DEFAULT 0,
page_category TEXT,
url_pattern TEXT
Usage
import { classifyTargetUrl } from '../lib/target-url-classifier.js';
const result = classifyTargetUrl('https://spotify.com/premium');
// Returns:
// {
// page_type: 'product_page',
// page_category: 'commercial',
// is_money_page: true,
// url_pattern: '/premium',
// classification_source: 'rules',
// classification_confidence: 90
// }
Phase 4: Brand-Level Aggregation (Not Yet Implemented)
- Roll up backlink data by brand
- Cross-domain brand profiles
- Marketing DNA reports
Troubleshooting
No Data Returned
Check:
- Domain format (use root domain, e.g., "spotify.com" not "www.spotify.com")
- DataForSEO credentials configured
- Domain has backlinks in DataForSEO index
High Spam Score
Investigate with referring domains list:
curl "https://your-worker.workers.dev/api/admin/referring-domains/list?target=example.com&order_by=backlinks"
Missing Platform Types
DataForSEO doesn't classify all referring domains. The organization, unknown, and cms buckets are catch-alls. Our classification pipeline handles these.
Summary
| Feature | Status | Notes |
|---|---|---|
| Domain Summaries Table | Complete | Weekly snapshots |
| Summary API Client | Complete | $0.02/request |
| Referring Domains API | Complete | Individual domain details |
| Admin Endpoints | Complete | Fetch, list, history |
| Domain Classification Pipeline | Complete | Phase 1-2, 7-stage pipeline |
| URL Classification Pipeline | Complete | Phase 1-2, 5-stage pipeline |
| Target URL Classification | Complete | Phase 3, rule-based (free) |
| Time-Series Charts | Planned | Frontend integration |
| OEPS Classification | Complete | Part of classification taxonomy |
| Brand-Level Aggregation | Planned | Phase 4 |