Backlink Intelligence System
Marketing DNA and backlink profile analysis using the DataForSEO Backlinks API.
Overview
The backlink intelligence system provides:
- Domain Summary Tracking - Weekly snapshots of backlink metrics ($0.02/request)
- Referring Domains - Detailed per-domain backlink data ($0.02 + $0.00003/row)
- Classification Pipeline - Domain/URL type classification (news, blog, affiliate, etc.)
- OEPS Classification - Owned/Earned/Paid/Shared media type determination
- Time-Series Analytics - Track backlink growth over time (like app rankings)
Architecture
┌───────────────────────────────────────────────────────────────┐
│                     Backlink Intelligence                     │
├───────────────────────────────────────────────────────────────┤
│  Weekly Cron                                                  │
│  └─> For each tracked domain:                                 │
│      └─> DataForSEO Summary ($0.02)                           │
│          └─> Store domain_summaries snapshot                  │
├───────────────────────────────────────────────────────────────┤
│  On-Demand Deep Pull                                          │
│  └─> DataForSEO Referring Domains ($0.02 + $0.00003/row)      │
│      └─> Store referring_domains (individual domain details)  │
│          └─> Queue classification jobs                        │
└───────────────────────────────────────────────────────────────┘
Data Model
Weekly Snapshots: domain_summaries
Denormalized weekly snapshots (like app_category_rankings).
CREATE TABLE domain_summaries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  target_domain TEXT NOT NULL,
  year_week INTEGER NOT NULL,  -- YYYYWW format (e.g., 202450)
  -- Core metrics from DataForSEO
  rank INTEGER,
  backlinks_count INTEGER,
  referring_domains_count INTEGER,
  spam_score REAL,
  -- Platform type counts (from DataForSEO signals)
  count_news INTEGER,
  count_blogs INTEGER,
  count_ecommerce INTEGER,
  count_forums INTEGER,
  count_social INTEGER,
  -- OEPS media type counts (our classification)
  count_owned INTEGER,
  count_earned INTEGER,
  count_paid INTEGER,
  count_shared INTEGER,
  -- Previous period tracking
  prev_period_backlinks INTEGER,
  prev_period_referring_domains INTEGER,
  prev_period_year_week INTEGER,
  UNIQUE(target_domain, year_week)
);
Key Points:
- One row per domain per week
- year_week is in YYYYWW format (e.g., 202450 = week 50 of 2024); see the sketch below
- Previous period columns enable delta calculations
- Platform type counts come from DataForSEO's referring_links_platform_types
- OEPS counts are populated by our classification pipeline
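A minimal sketch of deriving that year_week value, assuming ISO-8601 week numbering; yearWeekFor is a hypothetical helper shown for illustration, not necessarily how the production code computes it:

// Hypothetical helper: derive YYYYWW (e.g., 202450) from a Date.
// Assumes ISO-8601 week numbering; the production implementation may differ.
function yearWeekFor(date = new Date()) {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // Shift to the Thursday of the current ISO week (ISO weeks belong to
  // the year that contains their Thursday).
  const day = d.getUTCDay() || 7;           // Sunday -> 7
  d.setUTCDate(d.getUTCDate() + 4 - day);
  const yearStart = new Date(Date.UTC(d.getUTCFullYear(), 0, 1));
  const week = Math.ceil(((d - yearStart) / 86400000 + 1) / 7);
  return d.getUTCFullYear() * 100 + week;   // e.g., 2024 * 100 + 50 = 202450
}

// yearWeekFor(new Date("2024-12-11")) -> 202450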
Detailed Data: referring_domains
Per-referring-domain details for deep analysis.
CREATE TABLE referring_domains (
  target_domain TEXT NOT NULL,
  referring_domain TEXT NOT NULL,
  -- DataForSEO metrics
  rank INTEGER,
  backlinks_count INTEGER,
  spam_score REAL,
  -- Our classification (populated by classifier)
  domain_type TEXT,     -- news, blog, ecommerce, forum, etc.
  media_type TEXT,      -- owned, earned, paid, shared
  channel_bucket TEXT,  -- pr, affiliate, community, etc.
  -- Timestamps
  dfs_first_seen INTEGER,
  dfs_lost_date INTEGER,
  our_first_seen INTEGER,
  our_last_seen INTEGER,
  UNIQUE(target_domain, referring_domain)
);
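To illustrate how this table is meant to be queried, here is a sketch of an OEPS breakdown per target domain using the same env.DB (D1) binding that appears in the weekly tracking example later in this doc; the query is illustrative, not an existing endpoint:

// Illustrative D1 query: OEPS media-type breakdown for one target domain.
// Assumes the env.DB binding used elsewhere in this project.
async function mediaTypeBreakdown(targetDomain, env) {
  const { results } = await env.DB.prepare(
    `SELECT media_type, COUNT(*) AS domains, SUM(backlinks_count) AS backlinks
       FROM referring_domains
      WHERE target_domain = ?1 AND media_type IS NOT NULL
      GROUP BY media_type
      ORDER BY domains DESC`
  ).bind(targetDomain).all();
  return results; // e.g., [{ media_type: "earned", domains: 1234, backlinks: 56789 }, ...]
}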
API Endpoints
Fetch Domain Summary (Weekly Snapshot)
Endpoint: POST /api/admin/referring-domains/summary
Cost: $0.02 per request
curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/summary \
-H "Content-Type: application/json" \
-d '{"target": "spotify.com"}'
Response:
{
  "success": true,
  "target": "spotify.com",
  "year_week": 202450,
  "inserted": true,
  "cost": 0.02,
  "summary": {
    "backlinks": 45678901,
    "referring_domains": 234567,
    "rank": 89,
    "spam_score": 12.5,
    "platform_types": {
      "blogs": 45000,
      "news": 12000,
      "ecommerce": 5000,
      "message-boards": 3000,
      "organization": 150000,
      "unknown": 19567
    }
  }
}
Get Latest Summary
Endpoint: GET /api/admin/referring-domains/summary?target=spotify.com
{
  "target": "spotify.com",
  "current_week": 202450,
  "summary": {
    "year_week": 202450,
    "backlinks_count": 45678901,
    "referring_domains_count": 234567,
    "rank": 89,
    "spam_score": 12.5,
    "links_platform_types": {...},
    "prev_period_backlinks": 45000000,
    "prev_period_referring_domains": 230000
  },
  "deltas": {
    "backlinks_change": 678901,
    "backlinks_change_pct": "1.51",
    "referring_domains_change": 4567,
    "rank_change": -2
  }
}
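The deltas follow directly from the prev_period_* columns; a small sketch of the implied math (computeDeltas is a hypothetical helper, and rank_change additionally needs the prior snapshot's rank):

// Sketch of the delta math implied by the response above; computeDeltas is
// a hypothetical helper, not necessarily the endpoint's implementation.
// (rank_change would also need the previous snapshot's rank, omitted here.)
function computeDeltas(summary) {
  const backlinksChange = summary.backlinks_count - summary.prev_period_backlinks;
  return {
    backlinks_change: backlinksChange,
    backlinks_change_pct: ((backlinksChange / summary.prev_period_backlinks) * 100).toFixed(2),
    referring_domains_change:
      summary.referring_domains_count - summary.prev_period_referring_domains,
  };
}

// computeDeltas({ backlinks_count: 45678901, prev_period_backlinks: 45000000,
//                 referring_domains_count: 234567, prev_period_referring_domains: 230000 })
// -> { backlinks_change: 678901, backlinks_change_pct: "1.51", referring_domains_change: 4567 }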
Get Summary History
Endpoint: GET /api/admin/referring-domains/summary-history?target=spotify.com&limit=52
Returns up to 52 weeks of historical snapshots for trend analysis.
Fetch Referring Domains (Deep Pull)
Endpoint: POST /api/admin/referring-domains/fetch
Cost: $0.02 base + $0.00003 per row
curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/fetch \
-H "Content-Type: application/json" \
-d '{
"target": "spotify.com",
"limit": 1000,
"offset": 0
}'
List Referring Domains
Endpoint: GET /api/admin/referring-domains/list?target=spotify.com&limit=100
Debug Raw API Response
Endpoint: POST /api/admin/referring-domains/debug
curl -X POST https://your-worker.workers.dev/api/admin/referring-domains/debug \
-H "Content-Type: application/json" \
-d '{"target": "spotify.com", "endpoint": "summary"}'
Classification Pipeline
DataForSEO Platform Types (Input Signals)
DataForSEO provides referring_links_platform_types:
| Type | Quality | Notes |
|---|---|---|
| news | Good | News/media sites |
| blogs | Good | Blog platforms |
| ecommerce | Good | E-commerce sites |
| message-boards | Good | Forums |
| social | Good | Social platforms |
| wikis | Good | Wiki/reference |
| educational | Good | .edu sites |
| governmental | Good | .gov sites |
| directory | Medium | Web directories |
| organization | Poor | Catch-all bucket |
| unknown | Poor | Unclassified |
| cms | Poor | Generic CMS sites |
Our Classification Layer
We use DataForSEO signals as INPUT, then apply our own classification:
      DataForSEO platform_types + Google Ads Categories
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                         FREE STAGES                          │
├─────────────────────────────────────────────────────────────┤
│  Rules Engine ← TLD checks, URL patterns, known domains      │
│  Google Ads ← Cached category data → tier1_type hint         │
│  Vectorize ← Semantic similarity to known examples           │
│  Low-Noise Crawl ← HEAD + 8KB GET, CMS/og:type detection     │
└────────┬────────────────────────────────────────────────────┘
         │ (handles ~85%)
         ▼
┌─────────────────────────────────────────────────────────────┐
│                   PAID STAGES (if needed)                    │
├─────────────────────────────────────────────────────────────┤
│  Instant Pages ← $0.000125 - Full page fetch                 │
│  LLM Fallback ← ~$0.0001 - Workers AI for ambiguous          │
└────────┬────────────────────────────────────────────────────┘
         │ (handles ~15%)
         ▼
  Final Classification (property_type + tier1_type)
Domain Types (Our Taxonomy)
| Type | Description |
|---|---|
| news | News/media publishers |
| blog | Personal/company blogs |
| ecommerce | Online stores |
| forum | Discussion forums |
| social | Social platforms |
| wiki | Reference/wiki sites |
| edu | Educational institutions |
| gov | Government sites |
| affiliate | Affiliate/coupon sites |
| directory | Web directories |
| saas | SaaS products |
| agency | Marketing/PR agencies |
| other | Unclassified |
OEPS Media Types
| Type | Description | Examples |
|---|---|---|
| owned | Customer's own properties | Company blog, product pages |
| earned | Editorial coverage | News articles, reviews |
| paid | Sponsored content | Paid placements, ads |
| shared | User-generated | Forum posts, social mentions |
Cost Optimization
Summary vs Referring Domains
| Endpoint | Cost | Use Case |
|---|---|---|
| Summary | $0.02 fixed | Weekly tracking, totals |
| Referring Domains | $0.02 + $0.00003/row | Deep analysis, classification |
Strategy:
- Use Summary for weekly snapshots (cheap, gives totals + platform_types distribution)
- Use Referring Domains only when you need individual domain details
- For large domains (100k+ referring domains), paginate with offset
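For the pagination case, a rough client-side sketch against the fetch endpoint documented above; workerUrl, the page-size defaults, and the rows_fetched stop condition are placeholders/assumptions rather than guaranteed response fields:

// Illustrative pagination loop over the deep-pull endpoint documented above.
// workerUrl is a placeholder; limit/offset semantics are as described in this doc.
async function fetchAllReferringDomains(workerUrl, target, { pageSize = 1000, maxPages = 10 } = {}) {
  for (let page = 0; page < maxPages; page++) {
    const res = await fetch(`${workerUrl}/api/admin/referring-domains/fetch`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ target, limit: pageSize, offset: page * pageSize }),
    });
    const data = await res.json();
    // Stop once a page comes back short (assumed convention; adjust to the real response shape).
    if (!data.success || (data.rows_fetched ?? 0) < pageSize) break;
  }
}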
Bulk Endpoints (Coming Soon)
DataForSEO offers bulk endpoints for cost optimization:
- bulk_referring_domains - Up to 1000 targets per request
- bulk_ranks - Quick rank checks
Weekly Tracking Flow
// Cron trigger (weekly)
async function weeklyBacklinkTracking(env) {
  // Get tracked domains
  const domains = await env.DB.prepare(
    "SELECT DISTINCT target_domain FROM domain_summaries"
  ).all();

  for (const { target_domain } of domains.results) {
    // Fetch and store weekly snapshot
    await fetchAndStoreDomainSummary(target_domain, {}, env);
  }
}
Integration with Top Domains
The existing top-domains endpoint pulls from category_domain_metrics (organic SEO data).
Future Enhancement: Add backlink metrics to domain profiles:
-- Join domain_summaries with category_domain_metrics
SELECT
  cdm.domain,
  cdm.organic_etv,
  ds.backlinks_count,
  ds.referring_domains_count,
  ds.spam_score
FROM category_domain_metrics cdm
LEFT JOIN domain_summaries ds ON cdm.domain = ds.target_domain
WHERE ds.year_week = (SELECT MAX(year_week) FROM domain_summaries WHERE target_domain = cdm.domain)
Classification System (Implemented)
The backlink classification system uses a two-tier approach:
- Domain Classification - Classify the domain once, cache it, reuse for all URLs
- URL Classification - Classify individual URLs, inheriting domain-level attributes
Domain Classification Pipeline
File: src/lib/domain-classifier.js
7-stage cost-optimized pipeline for classifying domains (FREE stages first, PAID only when needed):
flowchart LR
    subgraph FREE["FREE Stages"]
        S0[Cache] --> S1[Rules]
        S1 --> S1_5[Google Ads<br/>Categories]
        S1_5 --> S2[Vectorize]
        S2 --> S3[Low-Noise<br/>Crawl]
    end
    subgraph PAID["PAID (only if needed)"]
        S3 --> S4[Instant Pages]
        S4 --> S4_5[Domain Patterns]
        S4_5 --> S5[LLM]
    end
    S3 -->|"≥70%"| DONE[Done]
    S4 -->|"≥70%"| DONE
    S5 --> DONE

    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2
| Stage | Name | Cost | Description |
|---|---|---|---|
| 0 | Cache | FREE | Check if domain already classified in D1 |
| 1 | Rules | FREE | Known domains, TLDs (.gov, .edu), subdomain services, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data to derive tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to known classified domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), extract <head> metadata, CMS detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch (only if low-noise insufficient) |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder/blocked pages |
| 5 | LLM | ~$0.0001 | Workers AI fallback for uncertain cases |
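Conceptually, the pipeline is an ordered list of stages with an early exit at the ≥70% confidence threshold shown in the flowchart. A simplified sketch (the stage functions are supplied by the caller; this is not the actual domain-classifier.js code):

// Conceptual sketch of the stage ordering and early exit; `stages` is an
// array of { name, run } objects supplied by the caller, standing in for
// the real stage implementations in domain-classifier.js.
const CONFIDENCE_THRESHOLD = 70; // matches the ">=70%" exit in the flowchart

async function runClassificationPipeline(domain, env, stages) {
  for (const { name, run } of stages) {
    const result = await run(domain, env); // each stage returns null or { ...fields, confidence }
    if (result && result.confidence >= CONFIDENCE_THRESHOLD) {
      return { ...result, classification_source: name };
    }
  }
  // Nothing confident enough: fall back to a low-confidence default.
  return { property_type: "other", classification_source: "none", confidence: 0 };
}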
Low-Noise Crawl (Stage 3) - The key cost optimization:
- Uses HEAD request + partial GET with Range header (first 8KB only)
- Never executes JavaScript (avoids bot detection)
- Extracts: title, description, canonical, robots, og:*, generator
- Detects CMS from generator meta tag (WordPress, Shopify, Ghost, etc.)
- Handles ~70% of domains without needing Instant Pages
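A rough approximation of that fetch pattern (not the actual low-noise-crawler.js code), using only a HEAD request plus a ranged GET with standard fetch:

// Approximation of the low-noise fetch described above: HEAD first, then a
// ranged GET for the first 8KB. Servers that ignore Range simply return the
// full body; no JavaScript is ever executed.
async function lowNoiseFetch(domain) {
  const url = `https://${domain}/`;
  const head = await fetch(url, { method: "HEAD", redirect: "follow" });

  const partial = await fetch(url, {
    headers: { Range: "bytes=0-8191" }, // first 8KB only
    redirect: "follow",
  });
  const html = await partial.text();

  // Pull a few <head> signals with simple regexes (title, generator meta).
  const title = (html.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1] || null;
  const generator = (html.match(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']+)/i) || [])[1] || null;

  return {
    status: head.status,
    contentType: head.headers.get("content-type"),
    title,
    generator, // e.g., "WordPress 6.x", "Shopify", "Ghost 5.x"
  };
}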
Self-Learning: High-confidence LLM results are stored back to Vectorize for future queries.
URL Classification Pipeline
File: src/lib/backlink-classifier.js
5-stage pipeline for classifying URLs:
| Stage | Name | Cost | Description |
|---|---|---|---|
| 0 | Domain Cache | Free | Check cached domain classification |
| 1 | Rules | Free | URL patterns, TLDs, known sites |
| 2 | Vectorize | ~$0.00001 | Similarity to labeled examples |
| 3 | Content Parse | $0.000125 | DataForSEO Instant Pages API |
| 4 | LLM | ~$0.0001 | Workers AI fallback |
Key Optimization: If domain is already classified, URL classifier skips LLM for domain-level attributes (~90% cost reduction).
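A condensed, hypothetical view of that short-circuit (the real logic lives in backlink-classifier.js; the column names come from the domains table described below):

// Hypothetical condensation of the domain-cache short-circuit: if the source
// domain is already classified, reuse its domain-level attributes and only
// run the cheap URL-level stages for page-specific fields.
async function classifyUrlSketch(url, env, { classifyDomain, classifyPageOnly }) {
  const domain = new URL(url).hostname;

  // Reuse the cached domain-level classification when we have one.
  const cached = await env.DB.prepare(
    "SELECT property_type, channel, media_type FROM domains WHERE domain = ?1"
  ).bind(domain).first();

  const domainAttrs = cached ?? await classifyDomain(domain, env); // full pipeline only on a miss
  const pageAttrs = await classifyPageOnly(url, env);              // cheap URL-level stages (rules/vectorize)
  return { ...domainAttrs, ...pageAttrs };
}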
What Gets Classified
When processing backlinks:
| Message Type | What's Classified | Description |
|---|---|---|
| classify_referring_domain | Source domain | The domain that has the backlink TO you |
| classify_url | Source URL | The specific page with the backlink |
Note: The target_domain parameter is YOUR domain (customer's site) - used only for "owned" detection.
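A minimal sketch of the owned check implied by that note; the real classifier may also consult a list of known owned properties:

// Illustrative "owned" check: a backlink is treated as owned media when the
// source domain is the customer's own domain (or a subdomain of it).
function isOwned(sourceDomain, targetDomain) {
  const src = sourceDomain.toLowerCase().replace(/^www\./, "");
  const tgt = targetDomain.toLowerCase().replace(/^www\./, "");
  return src === tgt || src.endsWith(`.${tgt}`);
}

// isOwned("blog.spotify.com", "spotify.com") -> true   (owned)
// isOwned("techcrunch.com", "spotify.com")   -> false  (earned/paid/shared per later stages)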
New Taxonomy
Property Types (replaces domain_type) - ~45 types:
saas_product, ecommerce_store, news_publisher, blog_content_site, forum_community_board, ugc_platform, service_business, government, education, nonprofit_organization, and more...
Channels (8 high-level buckets):
search, social_networks, ugc_communities, news_media, pr_distribution, directories_listings, affiliate_partner, risky_gray
Tactic Categories (10 parent buckets):
pr, haro, link_building, affiliate, ugc, owned, programmatic, influencer, marketplace, blackhat
Page Type Categories (7 parent buckets):
editorial, commercial, ugc, programmatic, utility, asset, risky
Media Types (PESO model):
paid,earned,shared,owned
Domain Classification API
Classify Single Domain (sync):
curl -X POST https://your-worker.workers.dev/api/admin/classifier/domain \
-H "Content-Type: application/json" \
-d '{"domain": "atlassian.com"}'
Classify Single Domain (async via queue):
curl -X POST https://your-worker.workers.dev/api/admin/classifier/domain \
-H "Content-Type: application/json" \
-d '{"domain": "atlassian.com", "async": true}'
Classify Multiple Domains (async):
curl -X POST https://your-worker.workers.dev/api/admin/classifier/domains \
-H "Content-Type: application/json" \
-d '{"domains": ["atlassian.com", "hubspot.com", "salesforce.com"]}'
Get Cached Domain Classification:
curl https://your-worker.workers.dev/api/admin/classifier/domain/atlassian.com
Get Classification Stats:
curl https://your-worker.workers.dev/api/admin/classifier/domain-stats
Queue Configuration
| Queue | Binding | Purpose |
|---|---|---|
| backlink-classify | BACKLINK_CLASSIFY_QUEUE | URL classification |
| domain-classify | DOMAIN_CLASSIFY_QUEUE | Domain classification |
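A short sketch of producing work onto these queues from a Worker. Queue send()/sendBatch() are the standard Cloudflare Queues producer APIs, but the exact message shape used for domain-classify here is an assumption; the documented shapes for the backlink-classify queue appear under Phase 3 below.

// Sketch of enqueueing domain-classification work with the bindings above.
// The message body shape is assumed for illustration.
async function enqueueDomainClassification(domains, env) {
  await env.DOMAIN_CLASSIFY_QUEUE.sendBatch(
    domains.map((domain) => ({ body: { type: "classify_referring_domain", domain } }))
  );
}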
Database Tables
domains table (new columns):
property_type TEXT,
channel TEXT,
subchannel TEXT,
media_type TEXT,
domain_tech_type TEXT,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
last_classified_at INTEGER
domain_classifications (audit table):
id INTEGER PRIMARY KEY,
domain TEXT,
property_type TEXT,
channel TEXT,
subchannel TEXT,
media_type TEXT,
quality_tier TEXT,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
llm_reasoning TEXT,
created_at INTEGER
Files
| File | Purpose |
|---|---|
| src/lib/classification-taxonomy.js | Taxonomy constants and helpers |
| src/lib/domain-classifier.js | Domain classification pipeline (7 stages) |
| src/lib/low-noise-crawler.js | FREE crawler using HEAD + partial GET (Stage 3) |
| src/lib/backlink-classifier.js | URL classification pipeline |
| src/lib/classifier-rules-engine.js | Rules-based URL classification |
| src/queue/domain-classify-consumer.js | Domain classification queue consumer |
| src/queue/backlink-classify-consumer.js | URL classification queue consumer |
| src/endpoints/admin-classifier.js | API endpoints |
Phases
Phase 3: Target URL Classification (Implemented)
Classifies the customer's pages that receive backlinks (the "target" URLs).
File: src/lib/target-url-classifier.js
Key difference: Target URL classification is 100% rule-based and FREE (no LLM/API costs). Since these are the customer's own pages, we don't need expensive external classification - URL patterns are sufficient.
Target Page Types (~40 types)
| Type | Description |
|---|---|
| homepage | Main site homepage |
| product_page | Product detail page |
| pricing_page | Pricing/plans page |
| blog_post | Blog article |
| case_study | Customer case study |
| documentation_page | Docs/guides |
| landing_page | Marketing landing page |
| signup_page | Registration page |
| app_page | Mobile app landing |
| integrations_page | Integrations directory |
| ... | And more... |
Target Page Categories
| Category | Description | Examples |
|---|---|---|
| commercial | Revenue-driving pages | Homepage, pricing, product |
| editorial | Content pages | Blog, news, case studies |
| resource | Support/help content | Docs, FAQ, guides |
| documentation | Technical docs | API docs, tutorials |
| utility | Functional pages | Login, signup, legal |
Money Pages
High-value pages that drive conversions are flagged as "money pages":
- Homepage
- Pricing page
- Product pages
- Demo/trial pages
- Signup/registration pages
- Landing pages
- Enterprise pages
API & Message Types
Message Types (backlink-classify queue):
// Classify both source AND target URLs for a backlink
{ type: "classify_backlink", backlink_id, source_url, source_domain, target_url, target_domain, domain_rank }
// Classify just the target URL
{ type: "classify_target_url", backlink_id, target_url, target_domain }
Database Columns (backlinks table):
tgt_page_type TEXT, -- homepage, product_page, blog_post, etc.
tgt_page_category TEXT, -- commercial, editorial, resource, etc.
tgt_url_pattern TEXT, -- The pattern that matched
tgt_is_money_page INTEGER, -- 1 if high-value conversion page
tgt_classification_source TEXT, -- Always 'rules' (rule-based)
tgt_classification_confidence INTEGER
Database Columns (urls table):
is_money_page INTEGER DEFAULT 0,
page_category TEXT,
url_pattern TEXT
Usage
import { classifyTargetUrl } from '../lib/target-url-classifier.js';
const result = classifyTargetUrl('https://spotify.com/premium');
// Returns:
// {
// page_type: 'product_page',
// page_category: 'commercial',
// is_money_page: true,
// url_pattern: '/premium',
// classification_source: 'rules',
// classification_confidence: 90
// }
Phase 4: Brand-Level Aggregation (Not Yet Implemented)
- Roll up backlink data by brand
- Cross-domain brand profiles
- Marketing DNA reports
Troubleshooting
No Data Returned
Check:
- Domain format (use root domain, e.g., "spotify.com" not "www.spotify.com")
- DataForSEO credentials configured
- Domain has backlinks in DataForSEO index
High Spam Score
Investigate with referring domains list:
curl "https://your-worker.workers.dev/api/admin/referring-domains/list?target=example.com&order_by=backlinks"
Missing Platform Types
DataForSEO doesn't classify all referring domains. The organization, unknown, and cms buckets are catch-alls. Our classification pipeline handles these.
Summary
| Feature | Status | Notes |
|---|---|---|
| Domain Summaries Table | Complete | Weekly snapshots |
| Summary API Client | Complete | $0.02/request |
| Referring Domains API | Complete | Individual domain details |
| Admin Endpoints | Complete | Fetch, list, history |
| Domain Classification Pipeline | Complete | Phase 1-2, 7-stage pipeline |
| URL Classification Pipeline | Complete | Phase 1-2, 5-stage pipeline |
| Target URL Classification | Complete | Phase 3, rule-based (free) |
| Time-Series Charts | Planned | Frontend integration |
| OEPS Classification | Complete | Part of classification taxonomy |
| Brand-Level Aggregation | Planned | Phase 4 |