Domain Onboarding Flow
Mermaid Diagram
flowchart TD
subgraph AppCrawl["App Store Crawl (Entry Point)"]
APP[Crawl App Store] --> SCRAPE[HTML Scrape<br/>FREE - Desktop Safari UA]
SCRAPE --> ITUNES[iTunes API<br/>FREE - authoritative metadata]
ITUNES --> APP_DETAILS[app-details-consumer]
APP_DETAILS --> BRAND_CREATE[ensureBrand<br/>name from developer_name]
APP_DETAILS --> DOMAIN_CREATE[ensureDomain<br/>from developer_url]
APP_DETAILS --> ICON[Upload icon<br/>to CF Images]
APP_DETAILS --> SIMILAR{crawl_depth<br/>> 0?}
SIMILAR -->|Yes| QUEUE_SIMILAR[Queue similar apps<br/>with crawl_depth=0]
SIMILAR -->|No| SKIP_SIMILAR[Skip similar apps]
end
subgraph DomainTrigger["Domain Onboarding Trigger"]
DOMAIN_CREATE --> TRIGGER{triggerDomainOnboarding<br/>called?}
TRIGGER -->|Yes| SET_STATUS[Set onboard_status=pending]
TRIGGER -->|No| MANUAL[Wait for manual trigger]
SET_STATUS --> QUEUE_JOBS[Queue 3 jobs to<br/>domain-onboard queue]
end
subgraph DomainOnboard["domain-onboard-consumer (3 Parallel Jobs)"]
QUEUE_JOBS --> JOB1[fetch_keywords<br/>$0.03 - limit 100]
QUEUE_JOBS --> JOB2[fetch_summary<br/>$0.02]
QUEUE_JOBS --> JOB3[fetch_backlinks<br/>$0.04 - limit 50]
JOB1 --> KW_STORE[(domain_keyword_rankings)]
JOB1 --> BRAND_UPDATE[Update brand.name<br/>from website_name<br/>if placeholder]
JOB2 --> SUMMARY_STORE[(domain_summaries<br/>domain_rank, backlink_count)]
JOB3 --> BL_STORE[(backlinks)]
JOB3 --> BL_QUEUE[Queue URLs to<br/>backlink-classify]
KW_STORE --> CHECK_COMPLETE{All 3 jobs<br/>complete?}
SUMMARY_STORE --> CHECK_COMPLETE
BL_STORE --> CHECK_COMPLETE
CHECK_COMPLETE -->|Yes| COMPLETE[onboard_status=complete]
CHECK_COMPLETE -->|No| WAIT[Wait for other jobs]
end
subgraph BacklinkClassify["backlink-classify-consumer"]
BL_QUEUE --> URL_CLASS[Classify each URL<br/>Rules → Vectorize → LLM]
URL_CLASS --> URL_STORE[Update backlinks:<br/>page_type, tactic_type,<br/>quality_tier]
end
subgraph Classification["Domain Classification (Separate API Call)"]
CLASSIFY_API[POST /admin/classifier/domain] --> CLASSIFIER[classifyDomain]
CLASSIFIER --> C0{Cache hit?}
C0 -->|Yes| C_DONE[Return cached]
C0 -->|No| C1[Stage 1: Rules<br/>FREE]
C1 --> C1_5[Stage 1.5: Google Ads Categories<br/>FREE]
C1_5 --> C2[Stage 2: Vectorize<br/>FREE]
C2 --> C3[Stage 3: Low-Noise Crawl<br/>FREE]
C3 --> C3_CONF{≥70%?}
C3_CONF -->|Yes| C_DONE
C3_CONF -->|No| C4[Stage 4: Instant Pages<br/>$0.000125]
C4 --> C4_5[Stage 4.5: Domain Patterns<br/>FREE]
C4_5 --> C5{Still uncertain?}
C5 -->|Yes| C6[Stage 5: LLM<br/>~$0.0001]
C5 -->|No| C_DONE
C6 --> C_DONE[Store & Learn]
end
subgraph OtherTriggers["Other Domain Entry Points"]
A2[Backlink Discovery] --> DOMAIN_CREATE2[ensureDomain]
A3[Manual API Add] --> DOMAIN_CREATE2
A4[SERP Crawl] --> DOMAIN_CREATE2
DOMAIN_CREATE2 --> TRIGGER
end
style SCRAPE fill:#c8e6c9
style ITUNES fill:#c8e6c9
style JOB1 fill:#fff3e0
style JOB2 fill:#fff3e0
style JOB3 fill:#fff3e0
style C3 fill:#c8e6c9
style C4 fill:#fff3e0
style C6 fill:#ffcdd2
style COMPLETE fill:#c8e6c9
Key Changes from Previous Version:
- App scraping: HTML scrape is FREE and runs FIRST, iTunes API second
- Domain onboard queues 3 PARALLEL jobs, not sequential
- Classification is a SEPARATE API call, not automatic
- Similar apps use crawl_depth=0 to prevent infinite recursion
- Limits reduced: 100 keywords, 50 backlinks (D1 rate limits)
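The parallel-job behaviour can be sketched as follows. This is a minimal illustration, not the consumer's actual code: the job names come from the diagram, while `buildOnboardJobs`, `statusAfter` and the message shape are hypothetical.

```typescript
// Job names are taken from the diagram; all other names here are illustrative.
type JobType = "fetch_keywords" | "fetch_summary" | "fetch_backlinks";

const ONBOARD_JOBS: readonly JobType[] = [
  "fetch_keywords",
  "fetch_summary",
  "fetch_backlinks",
];

interface OnboardJob {
  domain: string;
  type: JobType;
}

type OnboardStatus = "pending" | "complete";

// Build the three independent jobs that are queued when onboarding is triggered.
function buildOnboardJobs(domain: string): OnboardJob[] {
  return ONBOARD_JOBS.map((type) => ({ domain, type }));
}

// Each job records itself as finished; the status flips only once all three have.
function statusAfter(finished: ReadonlySet<JobType>): OnboardStatus {
  return ONBOARD_JOBS.every((job) => finished.has(job)) ? "complete" : "pending";
}
```

Because the jobs are independent, whichever one finishes last is the one that observes all three as done and sets onboard_status=complete.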
Domain Classification Pipeline
The classifier uses a cost-optimized pipeline - FREE stages run first, PAID stages only when needed:
flowchart LR
subgraph FREE["FREE Stages"]
S0[Cache Check] --> S1[Rules Engine]
S1 --> S1_5[Google Ads<br/>Categories]
S1_5 --> S2[Vectorize]
S2 --> S3[Low-Noise Crawl]
end
subgraph PAID["PAID Stages (only if needed)"]
S3 --> S4[Instant Pages<br/>$0.000125]
S4 --> S5[LLM Fallback<br/>~$0.0001]
end
S3 -->|"≥70% confidence"| DONE[Done]
S4 -->|"≥70% confidence"| DONE
S5 --> DONE
style S0 fill:#e8f5e9
style S1 fill:#e8f5e9
style S1_5 fill:#e8f5e9
style S2 fill:#e8f5e9
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2
Stage Details
| Stage | Name | Cost | What It Does |
|---|---|---|---|
| 0 | Cache | FREE | Check if domain already classified in D1 |
| 1 | Rules | FREE | Known domains, TLDs (.gov/.edu), subdomain services, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data to derive tier1_type |
| 2 | Vectorize | FREE | Semantic similarity to known classified domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), extract <head> metadata |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch (only if low-noise insufficient) |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
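A rough sketch of the cost-ordered loop behind this table. It assumes any stage may stop the pipeline once confidence reaches 70% (the diagram shows that check explicitly only after Stages 3 and 4), and it runs stages synchronously for brevity. `Stage`, `classifyWithStages` and the result shape are hypothetical names, not the real classifyDomain internals.

```typescript
interface StageResult {
  tier1Type: string;
  confidence: number; // 0..1
}

interface Stage {
  name: string;
  costUsd: number;
  run: (domain: string) => StageResult | null; // null means "no opinion"
}

interface PipelineOutcome {
  result: StageResult | null;
  decidedBy: string | null;
  spentUsd: number;
}

const CONFIDENCE_THRESHOLD = 0.7;

// Run stages cheapest-first and stop as soon as one is confident enough,
// so paid stages are only reached when the free ones fall short.
function classifyWithStages(domain: string, stages: Stage[]): PipelineOutcome {
  let best: StageResult | null = null;
  let decidedBy: string | null = null;
  let spentUsd = 0;
  for (const stage of stages) {
    spentUsd += stage.costUsd;
    const result = stage.run(domain);
    if (result && (best === null || result.confidence > best.confidence)) {
      best = result;
      decidedBy = stage.name;
    }
    if (best !== null && best.confidence >= CONFIDENCE_THRESHOLD) break;
  }
  return { result: best, decidedBy, spentUsd };
}
```

Ordering the `stages` array by cost is what keeps the common case free: a domain resolved by rules or the low-noise crawl never touches Instant Pages or the LLM.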
Low-Noise Crawl (Stage 3)
The low-noise crawler is a FREE alternative to DataForSEO Instant Pages:
Phase 1: DNS Resolution
└─> Check if domain resolves (root vs www)
└─> Determine canonical host
Phase 2: HEAD Request
└─> Follow redirects
└─> Capture server headers (content-type, x-powered-by)
Phase 3: Partial GET (Range: 0-8KB)
└─> Extract <head> section only
└─> Parse: title, description, canonical, robots
└─> Parse: og:type, og:site_name, generator
└─> Detect CMS from generator meta tag
Detection Capabilities:
- CMS detection: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow
- og:type mapping: product → ecommerce, article → blog, music/video → streaming
- Parked domain detection: "domain for sale", "coming soon" patterns
- Content signals: SaaS keywords, ecommerce keywords, news patterns
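As an illustration of Phase 3 and the detection list, here is a small sketch that pulls signals out of a partial HTML response. It is regex-based for brevity and assumes the name/property attribute precedes content in each meta tag; a real crawler would likely need a more tolerant parser. All function names are hypothetical.

```typescript
interface HeadSignals {
  title: string | null;
  generator: string | null;
  ogType: string | null;
}

// Read one <meta> tag's content by its name/property attribute.
function metaContent(head: string, key: string): string | null {
  const pattern = new RegExp(
    `<meta[^>]*(?:name|property)=["']${key}["'][^>]*content=(["'])(.*?)\\1`,
    "i",
  );
  return pattern.exec(head)?.[2] ?? null;
}

function extractHeadSignals(partialHtml: string): HeadSignals {
  // Only <head> is inspected; a truncated 8KB response may not contain </head>.
  const end = partialHtml.search(/<\/head>/i);
  const head = end === -1 ? partialHtml : partialHtml.slice(0, end);
  const title = /<title[^>]*>([^<]*)<\/title>/i.exec(head);
  return {
    title: title?.[1]?.trim() ?? null,
    generator: metaContent(head, "generator"),
    ogType: metaContent(head, "og:type"),
  };
}

// og:type mapping from the list above; music.* and video.* share one bucket.
function typeFromOgType(ogType: string | null): string | null {
  if (ogType === null) return null;
  const t = ogType.toLowerCase();
  if (t === "product") return "ecommerce";
  if (t === "article") return "blog";
  if (t.startsWith("music") || t.startsWith("video")) return "streaming";
  return null;
}

const KNOWN_CMS = ["WordPress", "Shopify", "Ghost", "Hugo", "Jekyll", "Wix", "Squarespace", "Webflow"];

// Match the generator string against the CMS names listed above.
function cmsFromGenerator(generator: string | null): string | null {
  if (generator === null) return null;
  const g = generator.toLowerCase();
  return KNOWN_CMS.find((cms) => g.includes(cms.toLowerCase())) ?? null;
}
```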
Why It's Better Than Instant Pages First:
- FREE - No API costs
- Fast - Only fetches 8KB vs full page
- Low Detection - HEAD + Range header looks like a browser prefetch
- Sufficient for 70%+ domains - CMS/og:type/generator handles most cases
Brand Name Priority Logic
First real name wins. DataForSEO only fills gaps.
| Current brand.name | website_name from DFS | Action |
|---|---|---|
| NULL | "Spotify AB" | Update to "Spotify AB" |
| "spotify" (domain fallback) | "Spotify AB" | Update to "Spotify AB" |
| "Spotify" (from developer_name) | "Spotify AB" | Keep "Spotify" |
| "Spotify" | NULL | Keep "Spotify" |
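// Brand update logic: only overwrite names that are missing or derived from the domain
const isPlaceholderName = !brand.name ||
  brand.name === domain ||
  brand.name === domain.replace(/\.(com|io|co|app)$/, '');
// DataForSEO's website_name only fills gaps; a real name is never replaced
if (websiteName && isPlaceholderName) {
  await updateBrandName(brand.id, websiteName);
}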
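The four table rows can be checked against the placeholder rule with a small pure function. This is a sketch for illustration: `isPlaceholder` and `resolveBrandName` are hypothetical helpers that mirror the check above without touching the database, and the domain is assumed to be spotify.com.

```typescript
// Same placeholder rule as above, expressed as a pure function.
function isPlaceholder(name: string | null, domain: string): boolean {
  return !name ||
    name === domain ||
    name === domain.replace(/\.(com|io|co|app)$/, "");
}

// Returns the name the brand should end up with after a DataForSEO response.
function resolveBrandName(
  current: string | null,
  websiteName: string | null,
  domain: string,
): string | null {
  return websiteName && isPlaceholder(current, domain) ? websiteName : current;
}

// The four rows of the table, in order.
const rows: Array<[string | null, string | null]> = [
  [null, "Spotify AB"],
  ["spotify", "Spotify AB"],
  ["Spotify", "Spotify AB"],
  ["Spotify", null],
];
const resolved = rows.map(([current, dfs]) => resolveBrandName(current, dfs, "spotify.com"));
```

Note that the regex only strips the four listed TLDs, so a fallback name derived from, say, a .net domain would not be recognized as a placeholder by this rule.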
API Costs Per Domain
| Step | Endpoint | Cost | Data Retrieved |
|---|---|---|---|
| Ranked Keywords | dataforseo_labs/google/ranked_keywords/live | $0.03 | Keywords domain ranks for, search intent, SERP features |
| Domain Summary | backlinks/summary/live | $0.02 | Total backlinks, referring domains, spam score, DR |
| Referring Domains | backlinks/referring_domains/live | $0.04 | List of linking domains with metrics |
| Backlinks (per domain) | backlinks/backlinks/live | $0.04 | Individual backlink URLs and anchor text |
Minimum onboard (classification only): $0.05/domain (ranked keywords + summary)
Full onboard (with backlinks): $0.09+ depending on referring domain count
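The arithmetic behind these totals, as a small estimator. Prices are the ones in the table, kept in whole cents to avoid floating-point drift; the step keys reuse the onboard endpoint's option flags, and `estimateCostCents` and its `backlinkBatches` parameter are hypothetical.

```typescript
// Prices in cents per call, from the cost table.
const STEP_COST_CENTS = {
  fetch_ranked_keywords: 3,
  fetch_domain_summary: 2,
  fetch_referring_domains: 4,
  fetch_backlinks: 4, // charged once per batch / referring domain
} as const;

type PaidStep = keyof typeof STEP_COST_CENTS;

// Sum the cost of the enabled steps; backlinks scale with the number of batches.
function estimateCostCents(
  enabled: Partial<Record<PaidStep, boolean>>,
  backlinkBatches = 1,
): number {
  let total = 0;
  for (const step of Object.keys(STEP_COST_CENTS) as PaidStep[]) {
    if (!enabled[step]) continue;
    const multiplier = step === "fetch_backlinks" ? backlinkBatches : 1;
    total += STEP_COST_CENTS[step] * multiplier;
  }
  return total;
}

const minimumCents = estimateCostCents({
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
}); // 3 + 2 = 5 cents, i.e. $0.05

const fullBaseCents = estimateCostCents({
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  fetch_referring_domains: true,
}); // 3 + 2 + 4 = 9 cents, i.e. $0.09 before any per-domain backlink fetches
```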
Control Parameters
Endpoint: POST /api/admin/domains/onboard
{
"domain": "example.com",
"options": {
"fetch_ranked_keywords": true, // $0.03 - needed for classification
"fetch_domain_summary": true, // $0.02 - backlink aggregate stats
"fetch_referring_domains": false, // $0.04 - list of linking domains
"fetch_backlinks": false, // $0.04/batch - individual backlinks
"classify_domain": true, // FREE - rules/vectorize/LLM
"classify_backlinks": false, // FREE - but slow, queue-based
"create_brand": true // FREE - extract from website_name
}
}
Batch/Cron: POST /api/admin/domains/enrich-pending
{
"limit": 100,
"filter": {
"missing_property_type": true, // domains without classification
"missing_summary": true, // domains without backlink stats
"older_than_days": 30 // re-enrich stale data
},
"options": {
"fetch_ranked_keywords": true,
"fetch_domain_summary": true,
"fetch_referring_domains": false,
"classify_domain": true
}
}
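One plausible reading of the filter block, sketched in code. It assumes the three criteria are OR-ed (a domain qualifies if any enabled criterion matches); the endpoint may well combine them differently. The `has_summary` and `enriched_at` fields are hypothetical stand-ins for however the real schema tracks these.

```typescript
interface DomainRow {
  domain: string;
  property_type: string | null;
  has_summary: boolean;       // hypothetical: whether a domain_summaries row exists
  enriched_at: number | null; // hypothetical: epoch ms of the last enrichment
}

interface PendingFilter {
  missing_property_type?: boolean;
  missing_summary?: boolean;
  older_than_days?: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Assumption: a domain is pending if ANY enabled criterion matches.
function needsEnrichment(row: DomainRow, filter: PendingFilter, nowMs: number): boolean {
  if (filter.missing_property_type && row.property_type === null) return true;
  if (filter.missing_summary && !row.has_summary) return true;
  if (filter.older_than_days !== undefined) {
    const cutoff = nowMs - filter.older_than_days * DAY_MS;
    if (row.enriched_at === null || row.enriched_at < cutoff) return true;
  }
  return false;
}

// Apply the filter, then cap the batch at `limit` domains.
function selectPending(
  rows: DomainRow[],
  filter: PendingFilter,
  limit: number,
  nowMs: number,
): DomainRow[] {
  return rows.filter((row) => needsEnrichment(row, filter, nowMs)).slice(0, limit);
}
```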
Database Tables Involved
| Table | Updated By | Key Fields |
|---|---|---|
| domains | domain-onboard | property_type, tier1_type, brand_id, domain_rank |
| domain_summaries | domain-onboard | backlinks_count, referring_domains_count, spam_score |
| domain_keyword_rankings | domain-onboard | keyword, position, search_volume, intent |
| brands | domain-onboard | name, primary_domain_id |
| referring_domains | domain-onboard | source_domain_id, target_domain_id, backlinks_count |
| backlinks | backlink-classify | page_type, tactic_type, quality_tier |
Trigger Options
Option 1: Queue on Insert (Real-time)
-- Pseudo-trigger (implemented in app code)
ON INSERT INTO domains
→ Queue to domain-onboard if auto_enrich = true
Option 2: Scheduled Cron (Batch)
0 * * * * → Find domains missing classification → Queue batch
Option 3: Manual via API
POST /api/admin/domains/onboard { domain: "example.com" }
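Option 1 might look roughly like this in app code. The `auto_enrich` flag is from the pseudo-trigger above; the `Queue` interface is a simplified, synchronous stand-in defined here for illustration, not the actual queue binding, and `onDomainInserted` is a hypothetical hook name.

```typescript
interface NewDomain {
  domain: string;
  auto_enrich: boolean;
}

interface OnboardMessage {
  domain: string;
}

// Simplified stand-in for the domain-onboard queue (a real binding would be async).
interface Queue<T> {
  send(message: T): void;
}

// Call right after the INSERT succeeds; returns whether a message was queued.
function onDomainInserted(row: NewDomain, queue: Queue<OnboardMessage>): boolean {
  if (!row.auto_enrich) return false;
  queue.send({ domain: row.domain });
  return true;
}

// In-memory queue for local experiments.
const sent: OnboardMessage[] = [];
const memoryQueue: Queue<OnboardMessage> = { send: (message) => { sent.push(message); } };

onDomainInserted({ domain: "example.com", auto_enrich: true }, memoryQueue);
onDomainInserted({ domain: "ignored.com", auto_enrich: false }, memoryQueue);
```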
Recommended Default Flow
For new domains discovered via app crawls:
- Insert domain with needs_enrichment = true
- Cron runs hourly, finds pending domains
- Queues to domain-onboard with minimal options:
  - fetch_ranked_keywords: true (for classification)
  - fetch_domain_summary: true (for DR/spam score)
  - classify_domain: true
  - fetch_backlinks: false (too expensive for bulk)
For domains we care about (competitors, tracked brands):
- Manual trigger via API with full options
- Or flag domain with priority = high for full enrichment
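The two tiers can be expressed as option presets. A sketch only: the flag names are the onboard endpoint's, but treating "full options" as every fetch enabled, leaving fetch_referring_domains off for bulk, and the "normal" priority value are all assumptions.

```typescript
interface OnboardOptions {
  fetch_ranked_keywords: boolean;
  fetch_domain_summary: boolean;
  fetch_referring_domains: boolean;
  fetch_backlinks: boolean;
  classify_domain: boolean;
}

// Minimal options for bulk cron enrichment ($0.05/domain).
const BULK_OPTIONS: OnboardOptions = {
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  fetch_referring_domains: false, // assumption: not listed in the minimal set
  fetch_backlinks: false,
  classify_domain: true,
};

// Assumption: "full options" means every fetch is enabled.
const FULL_OPTIONS: OnboardOptions = {
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  fetch_referring_domains: true,
  fetch_backlinks: true,
  classify_domain: true,
};

// "high" is the flag value from the text; "normal" is a hypothetical default.
function optionsFor(priority: "high" | "normal"): OnboardOptions {
  return priority === "high" ? FULL_OPTIONS : BULK_OPTIONS;
}
```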