Domain Onboarding Flow

Mermaid Diagram

flowchart TD
subgraph AppCrawl["App Store Crawl (Entry Point)"]
APP[Crawl App Store] --> SCRAPE[HTML Scrape<br/>FREE - Desktop Safari UA]
SCRAPE --> ITUNES[iTunes API<br/>FREE - authoritative metadata]
ITUNES --> APP_DETAILS[app-details-consumer]

APP_DETAILS --> BRAND_CREATE[ensureBrand<br/>name from developer_name]
APP_DETAILS --> DOMAIN_CREATE[ensureDomain<br/>from developer_url]
APP_DETAILS --> ICON[Upload icon<br/>to CF Images]
APP_DETAILS --> SIMILAR{crawl_depth<br/>> 0?}
SIMILAR -->|Yes| QUEUE_SIMILAR[Queue similar apps<br/>with crawl_depth=0]
SIMILAR -->|No| SKIP_SIMILAR[Skip similar apps]
end

subgraph DomainTrigger["Domain Onboarding Trigger"]
DOMAIN_CREATE --> TRIGGER{triggerDomainOnboarding<br/>called?}
TRIGGER -->|Yes| SET_STATUS[Set onboard_status=pending]
TRIGGER -->|No| MANUAL[Wait for manual trigger]

SET_STATUS --> QUEUE_JOBS[Queue 3 jobs to<br/>domain-onboard queue]
end

subgraph DomainOnboard["domain-onboard-consumer (3 Parallel Jobs)"]
QUEUE_JOBS --> JOB1[fetch_keywords<br/>$0.03 - limit 100]
QUEUE_JOBS --> JOB2[fetch_summary<br/>$0.02]
QUEUE_JOBS --> JOB3[fetch_backlinks<br/>$0.04 - limit 50]

JOB1 --> KW_STORE[(domain_keyword_rankings)]
JOB1 --> BRAND_UPDATE[Update brand.name<br/>from website_name<br/>if placeholder]

JOB2 --> SUMMARY_STORE[(domain_summaries<br/>domain_rank, backlink_count)]

JOB3 --> BL_STORE[(backlinks)]
JOB3 --> BL_QUEUE[Queue URLs to<br/>backlink-classify]

KW_STORE --> CHECK_COMPLETE{All 3 jobs<br/>complete?}
SUMMARY_STORE --> CHECK_COMPLETE
BL_STORE --> CHECK_COMPLETE

CHECK_COMPLETE -->|Yes| COMPLETE[onboard_status=complete]
CHECK_COMPLETE -->|No| WAIT[Wait for other jobs]
end

subgraph BacklinkClassify["backlink-classify-consumer"]
BL_QUEUE --> URL_CLASS[Classify each URL<br/>Rules → Vectorize → LLM]
URL_CLASS --> URL_STORE[Update backlinks:<br/>page_type, tactic_type,<br/>quality_tier]
end

subgraph Classification["Domain Classification (Separate API Call)"]
CLASSIFY_API[POST /admin/classifier/domain] --> CLASSIFIER[classifyDomain]
CLASSIFIER --> C0{Cache hit?}
C0 -->|Yes| C_CACHED[Return cached]
C0 -->|No| C1[Stage 1: Rules<br/>FREE]
C1 --> C1_5[Stage 1.5: Google Ads Categories<br/>FREE]
C1_5 --> C2[Stage 2: Vectorize<br/>FREE]
C2 --> C3[Stage 3: Low-Noise Crawl<br/>FREE]
C3 --> C3_CONF{≥70%?}
C3_CONF -->|Yes| C_DONE
C3_CONF -->|No| C4[Stage 4: Instant Pages<br/>$0.000125]
C4 --> C4_5[Stage 4.5: Domain Patterns<br/>FREE]
C4_5 --> C5{Still uncertain?}
C5 -->|Yes| C6[Stage 5: LLM<br/>~$0.0001]
C5 -->|No| C_DONE
C6 --> C_DONE[Store & Learn]
end

subgraph OtherTriggers["Other Domain Entry Points"]
A2[Backlink Discovery] --> DOMAIN_CREATE2[ensureDomain]
A3[Manual API Add] --> DOMAIN_CREATE2
A4[SERP Crawl] --> DOMAIN_CREATE2
DOMAIN_CREATE2 --> TRIGGER
end

style SCRAPE fill:#c8e6c9
style ITUNES fill:#c8e6c9
style JOB1 fill:#fff3e0
style JOB2 fill:#fff3e0
style JOB3 fill:#fff3e0
style C3 fill:#c8e6c9
style C4 fill:#fff3e0
style C6 fill:#ffcdd2
style COMPLETE fill:#c8e6c9

Key Changes from Previous Version:

  • App scraping: HTML scrape is FREE and runs FIRST, iTunes API second
  • Domain onboard queues 3 PARALLEL jobs, not sequential
  • Classification is a SEPARATE API call, not automatic
  • Similar apps use crawl_depth=0 to prevent infinite recursion
  • Limits reduced: 100 keywords, 50 backlinks (D1 rate limits)
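
The "all 3 jobs complete?" check in the diagram can be sketched as below. This is a minimal illustration, not the consumer's actual code: `OnboardState`, `startOnboarding`, and `markJobDone` are hypothetical names, and a real consumer would persist this state in D1 rather than in memory.

```typescript
// Hypothetical types and helpers; only the job names and status values come from the diagram.
type JobType = "fetch_keywords" | "fetch_summary" | "fetch_backlinks";
type OnboardStatus = "pending" | "complete";

const REQUIRED_JOBS: readonly JobType[] = [
  "fetch_keywords",
  "fetch_summary",
  "fetch_backlinks",
];

interface OnboardState {
  onboardStatus: OnboardStatus;
  completed: ReadonlySet<JobType>;
}

// triggerDomainOnboarding sets onboard_status=pending before queueing the jobs.
function startOnboarding(): OnboardState {
  return { onboardStatus: "pending", completed: new Set<JobType>() };
}

// Record one finished job. Because the jobs run in parallel, any of them
// may finish last, so every job performs the same completeness check.
function markJobDone(state: OnboardState, job: JobType): OnboardState {
  const completed = new Set<JobType>(state.completed);
  completed.add(job);
  const allDone = REQUIRED_JOBS.every((j) => completed.has(j));
  return { onboardStatus: allDone ? "complete" : "pending", completed };
}
```

Each call returns a new state, so the order in which the three jobs finish does not matter.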

Domain Classification Pipeline

The classifier runs a cost-optimized pipeline. FREE stages run first; PAID stages run only when confidence is still below 70%:

flowchart LR
subgraph FREE["FREE Stages"]
S0[Cache Check] --> S1[Rules Engine]
S1 --> S1_5[Google Ads<br/>Categories]
S1_5 --> S2[Vectorize]
S2 --> S3[Low-Noise Crawl]
end

subgraph PAID["PAID Stages (only if needed)"]
S3 --> S4[Instant Pages<br/>$0.000125]
S4 --> S5[LLM Fallback<br/>~$0.0001]
end

S3 -->|"≥70% confidence"| DONE[Done]
S4 -->|"≥70% confidence"| DONE
S5 --> DONE

style S0 fill:#e8f5e9
style S1 fill:#e8f5e9
style S1_5 fill:#e8f5e9
style S2 fill:#e8f5e9
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2

Stage Details

| Stage | Name | Cost | What It Does |
|-------|------|------|--------------|
| 0 | Cache | FREE | Check if domain already classified in D1 |
| 1 | Rules | FREE | Known domains, TLDs (.gov/.edu), subdomain services, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data to derive tier1_type |
| 2 | Vectorize | FREE | Semantic similarity to known classified domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), extract `<head>` metadata |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch (only if low-noise insufficient) |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
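
The early-exit behaviour can be sketched as a loop over ordered stages that stops once confidence reaches 70%. This is a simplified illustration under stated assumptions: `Stage`, `StageResult`, and `classifyDomainSketch` are hypothetical names, stages are synchronous for brevity, and the threshold is checked after every stage, whereas the flowchart only shows checks after Stages 3 and 4.

```typescript
// Hypothetical shapes; the real classifyDomain signature is not shown in this document.
interface StageResult {
  type: string;
  confidence: number; // 0..1
}

interface Stage {
  name: string;
  costUsd: number;
  run: (domain: string) => StageResult | null; // null = stage had no opinion
}

const CONFIDENCE_THRESHOLD = 0.7;

function classifyDomainSketch(domain: string, stages: Stage[]) {
  let best: StageResult | null = null;
  let spentUsd = 0;
  const ran: string[] = [];

  for (const stage of stages) {
    ran.push(stage.name);
    spentUsd += stage.costUsd;
    const result = stage.run(domain);
    if (result && (!best || result.confidence > best.confidence)) {
      best = result;
    }
    // Stop before reaching any later (paid) stage once we are confident enough.
    if (best && best.confidence >= CONFIDENCE_THRESHOLD) break;
  }
  return { best, spentUsd, ran };
}

// Usage: the crawl is confident, so Instant Pages never runs and nothing is spent.
const demoStages: Stage[] = [
  { name: "rules", costUsd: 0, run: () => null },
  { name: "low_noise_crawl", costUsd: 0, run: () => ({ type: "ecommerce", confidence: 0.8 }) },
  { name: "instant_pages", costUsd: 0.000125, run: () => ({ type: "ecommerce", confidence: 0.95 }) },
];
const demoResult = classifyDomainSketch("example.com", demoStages);
```

Ordering the array by cost is what keeps paid stages as a last resort; the loop itself knows nothing about pricing.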

Low-Noise Crawl (Stage 3)

The low-noise crawler is a FREE alternative to DataForSEO Instant Pages:

Phase 1: DNS Resolution
└─> Check if domain resolves (root vs www)
└─> Determine canonical host

Phase 2: HEAD Request
└─> Follow redirects
└─> Capture server headers (content-type, x-powered-by)

Phase 3: Partial GET (Range: 0-8KB)
└─> Extract <head> section only
└─> Parse: title, description, canonical, robots
└─> Parse: og:type, og:site_name, generator
└─> Detect CMS from generator meta tag
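
Phase 3 can be illustrated with a small parser over the partial response body. This is a sketch, not the crawler's actual code: `HeadMeta`, `attr`, `metaContent`, and `extractHeadMeta` are hypothetical names, and regex parsing is an assumption chosen for brevity. An 8KB window corresponds to the HTTP header `Range: bytes=0-8191`.

```typescript
// Hypothetical result shape covering a subset of the fields listed above.
interface HeadMeta {
  title?: string;
  description?: string;
  generator?: string;
  ogType?: string;
  ogSiteName?: string;
}

// Read a quoted attribute value from a single tag, e.g. attr(tag, "name|property").
function attr(tag: string, names: string): string | undefined {
  const m = new RegExp(`\\b(?:${names})\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, "i").exec(tag);
  return m ? (m[1] ?? m[2]) : undefined;
}

// Find <meta name|property="key" content="..."> regardless of attribute order.
function metaContent(head: string, key: string): string | undefined {
  const tags = head.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of tags) {
    if (attr(tag, "name|property")?.toLowerCase() === key) {
      return attr(tag, "content");
    }
  }
  return undefined;
}

function extractHeadMeta(partialHtml: string): HeadMeta {
  // The 8KB window may end before </head>, so a missing close tag is tolerated.
  const start = partialHtml.search(/<head[\s>]/i);
  const end = partialHtml.search(/<\/head>/i);
  const head = partialHtml.slice(start < 0 ? 0 : start, end < 0 ? undefined : end);
  return {
    title: /<title[^>]*>([^<]*)<\/title>/i.exec(head)?.[1]?.trim(),
    description: metaContent(head, "description"),
    generator: metaContent(head, "generator"),
    ogType: metaContent(head, "og:type"),
    ogSiteName: metaContent(head, "og:site_name"),
  };
}
```

A production crawler would more likely use a streaming HTML parser; the regex approach here only handles quoted attribute values.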

Detection Capabilities:

  • CMS detection: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow
  • og:type mapping: product → ecommerce, article → blog, music/video → streaming
  • Parked domain detection: "domain for sale", "coming soon" patterns
  • Content signals: SaaS keywords, ecommerce keywords, news patterns
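
The first three capabilities can be sketched as simple lookups over the extracted metadata. The CMS list and og:type mapping come from the bullets above; the exact regexes, the `SiteType` labels, and the function names are assumptions for illustration.

```typescript
// Hypothetical label set; the real classifier's categories may differ.
type SiteType = "ecommerce" | "blog" | "streaming" | "parked" | "unknown";

const KNOWN_CMS = [
  "WordPress", "Shopify", "Ghost", "Hugo",
  "Jekyll", "Wix", "Squarespace", "Webflow",
];

// og:type values can carry a suffix (e.g. "music.song"), so match on the prefix.
const OG_TYPE_MAP: Array<[RegExp, SiteType]> = [
  [/^product/, "ecommerce"],
  [/^article/, "blog"],
  [/^(music|video)/, "streaming"],
];

// Assumed phrasings for parked pages, based on the two examples above.
const PARKED_PATTERNS = [/domain (is )?for sale/i, /coming soon/i];

function detectCms(generator?: string): string | undefined {
  if (!generator) return undefined;
  const g = generator.toLowerCase();
  return KNOWN_CMS.find((cms) => g.includes(cms.toLowerCase()));
}

function guessSiteType(meta: { title?: string; description?: string; ogType?: string }): SiteType {
  const text = `${meta.title ?? ""} ${meta.description ?? ""}`;
  // Parked pages are checked first because registrars often set a generic og:type.
  if (PARKED_PATTERNS.some((p) => p.test(text))) return "parked";
  const og = (meta.ogType ?? "").toLowerCase();
  for (const [pattern, siteType] of OG_TYPE_MAP) {
    if (pattern.test(og)) return siteType;
  }
  return "unknown";
}
```

Checking parked patterns before og:type is a design choice of this sketch, not something the source specifies.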

Why It Runs Before Instant Pages:

  • FREE - No API costs
  • Fast - Fetches only 8KB instead of the full page
  • Low detection risk - HEAD + Range header looks like a browser prefetch
  • Sufficient for 70%+ of domains - CMS/og:type/generator handles most cases

Brand Name Priority Logic

First real name wins. DataForSEO only fills gaps.

| Current brand.name | website_name from DFS | Action |
|--------------------|-----------------------|--------|
| NULL | "Spotify AB" | Update to "Spotify AB" |
| "spotify" (domain fallback) | "Spotify AB" | Update to "Spotify AB" |
| "Spotify" (from developer_name) | "Spotify AB" | Keep "Spotify" |
| "Spotify" | NULL | Keep "Spotify" |

// Brand update logic
const isPlaceholderName = !brand.name ||
  brand.name === domain ||
  brand.name === domain.replace(/\.(com|io|co|app)$/, '');

if (websiteName && isPlaceholderName) {
  await updateBrandName(brand.id, websiteName);
}
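
The same rule can be written as a pure function and checked against each row of the table. This is a sketch that mirrors the snippet above; `resolveBrandName` is a hypothetical helper, and like the original it only strips the four listed TLDs.

```typescript
// Mirrors the placeholder test above: empty, equal to the domain,
// or equal to the domain with a .com/.io/.co/.app suffix removed.
function isPlaceholderName(name: string | null, domain: string): boolean {
  if (!name) return true;
  const bare = domain.replace(/\.(com|io|co|app)$/, "");
  return name === domain || name === bare;
}

// Hypothetical helper: returns the name the brand should end up with.
function resolveBrandName(
  current: string | null,
  websiteName: string | null,
  domain: string,
): string | null {
  return websiteName && isPlaceholderName(current, domain) ? websiteName : current;
}

// The four table rows, in order.
const brandRows = [
  resolveBrandName(null, "Spotify AB", "spotify.com"),
  resolveBrandName("spotify", "Spotify AB", "spotify.com"),
  resolveBrandName("Spotify", "Spotify AB", "spotify.com"),
  resolveBrandName("Spotify", null, "spotify.com"),
];
```

The comparison is case-sensitive, which is why "Spotify" from developer_name counts as a real name while the lowercase domain fallback "spotify" does not.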

API Costs Per Domain

| Step | Endpoint | Cost | Data Retrieved |
|------|----------|------|----------------|
| Ranked Keywords | dataforseo_labs/google/ranked_keywords/live | $0.03 | Keywords domain ranks for, search intent, SERP features |
| Domain Summary | backlinks/summary/live | $0.02 | Total backlinks, referring domains, spam score, DR |
| Referring Domains | backlinks/referring_domains/live | $0.04 | List of linking domains with metrics |
| Backlinks (per domain) | backlinks/backlinks/live | $0.04 | Individual backlink URLs and anchor text |

Minimum onboard (classification only): $0.05/domain (ranked keywords + summary)

Full onboard (with backlinks): $0.09+ depending on referring domain count
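
The totals above can be reproduced with a small estimator. This is a sketch: the per-step prices come from the table, while `estimateCostUsd` and the `backlinkBatches` multiplier are assumptions (the source says only that backlink cost grows with referring domain count).

```typescript
// Prices in USD from the API cost table.
const COST_USD = {
  fetch_ranked_keywords: 0.03,
  fetch_domain_summary: 0.02,
  fetch_referring_domains: 0.04,
  fetch_backlinks: 0.04,
} as const;

type PaidStep = keyof typeof COST_USD;

// backlinkBatches is a hypothetical knob: how many backlink fetches are made.
function estimateCostUsd(
  enabled: Partial<Record<PaidStep, boolean>>,
  backlinkBatches = 1,
): number {
  let cents = 0; // sum in integer cents to avoid floating-point drift
  for (const step of Object.keys(COST_USD) as PaidStep[]) {
    if (!enabled[step]) continue;
    const stepCents = Math.round(COST_USD[step] * 100);
    cents += step === "fetch_backlinks" ? stepCents * backlinkBatches : stepCents;
  }
  return cents / 100;
}

const minimalCost = estimateCostUsd({
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
});
const withBacklinksCost = estimateCostUsd({
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  fetch_backlinks: true,
});
```

Here `minimalCost` matches the $0.05 minimum onboard and `withBacklinksCost` matches the $0.09 floor for a full onboard with a single backlink fetch.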

Control Parameters

Endpoint: POST /api/admin/domains/onboard

{
  "domain": "example.com",
  "options": {
    "fetch_ranked_keywords": true,     // $0.03 - needed for classification
    "fetch_domain_summary": true,      // $0.02 - backlink aggregate stats
    "fetch_referring_domains": false,  // $0.04 - list of linking domains
    "fetch_backlinks": false,          // $0.04/batch - individual backlinks
    "classify_domain": true,           // FREE - rules/vectorize/LLM
    "classify_backlinks": false,       // FREE - but slow, queue-based
    "create_brand": true               // FREE - extract from website_name
  }
}

Batch/Cron: POST /api/admin/domains/enrich-pending

{
  "limit": 100,
  "filter": {
    "missing_property_type": true,  // domains without classification
    "missing_summary": true,        // domains without backlink stats
    "older_than_days": 30           // re-enrich stale data
  },
  "options": {
    "fetch_ranked_keywords": true,
    "fetch_domain_summary": true,
    "fetch_referring_domains": false,
    "classify_domain": true
  }
}

Database Tables Involved

| Table | Updated By | Key Fields |
|-------|------------|------------|
| domains | domain-onboard | property_type, tier1_type, brand_id, domain_rank |
| domain_summaries | domain-onboard | backlinks_count, referring_domains_count, spam_score |
| domain_keyword_rankings | domain-onboard | keyword, position, search_volume, intent |
| brands | domain-onboard | name, primary_domain_id |
| referring_domains | domain-onboard | source_domain_id, target_domain_id, backlinks_count |
| backlinks | backlink-classify | page_type, tactic_type, quality_tier |

Trigger Options

Option 1: Queue on Insert (Real-time)

-- Pseudo-trigger (implemented in app code)
ON INSERT INTO domains
→ Queue to domain-onboard if auto_enrich = true

Option 2: Scheduled Cron (Batch)

0 * * * * → Find domains missing classification → Queue batch

Option 3: Manual via API

POST /api/admin/domains/onboard { domain: "example.com" }

For new domains discovered via app crawls:

  1. Insert domain with needs_enrichment = true
  2. Cron runs hourly, finds pending domains
  3. Queues to domain-onboard with minimal options:
    • fetch_ranked_keywords: true (for classification)
    • fetch_domain_summary: true (for DR/spam score)
    • classify_domain: true
    • fetch_backlinks: false (too expensive for bulk)

For domains we care about (competitors, tracked brands):

  1. Manual trigger via API with full options
  2. Or flag domain with priority = high for full enrichment
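
The two paths above can be sketched as option presets selected by priority. The minimal preset follows the four options listed for bulk domains; the values for the remaining flags, the "all true" full preset, and the names `OnboardOptions` and `buildOnboardRequest` are assumptions for illustration.

```typescript
// Field names match the POST /api/admin/domains/onboard body shown earlier.
interface OnboardOptions {
  fetch_ranked_keywords: boolean;
  fetch_domain_summary: boolean;
  fetch_referring_domains: boolean;
  fetch_backlinks: boolean;
  classify_domain: boolean;
  classify_backlinks: boolean;
  create_brand: boolean;
}

// Bulk path: classification data only, no backlink fetches.
const MINIMAL_OPTIONS: OnboardOptions = {
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  classify_domain: true,
  fetch_backlinks: false,
  fetch_referring_domains: false, // assumption: not listed in the minimal set
  classify_backlinks: false,      // assumption: nothing to classify without backlinks
  create_brand: true,             // assumption: mirrors the example request body
};

// Assumption: "full options" means every flag enabled.
const FULL_OPTIONS: OnboardOptions = {
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  fetch_referring_domains: true,
  fetch_backlinks: true,
  classify_domain: true,
  classify_backlinks: true,
  create_brand: true,
};

function buildOnboardRequest(domain: string, priority: "high" | "normal") {
  const options = priority === "high" ? FULL_OPTIONS : MINIMAL_OPTIONS;
  return { domain, options: { ...options } }; // copy so callers cannot mutate presets
}
```

For example, `buildOnboardRequest("example.com", "normal")` yields a body suitable for the hourly cron path, while `"high"` yields the full enrichment request for tracked brands and competitors.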