Domain Onboarding Flow

Mermaid Diagram

flowchart TD
    subgraph AppCrawl["App Store Crawl (Entry Point)"]
        APP[Crawl App Store] --> SCRAPE[HTML Scrape<br/>FREE - Desktop Safari UA]
        SCRAPE --> ITUNES[iTunes API<br/>FREE - authoritative metadata]
        ITUNES --> APP_DETAILS[app-details-consumer]
        
        APP_DETAILS --> BRAND_CREATE[ensureBrand<br/>name from developer_name]
        APP_DETAILS --> DOMAIN_CREATE[ensureDomain<br/>from developer_url]
        APP_DETAILS --> ICON[Upload icon<br/>to CF Images]
        APP_DETAILS --> SIMILAR{crawl_depth<br/>> 0?}
        SIMILAR -->|Yes| QUEUE_SIMILAR[Queue similar apps<br/>with crawl_depth=0]
        SIMILAR -->|No| SKIP_SIMILAR[Skip similar apps]
    end

    subgraph DomainTrigger["Domain Onboarding Trigger"]
        DOMAIN_CREATE --> TRIGGER{triggerDomainOnboarding<br/>called?}
        TRIGGER -->|Yes| SET_STATUS[Set onboard_status=pending]
        TRIGGER -->|No| MANUAL[Wait for manual trigger]
        
        SET_STATUS --> QUEUE_JOBS[Queue 3 jobs to<br/>domain-onboard queue]
    end

    subgraph DomainOnboard["domain-onboard-consumer (3 Parallel Jobs)"]
        QUEUE_JOBS --> JOB1[fetch_keywords<br/>$0.03 - limit 100]
        QUEUE_JOBS --> JOB2[fetch_summary<br/>$0.02]
        QUEUE_JOBS --> JOB3[fetch_backlinks<br/>$0.04 - limit 50]
        
        JOB1 --> KW_STORE[(domain_keyword_rankings)]
        JOB1 --> BRAND_UPDATE[Update brand.name<br/>from website_name<br/>if placeholder]
        
        JOB2 --> SUMMARY_STORE[(domain_summaries<br/>domain_rank, backlink_count)]
        
        JOB3 --> BL_STORE[(backlinks)]
        JOB3 --> BL_QUEUE[Queue URLs to<br/>backlink-classify]
        
        KW_STORE --> CHECK_COMPLETE{All 3 jobs<br/>complete?}
        SUMMARY_STORE --> CHECK_COMPLETE
        BL_STORE --> CHECK_COMPLETE
        
        CHECK_COMPLETE -->|Yes| COMPLETE[onboard_status=complete]
        CHECK_COMPLETE -->|No| WAIT[Wait for other jobs]
    end

    subgraph BacklinkClassify["backlink-classify-consumer"]
        BL_QUEUE --> URL_CLASS[Classify each URL<br/>Rules → Vectorize → LLM]
        URL_CLASS --> URL_STORE[Update backlinks:<br/>page_type, tactic_type,<br/>quality_tier]
    end

    subgraph Classification["Domain Classification (Separate API Call)"]
        CLASSIFY_API[POST /admin/classifier/domain] --> CLASSIFIER[classifyDomain]
        CLASSIFIER --> C0{Cache hit?}
        C0 -->|Yes| C_CACHED[Return cached]
        C0 -->|No| C1[Stage 1: Rules<br/>FREE]
        C1 --> C1_5[Stage 1.5: Google Ads Categories<br/>FREE]
        C1_5 --> C2[Stage 2: Vectorize<br/>FREE]
        C2 --> C3[Stage 3: Low-Noise Crawl<br/>FREE]
        C3 --> C3_CONF{≥70%?}
        C3_CONF -->|Yes| C_DONE
        C3_CONF -->|No| C4[Stage 4: Instant Pages<br/>$0.000125]
        C4 --> C4_5[Stage 4.5: Domain Patterns<br/>FREE]
        C4_5 --> C5{Still uncertain?}
        C5 -->|Yes| C6[Stage 5: LLM<br/>~$0.0001]
        C5 -->|No| C_DONE
        C6 --> C_DONE[Store & Learn]
    end

    subgraph OtherTriggers["Other Domain Entry Points"]
        A2[Backlink Discovery] --> DOMAIN_CREATE2[ensureDomain]
        A3[Manual API Add] --> DOMAIN_CREATE2
        A4[SERP Crawl] --> DOMAIN_CREATE2
        DOMAIN_CREATE2 --> TRIGGER
    end

    style SCRAPE fill:#c8e6c9
    style ITUNES fill:#c8e6c9
    style JOB1 fill:#fff3e0
    style JOB2 fill:#fff3e0
    style JOB3 fill:#fff3e0
    style C3 fill:#c8e6c9
    style C4 fill:#fff3e0
    style C6 fill:#ffcdd2
    style COMPLETE fill:#c8e6c9

Key Changes from Previous Version:

App scraping: HTML scrape is FREE and runs FIRST, iTunes API second
Domain onboard queues 3 PARALLEL jobs, not sequential
Classification is a SEPARATE API call, not automatic
Similar apps use crawl_depth=0 to prevent infinite recursion
Limits reduced: 100 keywords, 50 backlinks (D1 rate limits)
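
Two of these changes, the one-hop similar-app crawl and the three-job completion check, can be sketched as small pure functions. This is a minimal illustration and not the consumer code itself: the function names and the `SimilarAppJob` shape are hypothetical, and only the job names and the `crawl_depth` rule come from the diagram.

```typescript
// Job names as shown in the diagram.
type OnboardJob = "fetch_keywords" | "fetch_summary" | "fetch_backlinks";

const ONBOARD_JOBS: OnboardJob[] = ["fetch_keywords", "fetch_summary", "fetch_backlinks"];

// Hypothetical queue message shape for a similar-app crawl.
interface SimilarAppJob {
  appId: string;
  crawl_depth: number;
}

// Similar apps are queued only when the parent was crawled with crawl_depth > 0,
// and always with crawl_depth = 0, so the crawl stops after one hop.
function planSimilarAppJobs(similarAppIds: string[], crawlDepth: number): SimilarAppJob[] {
  if (crawlDepth <= 0) return [];
  return similarAppIds.map((appId) => ({ appId, crawl_depth: 0 }));
}

// The domain stays pending until all three parallel jobs have finished,
// regardless of the order in which they complete.
function onboardStatus(finished: OnboardJob[]): "pending" | "complete" {
  const allDone = ONBOARD_JOBS.every((job) => finished.indexOf(job) !== -1);
  return allDone ? "complete" : "pending";
}
```

Because every queued similar app carries `crawl_depth: 0`, processing it produces no further similar-app jobs, which is what bounds the recursion.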

Domain Classification Pipeline

The classifier uses a cost-optimized pipeline - FREE stages run first, PAID stages only when needed:

flowchart LR
    subgraph FREE["FREE Stages"]
        S0[Cache Check] --> S1[Rules Engine]
        S1 --> S1_5[Google Ads<br/>Categories]
        S1_5 --> S2[Vectorize]
        S2 --> S3[Low-Noise Crawl]
    end
    
    subgraph PAID["PAID Stages (only if needed)"]
        S3 --> S4[Instant Pages<br/>$0.000125]
        S4 --> S5[LLM Fallback<br/>~$0.0001]
    end
    
    S3 -->|"≥70% confidence"| DONE[Done]
    S4 -->|"≥70% confidence"| DONE
    S5 --> DONE
    
    style S0 fill:#e8f5e9
    style S1 fill:#e8f5e9
    style S1_5 fill:#e8f5e9
    style S2 fill:#e8f5e9
    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2

Stage Details

| Stage | Name | Cost | What It Does |
|---|---|---|---|
| 0 | Cache | FREE | Check if domain already classified in D1 |
| 1 | Rules | FREE | Known domains, TLDs (.gov/.edu), subdomain services, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data to derive tier1_type |
| 2 | Vectorize | FREE | Semantic similarity to known classified domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), extract `<head>` metadata |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch (only if low-noise insufficient) |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
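
The staged, cost-ordered behaviour can be sketched as a loop that stops at the first stage to reach the 70% confidence gate. This is an illustrative sketch under assumptions: the `Stage`, `StageResult`, and `Classification` types and the `classifyWithStages` function are hypothetical, and the real `classifyDomain` may gate some stages differently.

```typescript
interface StageResult {
  type: string;       // e.g. a tier1_type label
  confidence: number; // 0..1
}

interface Stage {
  name: string;
  costUsd: number; // 0 for FREE stages
  run: (domain: string) => StageResult | null;
}

interface Classification {
  type: string;
  confidence: number;
  decidedBy: string;
  spentUsd: number;
}

const CONFIDENCE_THRESHOLD = 0.7; // the ">=70%" gate from the diagrams

// Runs stages in the given order and returns as soon as one clears the threshold,
// so later (paid) stages are never invoked when an earlier stage is confident enough.
function classifyWithStages(domain: string, stages: Stage[]): Classification | null {
  let best: Classification | null = null;
  let spentUsd = 0;
  for (const stage of stages) {
    spentUsd += stage.costUsd;
    const result = stage.run(domain);
    if (result === null) continue;
    const candidate: Classification = {
      type: result.type,
      confidence: result.confidence,
      decidedBy: stage.name,
      spentUsd,
    };
    if (candidate.confidence >= CONFIDENCE_THRESHOLD) return candidate;
    if (best === null || candidate.confidence > best.confidence) best = candidate;
  }
  // Nothing cleared the threshold: report the strongest guess and the total spend.
  return best === null ? null : { ...best, spentUsd };
}
```

Ordering the `stages` array from free to paid is what makes the pipeline cost-optimized; the loop itself does not need to know which stages cost money.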

Low-Noise Crawl (Stage 3)

The low-noise crawler is a FREE alternative to DataForSEO Instant Pages:

Phase 1: DNS Resolution
  └─> Check if domain resolves (root vs www)
  └─> Determine canonical host

Phase 2: HEAD Request  
  └─> Follow redirects
  └─> Capture server headers (content-type, x-powered-by)

Phase 3: Partial GET (Range: 0-8KB)
  └─> Extract <head> section only
  └─> Parse: title, description, canonical, robots
  └─> Parse: og:type, og:site_name, generator
  └─> Detect CMS from generator meta tag
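
Phase 3 can be illustrated with a small parser that works on the first 8KB of a response. This is a rough sketch, not the crawler's actual parser: `parseHead`, `metaContent`, and `HeadMetadata` are hypothetical names, the regex approach is a simplification, and the exact Range header value in the comment is an assumption.

```typescript
interface HeadMetadata {
  title: string | null;
  description: string | null;
  ogType: string | null;
  generator: string | null;
}

// 8KB window. As an HTTP header this would look like "Range: bytes=0-8191" (assumed value).
const MAX_CHARS = 8 * 1024;

// Finds <meta name="..."> or <meta property="..."> and returns its content attribute.
// Assumes the key attribute appears before content, which is common but not guaranteed.
function metaContent(head: string, key: string): string | null {
  const pattern = new RegExp(
    "<meta[^>]*(?:name|property)=[\"']" + key + "[\"'][^>]*content=[\"']([^\"']*)[\"']",
    "i"
  );
  const match = pattern.exec(head);
  return match ? match[1] ?? null : null;
}

// Extracts metadata from the <head> section only; anything after </head> is ignored.
function parseHead(partialHtml: string): HeadMetadata {
  // slice() counts UTF-16 code units, which only approximates a byte limit.
  const chunk = partialHtml.slice(0, MAX_CHARS);
  const headEnd = chunk.search(/<\/head>/i);
  const head = headEnd === -1 ? chunk : chunk.slice(0, headEnd);
  const titleMatch = /<title[^>]*>([^<]*)<\/title>/i.exec(head);
  return {
    title: titleMatch ? (titleMatch[1] ?? "").trim() : null,
    description: metaContent(head, "description"),
    ogType: metaContent(head, "og:type"),
    generator: metaContent(head, "generator"),
  };
}
```

A truncated response is expected here: if `</head>` falls outside the 8KB window, the sketch simply parses whatever arrived.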

Detection Capabilities:

CMS detection: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow
og:type mapping: product → ecommerce, article → blog, music/video → streaming
Parked domain detection: "domain for sale", "coming soon" patterns
Content signals: SaaS keywords, ecommerce keywords, news patterns
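
The first three detection rules can be sketched as lookups over the parsed metadata. Treat this as an illustration under assumptions: `detectCms` and `classifySignal` are hypothetical names, the pattern lists contain only the examples named above, and checking parked phrases before og:type is a choice made for this sketch.

```typescript
type SiteSignal = "ecommerce" | "blog" | "streaming" | "parked" | "unknown";

const KNOWN_CMS = ["WordPress", "Shopify", "Ghost", "Hugo", "Jekyll", "Wix", "Squarespace", "Webflow"];

// Only the two phrases listed above; a production list would likely be longer.
const PARKED_PATTERNS = [/domain for sale/i, /coming soon/i];

// Matches the generator meta tag (e.g. "WordPress 6.4") against the known CMS names.
function detectCms(generator: string | null): string | null {
  if (generator === null) return null;
  const lower = generator.toLowerCase();
  for (const cms of KNOWN_CMS) {
    if (lower.indexOf(cms.toLowerCase()) !== -1) return cms;
  }
  return null;
}

// Maps og:type and title text to a coarse site signal.
function classifySignal(ogType: string | null, title: string | null): SiteSignal {
  const text = title ?? "";
  if (PARKED_PATTERNS.some((p) => p.test(text))) return "parked";
  const type = (ogType ?? "").toLowerCase();
  if (type === "product") return "ecommerce";
  if (type === "article") return "blog";
  // og:type values such as "music.song" or "video.movie" share a prefix.
  if (type.indexOf("music") === 0 || type.indexOf("video") === 0) return "streaming";
  return "unknown";
}
```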

Why It Runs Before Instant Pages:

FREE - no API costs
Fast - fetches only 8KB instead of the full page
Low detection risk - a HEAD request plus a Range header resembles a browser prefetch
Sufficient for 70%+ of domains - CMS, og:type, and generator signals handle most cases

Brand Name Priority Logic

First real name wins. DataForSEO only fills gaps.

| Current brand.name | website_name from DFS | Action |
|---|---|---|
| NULL | "Spotify AB" | Update to "Spotify AB" |
| "spotify" (domain fallback) | "Spotify AB" | Update to "Spotify AB" |
| "Spotify" (from developer_name) | "Spotify AB" | Keep "Spotify" |
| "Spotify" | NULL | Keep "Spotify" |

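// Brand update logic: a name counts as a placeholder when it is missing or was
// derived from the domain itself (the full domain, or the domain minus a common TLD).
const isPlaceholderName = !brand.name ||
                          brand.name === domain ||
                          brand.name === domain.replace(/\.(com|io|co|app)$/, '');

// Only overwrite placeholders; a real name (e.g. from developer_name) is kept.
if (websiteName && isPlaceholderName) {
  await updateBrandName(brand.id, websiteName);
}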
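
The four rows of the priority table can be checked against the same rule written as a pure function. This is a sketch for illustration: `resolveBrandName` is a hypothetical wrapper, and the domain `spotify.com` is assumed, since the table does not state it.

```typescript
// Same placeholder test as the snippet above, taking the name as a parameter.
function isPlaceholderName(name: string | null, domain: string): boolean {
  return !name || name === domain || name === domain.replace(/\.(com|io|co|app)$/, "");
}

// Returns the name the brand should carry once website_name is known.
function resolveBrandName(
  current: string | null,
  domain: string,
  websiteName: string | null
): string | null {
  if (websiteName && isPlaceholderName(current, domain)) return websiteName;
  return current;
}

// The four table rows, with domain "spotify.com" assumed:
const row1 = resolveBrandName(null, "spotify.com", "Spotify AB");      // update
const row2 = resolveBrandName("spotify", "spotify.com", "Spotify AB"); // update
const row3 = resolveBrandName("Spotify", "spotify.com", "Spotify AB"); // keep
const row4 = resolveBrandName("Spotify", "spotify.com", null);         // keep
```

Note that the comparison is case-sensitive, so `"Spotify"` is treated as a real name while `"spotify"` is treated as a domain-derived placeholder.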

API Costs Per Domain

| Step | Endpoint | Cost | Data Retrieved |
|---|---|---|---|
| Ranked Keywords | `dataforseo_labs/google/ranked_keywords/live` | $0.03 | Keywords domain ranks for, search intent, SERP features |
| Domain Summary | `backlinks/summary/live` | $0.02 | Total backlinks, referring domains, spam score, DR |
| Referring Domains | `backlinks/referring_domains/live` | $0.04 | List of linking domains with metrics |
| Backlinks (per domain) | `backlinks/backlinks/live` | $0.04 | Individual backlink URLs and anchor text |

Minimum onboard (classification only): $0.05/domain (ranked keywords + summary)

Full onboard (with backlinks): $0.09+ depending on referring domain count
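
The totals above follow from adding the per-step prices. A small estimator makes the arithmetic explicit; this is a sketch under assumptions, where `OnboardPlan` and both functions are hypothetical and the backlinks price is modelled as one charge per `backlinks/backlinks/live` call.

```typescript
// Prices in US cents, copied from the cost table. Integer cents avoid float rounding.
const STEP_COST_CENTS = {
  rankedKeywords: 3,   // dataforseo_labs/google/ranked_keywords/live
  domainSummary: 2,    // backlinks/summary/live
  referringDomains: 4, // backlinks/referring_domains/live
  backlinksCall: 4,    // backlinks/backlinks/live, assumed to be charged per call
};

interface OnboardPlan {
  rankedKeywords: boolean;
  domainSummary: boolean;
  referringDomains: boolean;
  backlinksCalls: number; // how many backlinks calls the plan makes
}

function estimateCostCents(plan: OnboardPlan): number {
  let cents = 0;
  if (plan.rankedKeywords) cents += STEP_COST_CENTS.rankedKeywords;
  if (plan.domainSummary) cents += STEP_COST_CENTS.domainSummary;
  if (plan.referringDomains) cents += STEP_COST_CENTS.referringDomains;
  cents += plan.backlinksCalls * STEP_COST_CENTS.backlinksCall;
  return cents;
}

function formatUsd(cents: number): string {
  return "$" + (cents / 100).toFixed(2);
}
```

With keywords and summary only, the estimate is 5 cents ($0.05); adding a single backlinks call brings it to 9 cents ($0.09), matching the two figures above.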

Control Parameters

Endpoint: `POST /api/admin/domains/onboard`

{
  "domain": "example.com",
  "options": {
    "fetch_ranked_keywords": true,    // $0.03 - needed for classification
    "fetch_domain_summary": true,     // $0.02 - backlink aggregate stats
    "fetch_referring_domains": false, // $0.04 - list of linking domains
    "fetch_backlinks": false,         // $0.04/batch - individual backlinks
    "classify_domain": true,          // mostly FREE - paid stages (Instant Pages, LLM) run only if needed
    "classify_backlinks": false,      // FREE - but slow, queue-based
    "create_brand": true              // FREE - extract from website_name
  }
}
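
The `//` annotations in the example are for readers only; standard JSON does not allow comments, so a client should send the body without them. One way to build such a request is sketched below. The `OnboardOptions` type and `buildOnboardRequest` helper are hypothetical, and treating the example values as defaults is an assumption the source does not confirm.

```typescript
interface OnboardOptions {
  fetch_ranked_keywords: boolean;
  fetch_domain_summary: boolean;
  fetch_referring_domains: boolean;
  fetch_backlinks: boolean;
  classify_domain: boolean;
  classify_backlinks: boolean;
  create_brand: boolean;
}

// Assumption: the values from the example above double as defaults.
const EXAMPLE_OPTIONS: OnboardOptions = {
  fetch_ranked_keywords: true,
  fetch_domain_summary: true,
  fetch_referring_domains: false,
  fetch_backlinks: false,
  classify_domain: true,
  classify_backlinks: false,
  create_brand: true,
};

interface OnboardRequest {
  method: "POST";
  path: string;
  body: string; // comment-free JSON
}

// Merges caller overrides into the example options and serializes the body.
function buildOnboardRequest(
  domain: string,
  overrides: Partial<OnboardOptions> = {}
): OnboardRequest {
  const options: OnboardOptions = { ...EXAMPLE_OPTIONS, ...overrides };
  return {
    method: "POST",
    path: "/api/admin/domains/onboard",
    body: JSON.stringify({ domain, options }),
  };
}
```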

Batch/Cron: `POST /api/admin/domains/enrich-pending`

{
  "limit": 100,
  "filter": {
    "missing_property_type": true,    // domains without classification
    "missing_summary": true,          // domains without backlink stats
    "older_than_days": 30             // re-enrich stale data
  },
  "options": {
    "fetch_ranked_keywords": true,
    "fetch_domain_summary": true,
    "fetch_referring_domains": false,
    "classify_domain": true
  }
}
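
How the `filter` block selects domains can be sketched as a predicate over domain rows. This is illustrative only: the `DomainRow` fields `has_summary` and `enriched_at_ms` are hypothetical, and combining the conditions with OR (any match selects the domain) is an assumption, since the source does not say whether they are ANDed or ORed.

```typescript
interface DomainRow {
  domain: string;
  property_type: string | null;
  has_summary: boolean;          // hypothetical: whether a domain_summaries row exists
  enriched_at_ms: number | null; // hypothetical: last enrichment time, epoch milliseconds
}

interface EnrichFilter {
  missing_property_type?: boolean;
  missing_summary?: boolean;
  older_than_days?: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Assumption: a domain qualifies if ANY enabled condition matches.
function needsEnrichment(row: DomainRow, filter: EnrichFilter, nowMs: number): boolean {
  if (filter.missing_property_type && row.property_type === null) return true;
  if (filter.missing_summary && !row.has_summary) return true;
  if (filter.older_than_days !== undefined) {
    if (row.enriched_at_ms === null) return true;
    if (nowMs - row.enriched_at_ms > filter.older_than_days * DAY_MS) return true;
  }
  return false;
}

// Applies the filter and caps the batch at `limit`, like the request's "limit" field.
function selectPending(
  rows: DomainRow[],
  filter: EnrichFilter,
  limit: number,
  nowMs: number
): DomainRow[] {
  return rows.filter((row) => needsEnrichment(row, filter, nowMs)).slice(0, limit);
}
```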

Database Tables Involved

| Table | Updated By | Key Fields |
|---|---|---|
| `domains` | domain-onboard | property_type, tier1_type, brand_id, domain_rank |
| `domain_summaries` | domain-onboard | backlinks_count, referring_domains_count, spam_score |
| `domain_keyword_rankings` | domain-onboard | keyword, position, search_volume, intent |
| `brands` | domain-onboard | name, primary_domain_id |
| `referring_domains` | domain-onboard | source_domain_id, target_domain_id, backlinks_count |
| `backlinks` | backlink-classify | page_type, tactic_type, quality_tier |

Trigger Options

Option 1: Queue on Insert (Real-time)

-- Pseudo-trigger (implemented in app code)
ON INSERT INTO domains
  → Queue to domain-onboard if auto_enrich = true

Option 2: Scheduled Cron (Batch)

0 * * * * → Find domains missing classification → Queue batch

Option 3: Manual via API

POST /api/admin/domains/onboard { domain: "example.com" }

Recommended Default Flow

For new domains discovered via app crawls:

1. Insert domain with needs_enrichment = true
2. Cron runs hourly, finds pending domains
3. Queues to domain-onboard with minimal options:
   - fetch_ranked_keywords: true (for classification)
   - fetch_domain_summary: true (for DR/spam score)
   - classify_domain: true
   - fetch_backlinks: false (too expensive for bulk)

For domains we care about (competitors, tracked brands):

Manual trigger via API with full options
Or flag domain with priority = high for full enrichment
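
The two tiers can be expressed as a single function that picks options from a domain's priority. This is a sketch under assumptions: the `Priority` type and `optionsFor` are hypothetical, and "full options" is taken to mean every fetch and classify flag enabled, which the source does not spell out.

```typescript
type Priority = "normal" | "high";

interface EnrichOptions {
  fetch_ranked_keywords: boolean;
  fetch_domain_summary: boolean;
  fetch_referring_domains: boolean;
  fetch_backlinks: boolean;
  classify_domain: boolean;
  classify_backlinks: boolean;
}

// Bulk domains get the minimal, cheap set; high-priority domains get everything.
function optionsFor(priority: Priority): EnrichOptions {
  const full = priority === "high";
  return {
    fetch_ranked_keywords: true, // needed for classification
    fetch_domain_summary: true,  // DR and spam score
    classify_domain: true,
    fetch_referring_domains: full,
    fetch_backlinks: full,       // too expensive for bulk
    classify_backlinks: full,
  };
}
```

Keeping the decision in one place means the hourly cron and the manual API path can share the same option sets.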
