System Overview

This page gives the highest-level view of RankFabric's data pipelines.

Master System Diagram

flowchart TB
    subgraph EntryPoints["Data Entry Points"]
        UI[React UI<br/>Manual domain/asset add]
        APPLE[Apple App Store<br/>Chart crawls]
        GOOGLE[Google Play<br/>DataForSEO webhooks]
        SERP[SERP Tracking<br/>Daily keyword rankings]
    end

    subgraph AppCrawl["App Store Pipeline"]
        APPLE --> CRAWL_APPLE[HTML Scrape<br/>FREE]
        GOOGLE --> CRAWL_GOOGLE[DataForSEO Webhook<br/>PAID]
        CRAWL_APPLE --> ITUNES[iTunes API<br/>FREE fallback]
        CRAWL_APPLE --> APP_DETAILS[app-details-consumer]
        CRAWL_GOOGLE --> APP_DETAILS
        ITUNES --> APP_DETAILS
        
        APP_DETAILS --> BRAND_CREATE[ensureBrand<br/>Create/link brand]
        APP_DETAILS --> DOMAIN_CREATE[ensureDomain<br/>from developer_url]
        APP_DETAILS --> ICON_UPLOAD[CF Images<br/>Upload icon]
        APP_DETAILS --> SIMILAR[Queue similar apps<br/>crawl_depth=0]
    end

    subgraph SerpPipeline["SERP Tracking Pipeline"]
        SERP --> SERP_CONSUMER[serp-consumer]
        SERP_CONSUMER --> SERP_ENSURE[ensureUrl for each<br/>ranking URL]
        SERP_ENSURE --> SERP_DOMAIN[ensureDomain<br/>auto-queues classification]
        SERP_CONSUMER --> SERP_STORE[(serp_positions<br/>serp_runs)]
    end

    subgraph UrlFlow["URL/Domain Auto-Classification"]
        DOMAIN_CREATE --> ENSURE_URL[ensureUrl / ensureDomain]
        SERP_DOMAIN --> ENSURE_URL
        UI_RESOLVE --> ENSURE_URL
        
        ENSURE_URL --> CHECK_CLASSIFIED{Already<br/>classified?}
        CHECK_CLASSIFIED -->|No| DOMAIN_Q[domain-classify queue]
        CHECK_CLASSIFIED -->|No| URL_Q[backlink-classify queue]
        CHECK_CLASSIFIED -->|Yes| SKIP[Skip - use cached]
    end

    subgraph DomainPipeline["Domain Enrichment Pipeline (On-Demand)"]
        UI --> UI_RESOLVE[Resolve www vs non-www]
        
        ENRICH_API[API: /admin/domains/enrich] --> ENRICH_QUEUE[domain-enrich queue]
        
        ENRICH_QUEUE --> JOB1[fetch_keywords<br/>$0.03]
        ENRICH_QUEUE --> JOB2[fetch_summary<br/>$0.02]
        ENRICH_QUEUE --> JOB3[fetch_backlinks<br/>$0.04]
        
        JOB1 --> DFS_HINTS[DataForSEO provides:<br/>website_name, categories,<br/>platform_types]
        JOB1 --> KW_STORE[(domain_keyword_rankings)]
        JOB2 --> SUMMARY_STORE[(domain_summaries)]
        JOB3 --> BL_STORE[(backlinks)]
        JOB3 --> BL_URL_QUEUE[Queue source URLs<br/>to backlink-classify]
        
        DFS_HINTS -.->|Hints for| CLASSIFIER
    end

    subgraph Classification["Domain Classification (7 Stages)"]
        DOMAIN_Q --> CLASSIFIER[Domain Classifier]
        CLASSIFIER --> S0[0. Cache]
        S0 --> S1[1. Rules<br/>FREE]
        S1 --> S1_5[1.5 Google Ads Categories<br/>from DFS cache - FREE]
        S1_5 --> S2[2. Vectorize<br/>FREE]
        S2 --> S3[3. Low-Noise Crawl<br/>FREE]
        S3 --> S4[4. Instant Pages<br/>$0.000125]
        S4 --> S5[5. LLM<br/>~$0.0001]
        
        S3 -->|">=70%"| CLASS_DONE[Store classification]
        S4 -->|">=70%"| CLASS_DONE
        S5 --> CLASS_DONE
        
        S5 -->|">=80% confidence"| LEARN_DOMAIN[Learn: upsert to Vectorize]
        LEARN_DOMAIN -.->|Improves future| S2
    end

    subgraph UrlClassification["URL Classification"]
        URL_Q --> URL_CLASSIFY[URL Classifier<br/>Rules -> Vectorize -> LLM]
        BL_URL_QUEUE --> URL_CLASSIFY
        URL_CLASSIFY --> URL_UPDATE[(Update urls:<br/>page_type, quality_tier)]
        
        URL_CLASSIFY -->|">=65% confidence"| LEARN_URL[Learn: upsert to Vectorize]
        LEARN_URL -.->|Improves future| URL_CLASSIFY
    end

    subgraph Vectorize["Vectorize Index (Self-Learning)"]
        VECTORIZE_DB[(VECTORIZE_DOMAINS<br/>Embeddings + metadata)]
        LEARN_DOMAIN --> VECTORIZE_DB
        LEARN_URL --> VECTORIZE_DB
        S2 -.->|Query similar| VECTORIZE_DB
    end

    subgraph Storage["Data Storage"]
        D1[(Cloudflare D1<br/>apps, domains, urls, brands,<br/>rankings, backlinks)]
        KV[(KV<br/>Run state, budgets)]
        R2[(R2<br/>Raw HTML)]
        IMAGES[(CF Images<br/>App icons)]
        CH[(ClickHouse<br/>Analytics, trends)]
    end

    APP_DETAILS --> D1
    KW_STORE --> D1
    SUMMARY_STORE --> D1
    BL_STORE --> D1
    URL_UPDATE --> D1
    CLASS_DONE --> D1
    ICON_UPLOAD --> IMAGES
    SERP_STORE --> D1
    SERP_STORE --> CH

    style CRAWL_APPLE fill:#c8e6c9
    style S1 fill:#c8e6c9
    style S2 fill:#c8e6c9
    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2
    style JOB1 fill:#fff3e0
    style JOB2 fill:#fff3e0
    style JOB3 fill:#fff3e0
    style ENSURE_URL fill:#e1bee7
    style LEARN_DOMAIN fill:#b3e5fc
    style LEARN_URL fill:#b3e5fc
    style VECTORIZE_DB fill:#b3e5fc

How URLs/Domains Get Classified

Every URL that enters the system flows through ensureUrl(), which:

1. Creates/updates the urls record
2. Calls ensureDomain() to create/update the domain record
3. Auto-queues unclassified domains to the domain-classify queue
4. Auto-queues unclassified URLs to the backlink-classify queue

This happens for URLs from ALL sources:

- SERP tracking (new competitor URLs in rankings)
- App store crawls (developer URLs)
- Backlink fetches (source URLs)
- Manual additions via UI
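
The ensureUrl() steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual RankFabric code: the Maps standing in for the D1 tables, the arrays standing in for the queues, the record shapes, and the `queued` de-duplication flag are all assumptions. The 60% cutoff is the one listed under Classification Triggers.

```javascript
// Hypothetical in-memory stand-ins for the D1 tables and the two queues.
const domains = new Map();
const urls = new Map();
const queues = { "domain-classify": [], "backlink-classify": [] };

// Cutoff from the Classification Triggers table (>=60% confidence).
const MIN_CONFIDENCE = 0.6;

function isClassified(record) {
  return record.confidence !== null && record.confidence >= MIN_CONFIDENCE;
}

function ensureDomain(rawUrl) {
  const hostname = new URL(rawUrl).hostname;
  let record = domains.get(hostname);
  if (!record) {
    record = { domain: hostname, confidence: null, queued: false };
    domains.set(hostname, record);
  }
  // Queue each unclassified domain only once (assumed behaviour).
  if (!isClassified(record) && !record.queued) {
    queues["domain-classify"].push(hostname);
    record.queued = true;
  }
  return record;
}

function ensureUrl(rawUrl) {
  const key = new URL(rawUrl).href;
  let record = urls.get(key);
  if (!record) {
    record = { url: key, confidence: null, queued: false };
    urls.set(key, record);
  }
  // Every URL also creates or refreshes its domain record.
  record.domain = ensureDomain(rawUrl).domain;
  if (!isClassified(record) && !record.queued) {
    queues["backlink-classify"].push(key);
    record.queued = true;
  }
  return record;
}

ensureUrl("https://example.com/pricing");
ensureUrl("https://example.com/blog");
console.log(queues);
```

In this sketch, two URLs on the same host produce two entries on backlink-classify but only one on domain-classify, which is the behaviour the cache check is meant to give.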

Vectorize Learning (Self-Improving System)

The classification system gets smarter over time:

LLM Classification (>=80% confidence for domains, >=65% for URLs)
    |
    v
Generate embedding from: domain + title + description + content
    |
    v
Upsert to VECTORIZE_DOMAINS index with metadata:
  { domain, property_type, channel, confidence, classified_at }
    |
    v
Future Stage 2 queries find similar domains/URLs
    |
    v
Better classification without LLM cost

Key Files:

src/lib/domain-classifier.js - learnDomainClassification() at line 2481
src/lib/backlink-classifier.js - learnFromClassification() at line ~90
src/lib/classifier-vectorize.js - addClassifiedUrl() at line 405
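
As a rough illustration of the learning flow above, the sketch below builds the embedding input and upserts one record with the listed metadata fields. It is not the code in the files above: `fakeEmbed`, `fakeIndex`, `learnDomain`, and the `example_channel` value are hypothetical, and the index stub only assumes an `upsert()` method that accepts an array of `{ id, values, metadata }` objects.

```javascript
// Hypothetical embedding function; a real one would call an embedding model.
async function fakeEmbed(text) {
  // Toy 4-dimensional vector built from character codes, for illustration only.
  const vec = [0, 0, 0, 0];
  for (let i = 0; i < text.length; i++) vec[i % 4] += text.charCodeAt(i);
  return vec;
}

// Hypothetical in-memory stand-in for the VECTORIZE_DOMAINS index.
const fakeIndex = {
  rows: new Map(),
  async upsert(vectors) {
    for (const v of vectors) this.rows.set(v.id, v);
  },
};

async function learnDomain(index, embed, page, classification) {
  // Embedding input follows the flow above: domain + title + description + content.
  const text = [page.domain, page.title, page.description, page.content]
    .filter(Boolean)
    .join("\n");
  const values = await embed(text);
  await index.upsert([
    {
      id: page.domain,
      values,
      metadata: {
        domain: page.domain,
        property_type: classification.property_type,
        channel: classification.channel,
        confidence: classification.confidence,
        classified_at: new Date().toISOString(),
      },
    },
  ]);
}

const learned = learnDomain(
  fakeIndex,
  fakeEmbed,
  { domain: "example.com", title: "Example", description: "Demo site", content: "" },
  { property_type: "saas_product", channel: "example_channel", confidence: 0.92 }
).then(() => fakeIndex.rows.get("example.com"));

learned.then((row) => console.log(row.metadata));
```

Using the domain as the record id means a later, better classification overwrites the earlier one instead of adding a duplicate vector; whether the real index is keyed this way is an assumption.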

DataForSEO Classification Hints

DataForSEO returns fields that are useful for classification. Some are already used, and others could be:

Currently Used

Endpoint	Field	How We Use It
ranked_keywords	`website_name`	Create/upgrade brand names
ranked_keywords	`search_intent_info.main_intent`	Keyword intent classification
backlinks	`backlink_spam_score`	Risk scoring
backlinks	`domain_from_country`	Risk flags (RU, CN, VN)
backlinks	`page_from_external_links`	Link farm detection (>100 = risky)
summary	`referring_links_platform_types`	Store counts (news, blogs, etc.)
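
The three backlink fields in the table could feed a risk check along these lines. This is a sketch under assumptions: the flag names and the `riskFlags` helper are hypothetical, and the spam score is passed through unchanged because the table does not state a cutoff for it.

```javascript
// Countries listed as risk flags in the table above.
const FLAGGED_COUNTRIES = new Set(["RU", "CN", "VN"]);
// More than 100 external links on the source page is treated as link-farm risk.
const LINK_FARM_THRESHOLD = 100;

function riskFlags(backlink) {
  const flags = [];
  if (FLAGGED_COUNTRIES.has(backlink.domain_from_country)) {
    flags.push("flagged_country"); // hypothetical flag name
  }
  if (backlink.page_from_external_links > LINK_FARM_THRESHOLD) {
    flags.push("link_farm"); // hypothetical flag name
  }
  return { spam_score: backlink.backlink_spam_score, flags };
}

console.log(
  riskFlags({
    backlink_spam_score: 42,
    domain_from_country: "VN",
    page_from_external_links: 250,
  })
);
```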

Stage 1.5: Google Ads Categories

We cache dfs_category_path from DataForSEO and use it in classification:

/Computers & Electronics/Software -> saas_product
/Retailers & General Merchandise -> ecommerce_store
/News, Media & Publications/News -> news_publisher
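
A minimal sketch of how the cached category path might map to a property type, using only the three mappings listed above. Matching by path prefix, so that sub-categories inherit the parent's type, is an assumption, and `propertyTypeFromCategory` is a hypothetical helper name.

```javascript
// The three mappings shown above; the real table is presumably longer.
const CATEGORY_RULES = [
  ["/Computers & Electronics/Software", "saas_product"],
  ["/Retailers & General Merchandise", "ecommerce_store"],
  ["/News, Media & Publications/News", "news_publisher"],
];

function propertyTypeFromCategory(path) {
  if (!path) return null;
  for (const [prefix, type] of CATEGORY_RULES) {
    // Match the category itself or any sub-category beneath it.
    if (path === prefix || path.startsWith(prefix + "/")) return type;
  }
  return null; // No hint: fall through to the next classification stage.
}

console.log(propertyTypeFromCategory("/Computers & Electronics/Software/Business Software"));
console.log(propertyTypeFromCategory("/Sports"));
```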

Untapped Opportunities

Endpoint	Field	Potential Use
ranked_keywords	`keyword_data.categories`	Aggregate to determine domain industry
summary	`referring_links_tld`	TLD diversity = organic vs spam
summary	`referring_links_semantic_locations`	Footer-heavy = low quality
backlinks	`text_pre`, `text_post`	Editorial vs advertorial detection
backlinks	`url_to_spam_score`	Money page quality assessment

Pipeline Summary

Pipeline	Entry Point	Key Steps	Storage
App Store	Chart crawls, webhooks	HTML scrape -> iTunes -> ensureBrand -> ensureDomain -> CF Images	D1 (apps, brands, rankings)
SERP Tracking	Daily cron	Track positions -> ensureUrl for each result -> auto-classify	D1 (serp_positions) + ClickHouse
Domain Enrichment	On-demand API	3 parallel jobs: keywords, summary, backlinks	D1 (domains, keywords, backlinks)
Domain Classification	Auto (via ensureDomain)	7 stages: Cache -> Rules -> Categories -> Vectorize -> Low-Noise -> Instant -> LLM	D1 (domains.property_type, tier1_type)
URL Classification	Auto (via ensureUrl)	Rules -> Vectorize -> LLM per URL	D1 (urls.page_type, quality_tier)
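
The staged domain classification in the table can be pictured as a cascade that tries cheap stages first and stops at the first sufficiently confident answer. The sketch below is illustrative only: the stage functions, the `government` and `blog` type names, and the `classifyWithCascade` helper are hypothetical, and applying the 70% cutoff to every stage except the last is a simplification of the diagram, which shows that cutoff on the Low-Noise Crawl and Instant Pages stages.

```javascript
const ACCEPT_CONFIDENCE = 0.7;

function classifyWithCascade(domain, stages) {
  let cost = 0;
  for (let i = 0; i < stages.length; i++) {
    const stage = stages[i];
    cost += stage.cost;
    const result = stage.run(domain);
    const isLast = i === stages.length - 1;
    // The final stage (LLM) is accepted regardless of confidence.
    if (result && (isLast || result.confidence >= ACCEPT_CONFIDENCE)) {
      return { ...result, stage: stage.name, cost };
    }
  }
  return null;
}

// Hypothetical stage implementations with the per-call costs from the diagram.
const stages = [
  {
    name: "rules",
    cost: 0,
    run: (d) => (d.endsWith(".gov") ? { type: "government", confidence: 0.95 } : null),
  },
  { name: "low_noise_crawl", cost: 0, run: () => ({ type: "blog", confidence: 0.55 }) },
  { name: "instant_pages", cost: 0.000125, run: () => ({ type: "blog", confidence: 0.75 }) },
  { name: "llm", cost: 0.0001, run: () => ({ type: "blog", confidence: 0.9 }) },
];

console.log(classifyWithCascade("example.gov", stages));
console.log(classifyWithCascade("example.com", stages));
```

Here the first call stops at the free rules stage, while the second falls through a low-confidence crawl result and stops at Instant Pages without ever paying for the LLM.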

Classification Triggers

Trigger	What Happens
`ensureDomain(url)`	If domain not classified with >=60% confidence, queues to `domain-classify`
`ensureUrl(url)`	If URL not classified with >=60% confidence, queues to `backlink-classify`
`POST /admin/classifier/domain`	Force classify a domain (bypasses queue)
`POST /admin/domains/enrich`	Fetch stats + backlinks, which triggers URL classification

Learning Thresholds

Type	Confidence Threshold	What Gets Learned
Domain	>=80%	domain + page metadata -> Vectorize
URL	>=65%	url + domain + classification -> Vectorize

Only LLM classifications trigger learning. Rules, Vectorize, and Content results are not fed back, which avoids circular learning.
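
The gating rule can be expressed as a small predicate. This is a sketch, assuming each classification result carries a `source` field naming the stage that produced it; that field and the `shouldLearn` helper are hypothetical.

```javascript
// Thresholds from the Learning Thresholds table.
const LEARNING_THRESHOLDS = { domain: 0.8, url: 0.65 };

function shouldLearn(kind, result) {
  const threshold = LEARNING_THRESHOLDS[kind];
  if (threshold === undefined) throw new Error(`Unknown kind: ${kind}`);
  // Only LLM output is fed back, so the index never learns from its own matches.
  return result.source === "llm" && result.confidence >= threshold;
}

console.log(shouldLearn("domain", { source: "llm", confidence: 0.85 }));
console.log(shouldLearn("domain", { source: "vectorize", confidence: 0.99 }));
console.log(shouldLearn("url", { source: "llm", confidence: 0.7 }));
```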

Cost Tiers

Tier	Operations	Cost
FREE	HTML scrape, iTunes API, Rules, Vectorize, Low-Noise Crawl	$0
CHEAP	Instant Pages, Domain Summary	$0.000125-0.02
MODERATE	Ranked Keywords, Backlinks	$0.03-0.04
EXPENSIVE	LLM calls, DataForSEO app fallback	$0.0001-0.01
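
As a worked example of these figures, the sketch below adds up the per-domain cost of a full enrichment run and of the two paid classification stages. Prices are held in micro-dollars so the sums stay in integer arithmetic. The LLM figure is the approximate ~$0.0001 from the diagram, and the constant and helper names are hypothetical.

```javascript
// Prices in micro-dollars (1e-6 USD), taken from the diagram and the table above.
const ENRICH_JOBS = {
  fetch_keywords: 30000, // $0.03
  fetch_summary: 20000, // $0.02
  fetch_backlinks: 40000, // $0.04
};
const CLASSIFY_PAID_STAGES = {
  instant_pages: 125, // $0.000125
  llm: 100, // ~$0.0001 (approximate)
};

// Sum integer micro-dollar prices, then convert to USD once at the end.
const sumUsd = (prices) =>
  Object.values(prices).reduce((total, price) => total + price, 0) / 1e6;

console.log(sumUsd(ENRICH_JOBS)); // 0.09
console.log(sumUsd(CLASSIFY_PAID_STAGES)); // 0.000225
```

On these numbers, one full enrichment costs about 400 times as much as a domain that needs both paid classification stages.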

Key Optimizations

1. Auto-classification via ensureUrl/ensureDomain - Every URL/domain gets classified automatically
2. Cache check first - Skip classification if already done with sufficient confidence
3. DataForSEO hints at Stage 1.5 - Use cached category data before expensive stages
4. Low-noise crawl before Instant Pages - 70% of domains classified without paid API
5. Self-learning via Vectorize - High-confidence LLM results improve future classifications
6. HTML scrape before paid APIs - Apple apps fetched free first
7. Similar apps with crawl_depth=0 - Prevents infinite recursion
8. 3 parallel enrichment jobs - Keywords, summary, backlinks run independently
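
The independence of the three enrichment jobs can be illustrated with `Promise.allSettled`, which lets each job succeed or fail without affecting the others. This is only a sketch: in the pipeline described here the jobs are dispatched through the domain-enrich queue, whereas the example uses plain async functions, and the job bodies and the `enrichDomain` helper are hypothetical.

```javascript
// Hypothetical job runners; real ones would call DataForSEO and write to D1.
const jobs = {
  fetch_keywords: async (domain) => ({ domain, rows: 3 }),
  fetch_summary: async () => {
    throw new Error("upstream timeout"); // simulate one failing job
  },
  fetch_backlinks: async (domain) => ({ domain, rows: 5 }),
};

async function enrichDomain(domain) {
  const names = Object.keys(jobs);
  // allSettled waits for every job and never rejects on a single failure.
  const settled = await Promise.allSettled(names.map((name) => jobs[name](domain)));
  return Object.fromEntries(
    names.map((name, i) => [name, settled[i].status === "fulfilled" ? "ok" : "failed"])
  );
}

const enrichment = enrichDomain("example.com");
enrichment.then((report) => console.log(report));
```

With `Promise.all` the failing summary job would have rejected the whole run; with `allSettled` the keywords and backlinks results are still available.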
