System Overview

The highest-level view of RankFabric's data pipelines.

Master System Diagram

flowchart TB
subgraph EntryPoints["Data Entry Points"]
UI[React UI<br/>Manual domain/asset add]
APPLE[Apple App Store<br/>Chart crawls]
GOOGLE[Google Play<br/>DataForSEO webhooks]
SERP[SERP Tracking<br/>Daily keyword rankings]
end

subgraph AppCrawl["App Store Pipeline"]
APPLE --> CRAWL_APPLE[HTML Scrape<br/>FREE]
GOOGLE --> CRAWL_GOOGLE[DataForSEO Webhook<br/>PAID]
CRAWL_APPLE --> ITUNES[iTunes API<br/>FREE fallback]
CRAWL_APPLE --> APP_DETAILS[app-details-consumer]
CRAWL_GOOGLE --> APP_DETAILS
ITUNES --> APP_DETAILS

APP_DETAILS --> BRAND_CREATE[ensureBrand<br/>Create/link brand]
APP_DETAILS --> DOMAIN_CREATE[ensureDomain<br/>from developer_url]
APP_DETAILS --> ICON_UPLOAD[CF Images<br/>Upload icon]
APP_DETAILS --> SIMILAR[Queue similar apps<br/>crawl_depth=0]
end

subgraph SerpPipeline["SERP Tracking Pipeline"]
SERP --> SERP_CONSUMER[serp-consumer]
SERP_CONSUMER --> SERP_ENSURE[ensureUrl for each<br/>ranking URL]
SERP_ENSURE --> SERP_DOMAIN[ensureDomain<br/>auto-queues classification]
SERP_CONSUMER --> SERP_STORE[(serp_positions<br/>serp_runs)]
end

subgraph UrlFlow["URL/Domain Auto-Classification"]
DOMAIN_CREATE --> ENSURE_URL[ensureUrl / ensureDomain]
SERP_DOMAIN --> ENSURE_URL
UI_RESOLVE --> ENSURE_URL

ENSURE_URL --> CHECK_CLASSIFIED{Already<br/>classified?}
CHECK_CLASSIFIED -->|No| DOMAIN_Q[domain-classify queue]
CHECK_CLASSIFIED -->|No| URL_Q[backlink-classify queue]
CHECK_CLASSIFIED -->|Yes| SKIP[Skip - use cached]
end

subgraph DomainPipeline["Domain Enrichment Pipeline (On-Demand)"]
UI --> UI_RESOLVE[Resolve www vs non-www]

ENRICH_API[API: /admin/domains/enrich] --> ENRICH_QUEUE[domain-enrich queue]

ENRICH_QUEUE --> JOB1[fetch_keywords<br/>$0.03]
ENRICH_QUEUE --> JOB2[fetch_summary<br/>$0.02]
ENRICH_QUEUE --> JOB3[fetch_backlinks<br/>$0.04]

JOB1 --> DFS_HINTS[DataForSEO provides:<br/>website_name, categories,<br/>platform_types]
JOB1 --> KW_STORE[(domain_keyword_rankings)]
JOB2 --> SUMMARY_STORE[(domain_summaries)]
JOB3 --> BL_STORE[(backlinks)]
JOB3 --> BL_URL_QUEUE[Queue source URLs<br/>to backlink-classify]

DFS_HINTS -.->|Hints for| CLASSIFIER
end

subgraph Classification["Domain Classification (7 Stages)"]
DOMAIN_Q --> CLASSIFIER[Domain Classifier]
CLASSIFIER --> S0[0. Cache]
S0 --> S1[1. Rules<br/>FREE]
S1 --> S1_5[1.5 Google Ads Categories<br/>from DFS cache - FREE]
S1_5 --> S2[2. Vectorize<br/>FREE]
S2 --> S3[3. Low-Noise Crawl<br/>FREE]
S3 --> S4[4. Instant Pages<br/>$0.000125]
S4 --> S5[5. LLM<br/>~$0.0001]

S3 -->|">=70%"| CLASS_DONE[Store classification]
S4 -->|">=70%"| CLASS_DONE
S5 --> CLASS_DONE

S5 -->|">=80% confidence"| LEARN_DOMAIN[Learn: upsert to Vectorize]
LEARN_DOMAIN -.->|Improves future| S2
end

subgraph UrlClassification["URL Classification"]
URL_Q --> URL_CLASSIFY[URL Classifier<br/>Rules -> Vectorize -> LLM]
BL_URL_QUEUE --> URL_CLASSIFY
URL_CLASSIFY --> URL_UPDATE[(Update urls:<br/>page_type, quality_tier)]

URL_CLASSIFY -->|">=65% confidence"| LEARN_URL[Learn: upsert to Vectorize]
LEARN_URL -.->|Improves future| URL_CLASSIFY
end

subgraph Vectorize["Vectorize Index (Self-Learning)"]
VECTORIZE_DB[(VECTORIZE_DOMAINS<br/>Embeddings + metadata)]
LEARN_DOMAIN --> VECTORIZE_DB
LEARN_URL --> VECTORIZE_DB
S2 -.->|Query similar| VECTORIZE_DB
end

subgraph Storage["Data Storage"]
D1[(Cloudflare D1<br/>apps, domains, urls, brands,<br/>rankings, backlinks)]
KV[(KV<br/>Run state, budgets)]
R2[(R2<br/>Raw HTML)]
IMAGES[(CF Images<br/>App icons)]
CH[(ClickHouse<br/>Analytics, trends)]
end

APP_DETAILS --> D1
KW_STORE --> D1
SUMMARY_STORE --> D1
BL_STORE --> D1
URL_UPDATE --> D1
CLASS_DONE --> D1
ICON_UPLOAD --> IMAGES
SERP_STORE --> D1
SERP_STORE --> CH

style CRAWL_APPLE fill:#c8e6c9
style S1 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2
style JOB1 fill:#fff3e0
style JOB2 fill:#fff3e0
style JOB3 fill:#fff3e0
style ENSURE_URL fill:#e1bee7
style LEARN_DOMAIN fill:#b3e5fc
style LEARN_URL fill:#b3e5fc
style VECTORIZE_DB fill:#b3e5fc

How URLs/Domains Get Classified

Every URL that enters the system flows through ensureUrl(), which:

  1. Creates/updates the urls record
  2. Calls ensureDomain() to create/update the domain record
  3. Auto-queues unclassified domains to the domain-classify queue
  4. Auto-queues unclassified URLs to the backlink-classify queue

This happens for URLs from ALL sources:

  • SERP tracking (new competitor URLs in rankings)
  • App store crawls (developer URLs)
  • Backlink fetches (source URLs)
  • Manual additions via UI
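
The four steps above can be sketched as follows. This is a minimal illustration, not the code in the repository: the in-memory `Map`s and arrays stand in for the D1 tables and the two queues, `createState` and `isClassified` are hypothetical helpers, and the 60% floor is taken from the Classification Triggers table. The duplicate check on the queues is an assumption added to keep the example tidy.

```javascript
// Confidence floor below which a record is (re)queued for classification.
const CONFIDENCE_FLOOR = 0.6;

// Hypothetical in-memory stand-ins for D1 tables and Cloudflare Queues.
function createState() {
  return {
    urls: new Map(),    // url -> { url, domain, confidence }
    domains: new Map(), // domain -> { domain, confidence }
    domainQueue: [],    // stands in for the domain-classify queue
    urlQueue: [],       // stands in for the backlink-classify queue
  };
}

// A record counts as classified once it has a confidence at or above the floor.
function isClassified(record) {
  return typeof record.confidence === "number" && record.confidence >= CONFIDENCE_FLOOR;
}

// Create the domain record if missing, then queue it when unclassified.
function ensureDomain(state, domain) {
  let record = state.domains.get(domain);
  if (!record) {
    record = { domain, confidence: null };
    state.domains.set(domain, record);
  }
  if (!isClassified(record) && !state.domainQueue.includes(domain)) {
    state.domainQueue.push(domain);
  }
  return record;
}

// Create the URL record if missing, ensure its domain, then queue the URL when unclassified.
function ensureUrl(state, rawUrl) {
  const parsed = new URL(rawUrl);
  const url = parsed.href;
  let record = state.urls.get(url);
  if (!record) {
    record = { url, domain: parsed.hostname, confidence: null };
    state.urls.set(url, record);
  }
  ensureDomain(state, parsed.hostname);
  if (!isClassified(record) && !state.urlQueue.includes(url)) {
    state.urlQueue.push(url);
  }
  return record;
}
```

Calling `ensureUrl` twice for two pages on the same new domain queues both URLs but the domain only once; a domain that already has a confidence of 0.6 or higher is not queued at all.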

Vectorize Learning (Self-Improving System)

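High-confidence LLM classifications are written back to the Vectorize index, so later Stage 2 lookups can match similar domains and URLs without an LLM call: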

LLM Classification (>=80% confidence for domains, >=65% for URLs)
|
v
Generate embedding from: domain + title + description + content
|
v
Upsert to VECTORIZE_DOMAINS index with metadata:
{ domain, property_type, channel, confidence, classified_at }
|
v
Future Stage 2 queries find similar domains/URLs
|
v
Better classification without LLM cost
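
As a rough sketch of the middle two steps of this flow, the helper below assembles the text to embed and the metadata object listed above. `buildLearningRecord` is a hypothetical name, and its input shape and the `"web"` channel value in the usage note are placeholders; the embedding call and the upsert to the index are deliberately left out because their exact form is not shown in this document.

```javascript
// Build the embedding input and metadata for one learned classification.
// Hypothetical helper: field names on the input object are assumptions.
function buildLearningRecord(classification, now = new Date()) {
  const { domain, title, description, content, propertyType, channel, confidence } = classification;

  // Embedding text: domain + title + description + content, skipping empty parts.
  const text = [domain, title, description, content]
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim())
    .join("\n");

  return {
    id: domain, // assumption: one vector per domain, keyed by domain name
    text,       // to be passed to the embedding model
    metadata: {
      domain,
      property_type: propertyType,
      channel,
      confidence,
      classified_at: now.toISOString(),
    },
  };
}
```

For example, `buildLearningRecord({ domain: "example.com", title: "Example", description: "", content: "Hello", propertyType: "saas_product", channel: "web", confidence: 0.9 })` yields a `text` of three newline-separated parts, since the empty description is dropped.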

Key Files:

  • src/lib/domain-classifier.js - learnDomainClassification() at line 2481
  • src/lib/backlink-classifier.js - learnFromClassification() at line ~90
  • src/lib/classifier-vectorize.js - addClassifiedUrl() at line 405

DataForSEO Classification Hints

DataForSEO responses include several fields that are useful for classification. Some are already used; others are not yet.

Currently Used

| Endpoint | Field | How We Use It |
| --- | --- | --- |
| ranked_keywords | website_name | Create/upgrade brand names |
| ranked_keywords | search_intent_info.main_intent | Keyword intent classification |
| backlinks | backlink_spam_score | Risk scoring |
| backlinks | domain_from_country | Risk flags (RU, CN, VN) |
| backlinks | page_from_external_links | Link farm detection (>100 = risky) |
| summary | referring_links_platform_types | Store counts (news, blogs, etc.) |
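
The three backlink rows in this table could be combined into a set of risk flags along these lines. The country list and the >100 external-links rule come from the table; the function name, the flag names, and the spam-score cutoff of 50 are assumptions made for illustration only.

```javascript
const RISKY_COUNTRIES = new Set(["RU", "CN", "VN"]); // from the table above
const LINK_FARM_EXTERNAL_LINKS = 100;                // ">100 = risky" from the table above
const SPAM_SCORE_LIMIT = 50;                         // assumption: the real cutoff is not documented here

// Return the list of risk flags raised by one backlink row.
// Missing fields raise no flag, because comparisons against undefined are false.
function backlinkRiskFlags(backlink) {
  const flags = [];
  if (backlink.backlink_spam_score > SPAM_SCORE_LIMIT) {
    flags.push("high_spam_score");
  }
  if (RISKY_COUNTRIES.has(backlink.domain_from_country)) {
    flags.push("risky_country");
  }
  if (backlink.page_from_external_links > LINK_FARM_EXTERNAL_LINKS) {
    flags.push("link_farm");
  }
  return flags;
}
```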

Stage 1.5: Google Ads Categories

We cache dfs_category_path from DataForSEO and use it in classification:

  • /Computers & Electronics/Software -> saas_product
  • /Retailers & General Merchandise -> ecommerce_store
  • /News, Media & Publications/News -> news_publisher
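
A lookup of this kind might look like the sketch below. The three rules are the ones listed above; matching on path prefixes (so that deeper subcategories inherit the parent's type) and the name `propertyTypeFromCategory` are assumptions, since the document does not say how the paths are compared.

```javascript
// Category path prefix -> property type, taken from the list above.
const CATEGORY_RULES = [
  ["/Computers & Electronics/Software", "saas_product"],
  ["/Retailers & General Merchandise", "ecommerce_store"],
  ["/News, Media & Publications/News", "news_publisher"],
];

// Return the property type for a cached dfs_category_path, or null when no rule applies.
// A rule matches the exact path or any deeper path under it; the longest matching prefix wins.
function propertyTypeFromCategory(dfsCategoryPath) {
  if (typeof dfsCategoryPath !== "string") return null;
  let best = null;
  for (const [prefix, propertyType] of CATEGORY_RULES) {
    const matches = dfsCategoryPath === prefix || dfsCategoryPath.startsWith(prefix + "/");
    if (matches && (best === null || prefix.length > best.prefix.length)) {
      best = { prefix, propertyType };
    }
  }
  return best ? best.propertyType : null;
}
```

Returning `null` rather than a default type lets the caller fall through to the next classification stage when the category is unknown.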

Untapped Opportunities

| Endpoint | Field | Potential Use |
| --- | --- | --- |
| ranked_keywords | keyword_data.categories | Aggregate to determine domain industry |
| summary | referring_links_tld | TLD diversity = organic vs spam |
| summary | referring_links_semantic_locations | Footer-heavy = low quality |
| backlinks | text_pre, text_post | Editorial vs advertorial detection |
| backlinks | url_to_spam_score | Money page quality assessment |
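
To make the TLD diversity idea concrete, one possible measure is normalized Shannon entropy over the per-TLD link counts. This is only a sketch of one option: it assumes `referring_links_tld` arrives as an object mapping each TLD to a link count, and `tldDiversity` is a hypothetical helper. How a given score should be interpreted (organic vs spam) would still need to be calibrated against real data.

```javascript
// Normalized Shannon entropy of referring-link TLD counts.
// Returns 0 when all links come from one TLD and 1 when links are spread evenly.
function tldDiversity(referringLinksTld) {
  const counts = Object.values(referringLinksTld || {})
    .filter((n) => Number.isFinite(n) && n > 0);
  if (counts.length < 2) return 0; // zero or one TLD: no diversity to measure

  const total = counts.reduce((sum, n) => sum + n, 0);
  const entropy = counts.reduce((h, n) => {
    const p = n / total;
    return h - p * Math.log2(p);
  }, 0);

  // Divide by the maximum possible entropy for this many TLDs.
  return entropy / Math.log2(counts.length);
}
```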

Pipeline Summary

| Pipeline | Entry Point | Key Steps | Storage |
| --- | --- | --- | --- |
| App Store | Chart crawls, webhooks | HTML scrape -> iTunes -> ensureBrand -> ensureDomain -> CF Images | D1 (apps, brands, rankings) |
| SERP Tracking | Daily cron | Track positions -> ensureUrl for each result -> auto-classify | D1 (serp_positions) + ClickHouse |
| Domain Enrichment | On-demand API | 3 parallel jobs: keywords, summary, backlinks | D1 (domains, keywords, backlinks) |
| Domain Classification | Auto (via ensureDomain) | 7 stages: Cache -> Rules -> Categories -> Vectorize -> Low-Noise -> Instant -> LLM | D1 (domains.property_type, tier1_type) |
| URL Classification | Auto (via ensureUrl) | Rules -> Vectorize -> LLM per URL | D1 (urls.page_type, quality_tier) |
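
The staged Domain Classification row can be read as a cheapest-first cascade with an early exit. The sketch below assumes every stage shares the same 70% exit threshold (the diagram shows it only for the Low-Noise Crawl and Instant Pages stages) and that stages are synchronous functions returning either `null` or `{ property_type, confidence }`; the real stages are presumably asynchronous. `classifyWithCascade` is a hypothetical name.

```javascript
const EARLY_EXIT_CONFIDENCE = 0.7; // ">=70%" edges in the system diagram

// Run stages in order (cheapest first) and stop at the first confident answer.
// If no stage reaches the threshold, return the most recent non-empty result, or null.
function classifyWithCascade(domain, stages, threshold = EARLY_EXIT_CONFIDENCE) {
  let last = null;
  for (const stage of stages) {
    const result = stage.run(domain);
    if (!result) continue; // this stage had no answer; try the next one
    last = { ...result, stage: stage.name };
    if (last.confidence >= threshold) return last; // early exit: later (paid) stages never run
  }
  return last;
}
```

Because the loop returns as soon as a stage is confident enough, a domain resolved by a free stage never reaches Instant Pages or the LLM.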

Classification Triggers

| Trigger | What Happens |
| --- | --- |
| ensureDomain(url) | If domain not classified with >=60% confidence, queues to domain-classify |
| ensureUrl(url) | If URL not classified with >=60% confidence, queues to backlink-classify |
| POST /admin/classifier/domain | Force classify a domain (bypasses queue) |
| POST /admin/domains/enrich | Fetch stats + backlinks, which triggers URL classification |

Learning Thresholds

| Type | Confidence Threshold | What Gets Learned |
| --- | --- | --- |
| Domain | >=80% | domain + page metadata -> Vectorize |
| URL | >=65% | url + domain + classification -> Vectorize |

Only LLM classifications trigger learning. Results from the Rules, Vectorize, and Content stages are NOT fed back into the index, which avoids circular learning.
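
The learning gate described here reduces to a small predicate. In this sketch the thresholds are the documented ones, while the function name, the `source` field, and its `"llm"` value are assumptions about how a classification result might be labeled.

```javascript
// Minimum confidence for a classification to be learned, per record type.
const LEARNING_THRESHOLDS = new Map([
  ["domain", 0.8],
  ["url", 0.65],
]);

// True only for LLM-produced results at or above the threshold for their type.
// Results from other stages are rejected regardless of confidence.
function shouldLearn({ type, source, confidence }) {
  const threshold = LEARNING_THRESHOLDS.get(type);
  if (threshold === undefined) return false; // unknown record type
  return source === "llm" && confidence >= threshold;
}
```

Rejecting non-LLM sources matters most for Vectorize results: if matches from the index were written back into it, the index would reinforce its own earlier guesses.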

Cost Tiers

| Tier | Operations | Cost |
| --- | --- | --- |
| FREE | HTML scrape, iTunes API, Rules, Vectorize, Low-Noise Crawl | $0 |
| CHEAP | Instant Pages, Domain Summary | $0.000125-0.02 |
| MODERATE | Ranked Keywords, Backlinks | $0.03-0.04 |
| EXPENSIVE | LLM calls, DataForSEO app fallback | $0.0001-0.01 |
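
For budgeting, the per-operation prices shown in the system diagram can be summed directly. The sketch below assumes each price applies once per domain and works in millionths of a dollar to avoid floating-point drift; the function names are hypothetical, and the LLM figure is the approximate value from the diagram.

```javascript
// Prices in micro-dollars (1 USD = 1,000,000), taken from the system diagram.
const MICRO_USD = {
  fetch_keywords: 30000,  // $0.03
  fetch_summary: 20000,   // $0.02
  fetch_backlinks: 40000, // $0.04
  instant_pages: 125,     // $0.000125
  llm: 100,               // ~$0.0001 (approximate)
};

// Cost of running all three enrichment jobs for the given number of domains.
function enrichmentCostUsd(domainCount) {
  const perDomain =
    MICRO_USD.fetch_keywords + MICRO_USD.fetch_summary + MICRO_USD.fetch_backlinks;
  return (perDomain * domainCount) / 1e6;
}

// Cost if every domain falls through to both paid classification stages.
function worstCaseClassificationCostUsd(domainCount) {
  return ((MICRO_USD.instant_pages + MICRO_USD.llm) * domainCount) / 1e6;
}
```

Under these assumptions, full enrichment comes to $0.09 per domain ($90 per 1,000 domains), while worst-case classification is about $0.000225 per domain ($0.225 per 1,000).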

Key Optimizations

  1. Auto-classification via ensureUrl/ensureDomain - Every URL/domain gets classified automatically
  2. Cache check first - Skip classification if already done with sufficient confidence
  3. DataForSEO hints at Stage 1.5 - Use cached category data before expensive stages
  4. Low-noise crawl before Instant Pages - 70% of domains classified without paid API
  5. Self-learning via Vectorize - High-confidence LLM results improve future classifications
  6. HTML scrape before paid APIs - Apple apps fetched free first
  7. Similar apps with crawl_depth=0 - Prevents infinite recursion
  8. 3 parallel enrichment jobs - Keywords, summary, backlinks run independently
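
Item 7 can be illustrated with a small queue simulation. The semantics here are an assumption: apps found on charts are given a positive `crawl_depth` (1 in this sketch), and an app only queues its similar apps while its own depth is above zero, so similar apps queued with `crawl_depth=0` are processed but never expand further. `drainCrawlQueue` and the `catalog` shape are hypothetical.

```javascript
// Process a crawl queue and return the app ids in the order they were handled.
// catalog: appId -> { similar: [appId, ...] } (hypothetical stand-in for crawl results)
function drainCrawlQueue(catalog, seedAppIds) {
  // Assumption: chart-crawled seeds start with crawl_depth = 1.
  const queue = seedAppIds.map((appId) => ({ appId, crawl_depth: 1 }));
  const processed = [];

  while (queue.length > 0) {
    const job = queue.shift();
    processed.push(job.appId);

    const app = catalog[job.appId];
    if (!app || job.crawl_depth <= 0) continue; // depth 0: do not expand similar apps

    for (const similarId of app.similar) {
      queue.push({ appId: similarId, crawl_depth: 0 });
    }
  }
  return processed;
}
```

Even when two apps list each other as similar, the loop ends after one hop, because the second app arrives with `crawl_depth: 0` and queues nothing.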