System Overview
The highest-level view of RankFabric's data pipelines.
Master System Diagram
flowchart TB
subgraph EntryPoints["Data Entry Points"]
UI[React UI<br/>Manual domain/asset add]
APPLE[Apple App Store<br/>Chart crawls]
GOOGLE[Google Play<br/>DataForSEO webhooks]
SERP[SERP Tracking<br/>Daily keyword rankings]
end
subgraph AppCrawl["App Store Pipeline"]
APPLE --> CRAWL_APPLE[HTML Scrape<br/>FREE]
GOOGLE --> CRAWL_GOOGLE[DataForSEO Webhook<br/>PAID]
CRAWL_APPLE --> ITUNES[iTunes API<br/>FREE fallback]
CRAWL_APPLE --> APP_DETAILS[app-details-consumer]
CRAWL_GOOGLE --> APP_DETAILS
ITUNES --> APP_DETAILS
APP_DETAILS --> BRAND_CREATE[ensureBrand<br/>Create/link brand]
APP_DETAILS --> DOMAIN_CREATE[ensureDomain<br/>from developer_url]
APP_DETAILS --> ICON_UPLOAD[CF Images<br/>Upload icon]
APP_DETAILS --> SIMILAR[Queue similar apps<br/>crawl_depth=0]
end
subgraph SerpPipeline["SERP Tracking Pipeline"]
SERP --> SERP_CONSUMER[serp-consumer]
SERP_CONSUMER --> SERP_ENSURE[ensureUrl for each<br/>ranking URL]
SERP_ENSURE --> SERP_DOMAIN[ensureDomain<br/>auto-queues classification]
SERP_CONSUMER --> SERP_STORE[(serp_positions<br/>serp_runs)]
end
subgraph UrlFlow["URL/Domain Auto-Classification"]
DOMAIN_CREATE --> ENSURE_URL[ensureUrl / ensureDomain]
SERP_DOMAIN --> ENSURE_URL
UI_RESOLVE --> ENSURE_URL
ENSURE_URL --> CHECK_CLASSIFIED{Already<br/>classified?}
CHECK_CLASSIFIED -->|No| DOMAIN_Q[domain-classify queue]
CHECK_CLASSIFIED -->|No| URL_Q[backlink-classify queue]
CHECK_CLASSIFIED -->|Yes| SKIP[Skip - use cached]
end
subgraph DomainPipeline["Domain Enrichment Pipeline (On-Demand)"]
UI --> UI_RESOLVE[Resolve www vs non-www]
ENRICH_API[API: /admin/domains/enrich] --> ENRICH_QUEUE[domain-enrich queue]
ENRICH_QUEUE --> JOB1[fetch_keywords<br/>$0.03]
ENRICH_QUEUE --> JOB2[fetch_summary<br/>$0.02]
ENRICH_QUEUE --> JOB3[fetch_backlinks<br/>$0.04]
JOB1 --> DFS_HINTS[DataForSEO provides:<br/>website_name, categories,<br/>platform_types]
JOB1 --> KW_STORE[(domain_keyword_rankings)]
JOB2 --> SUMMARY_STORE[(domain_summaries)]
JOB3 --> BL_STORE[(backlinks)]
JOB3 --> BL_URL_QUEUE[Queue source URLs<br/>to backlink-classify]
DFS_HINTS -.->|Hints for| CLASSIFIER
end
subgraph Classification["Domain Classification (7 Stages)"]
DOMAIN_Q --> CLASSIFIER[Domain Classifier]
CLASSIFIER --> S0[0. Cache]
S0 --> S1[1. Rules<br/>FREE]
S1 --> S1_5[1.5 Google Ads Categories<br/>from DFS cache - FREE]
S1_5 --> S2[2. Vectorize<br/>FREE]
S2 --> S3[3. Low-Noise Crawl<br/>FREE]
S3 --> S4[4. Instant Pages<br/>$0.000125]
S4 --> S5[5. LLM<br/>~$0.0001]
S3 -->|">=70%"| CLASS_DONE[Store classification]
S4 -->|">=70%"| CLASS_DONE
S5 --> CLASS_DONE
S5 -->|">=80% confidence"| LEARN_DOMAIN[Learn: upsert to Vectorize]
LEARN_DOMAIN -.->|Improves future| S2
end
subgraph UrlClassification["URL Classification"]
URL_Q --> URL_CLASSIFY[URL Classifier<br/>Rules -> Vectorize -> LLM]
BL_URL_QUEUE --> URL_CLASSIFY
URL_CLASSIFY --> URL_UPDATE[(Update urls:<br/>page_type, quality_tier)]
URL_CLASSIFY -->|">=65% confidence"| LEARN_URL[Learn: upsert to Vectorize]
LEARN_URL -.->|Improves future| URL_CLASSIFY
end
subgraph Vectorize["Vectorize Index (Self-Learning)"]
VECTORIZE_DB[(VECTORIZE_DOMAINS<br/>Embeddings + metadata)]
LEARN_DOMAIN --> VECTORIZE_DB
LEARN_URL --> VECTORIZE_DB
S2 -.->|Query similar| VECTORIZE_DB
end
subgraph Storage["Data Storage"]
D1[(Cloudflare D1<br/>apps, domains, urls, brands,<br/>rankings, backlinks)]
KV[(KV<br/>Run state, budgets)]
R2[(R2<br/>Raw HTML)]
IMAGES[(CF Images<br/>App icons)]
CH[(ClickHouse<br/>Analytics, trends)]
end
APP_DETAILS --> D1
KW_STORE --> D1
SUMMARY_STORE --> D1
BL_STORE --> D1
URL_UPDATE --> D1
CLASS_DONE --> D1
ICON_UPLOAD --> IMAGES
SERP_STORE --> D1
SERP_STORE --> CH
style CRAWL_APPLE fill:#c8e6c9
style S1 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2
style JOB1 fill:#fff3e0
style JOB2 fill:#fff3e0
style JOB3 fill:#fff3e0
style ENSURE_URL fill:#e1bee7
style LEARN_DOMAIN fill:#b3e5fc
style LEARN_URL fill:#b3e5fc
style VECTORIZE_DB fill:#b3e5fc
How URLs/Domains Get Classified
Every URL that enters the system flows through ensureUrl(), which:
- Creates/updates the urls record
- Calls ensureDomain() to create/update the domain record
- Auto-queues unclassified domains to the domain-classify queue
- Auto-queues unclassified URLs to the backlink-classify queue
This happens for URLs from ALL sources:
- SERP tracking (new competitor URLs in rankings)
- App store crawls (developer URLs)
- Backlink fetches (source URLs)
- Manual additions via UI
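The ensureUrl() steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the in-memory store, the field names (classified, queued), and the de-duplication flag are all assumptions standing in for D1 tables and Cloudflare queues.

```javascript
// Hypothetical in-memory stand-ins for the D1 tables and the two queues.
function createStore() {
  return {
    urls: new Map(),
    domains: new Map(),
    domainClassifyQueue: [],
    backlinkClassifyQueue: [],
  };
}

// Create the domain record if missing; queue it while it is unclassified.
function ensureDomain(store, hostname) {
  let domain = store.domains.get(hostname);
  if (!domain) {
    domain = { hostname, classified: false, queued: false };
    store.domains.set(hostname, domain);
  }
  if (!domain.classified && !domain.queued) {
    domain.queued = true; // assumption: avoid queueing the same domain twice
    store.domainClassifyQueue.push(hostname);
  }
  return domain;
}

// Create the URL record, ensure its domain, and queue the URL if unclassified.
function ensureUrl(store, rawUrl) {
  const { hostname, href } = new URL(rawUrl);
  let record = store.urls.get(href);
  if (!record) {
    record = { url: href, hostname, classified: false, queued: false };
    store.urls.set(href, record);
  }
  ensureDomain(store, hostname);
  if (!record.classified && !record.queued) {
    record.queued = true;
    store.backlinkClassifyQueue.push(href);
  }
  return record;
}
```

Calling ensureUrl() twice for the same URL leaves a single record and a single queue entry, which is the property that lets every source call it unconditionally.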
Vectorize Learning (Self-Improving System)
The classification system gets smarter over time:
LLM Classification (>=80% confidence for domains, >=65% for URLs)
|
v
Generate embedding from: domain + title + description + content
|
v
Upsert to VECTORIZE_DOMAINS index with metadata:
{ domain, property_type, channel, confidence, classified_at }
|
v
Future Stage 2 queries find similar domains/URLs
|
v
Better classification without LLM cost
Key Files:
- src/lib/domain-classifier.js - learnDomainClassification() at line 2481
- src/lib/backlink-classifier.js - learnFromClassification() at line ~90
- src/lib/classifier-vectorize.js - addClassifiedUrl() at line 405
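As a rough sketch of the learning flow above, the record handed to the index could be assembled like this. The helper names, the choice of the domain as the vector id, and the injected embed function are assumptions for illustration; only the metadata fields and the embedding inputs come from the flow itself.

```javascript
// Joins the fields named in the flow into the text to embed, skipping empty parts.
function buildEmbeddingText({ domain, title, description, content }) {
  return [domain, title, description, content]
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .join("\n");
}

// Shapes a record for upsert. `embed` is a hypothetical function that
// takes text and returns an array of numbers.
function buildVectorRecord(classification, page, embed) {
  const text = buildEmbeddingText({ domain: classification.domain, ...page });
  return {
    id: classification.domain, // assumption: one vector per domain
    values: embed(text),
    metadata: {
      domain: classification.domain,
      property_type: classification.property_type,
      channel: classification.channel,
      confidence: classification.confidence,
      classified_at: new Date().toISOString(),
    },
  };
}
```

Injecting embed keeps the sketch testable without a real embedding model; in practice the embedding call and the upsert would be asynchronous.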
DataForSEO Classification Hints
DataForSEO returns several fields that are useful for classification. Some are already in use; others are not yet:
Currently Used
| Endpoint | Field | How We Use It |
|---|---|---|
| ranked_keywords | website_name | Create/upgrade brand names |
| ranked_keywords | search_intent_info.main_intent | Keyword intent classification |
| backlinks | backlink_spam_score | Risk scoring |
| backlinks | domain_from_country | Risk flags (RU, CN, VN) |
| backlinks | page_from_external_links | Link farm detection (>100 = risky) |
| summary | referring_links_platform_types | Store counts (news, blogs, etc.) |
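The two backlink risk rules in the table can be expressed as a small function. This is a sketch under assumptions: the flag names are invented for illustration, and backlink_spam_score is left out because no threshold is given for it here.

```javascript
// Country codes and the external-link cutoff come from the table above.
const RISKY_COUNTRIES = new Set(["RU", "CN", "VN"]);
const LINK_FARM_THRESHOLD = 100;

// Returns hypothetical flag names for one DataForSEO backlink row.
function backlinkRiskFlags(backlink) {
  const flags = [];
  if (RISKY_COUNTRIES.has(backlink.domain_from_country)) {
    flags.push("risky_country");
  }
  // Strictly greater than 100 external links on the source page counts as risky.
  if ((backlink.page_from_external_links ?? 0) > LINK_FARM_THRESHOLD) {
    flags.push("link_farm");
  }
  return flags;
}
```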
Stage 1.5: Google Ads Categories
We cache dfs_category_path from DataForSEO and use it in classification:
- /Computers & Electronics/Software -> saas_product
- /Retailers & General Merchandise -> ecommerce_store
- /News, Media & Publications/News -> news_publisher
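One way to apply these mappings is a prefix lookup on the cached path. The sketch below assumes that a deeper path inherits the mapping of its parent category; the function and constant names are illustrative, and only the three listed mappings are included.

```javascript
// The three mappings listed above; any further entries would be assumptions.
const CATEGORY_RULES = [
  ["/Computers & Electronics/Software", "saas_product"],
  ["/Retailers & General Merchandise", "ecommerce_store"],
  ["/News, Media & Publications/News", "news_publisher"],
];

// Returns the mapped property type when the path equals or extends a known
// prefix, otherwise null so the next stage can run.
function classifyByCategoryPath(dfsCategoryPath) {
  if (!dfsCategoryPath) return null;
  const match = CATEGORY_RULES.find(
    ([prefix]) =>
      dfsCategoryPath === prefix || dfsCategoryPath.startsWith(prefix + "/")
  );
  return match ? match[1] : null;
}
```

Matching on `prefix + "/"` rather than a bare startsWith avoids treating a sibling such as /Computers & Electronics/Software Piracy as a Software subcategory.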
Untapped Opportunities
| Endpoint | Field | Potential Use |
|---|---|---|
| ranked_keywords | keyword_data.categories | Aggregate to determine domain industry |
| summary | referring_links_tld | TLD diversity = organic vs spam |
| summary | referring_links_semantic_locations | Footer-heavy = low quality |
| backlinks | text_pre, text_post | Editorial vs advertorial detection |
| backlinks | url_to_spam_score | Money page quality assessment |
Pipeline Summary
| Pipeline | Entry Point | Key Steps | Storage |
|---|---|---|---|
| App Store | Chart crawls, webhooks | HTML scrape -> iTunes -> ensureBrand -> ensureDomain -> CF Images | D1 (apps, brands, rankings) |
| SERP Tracking | Daily cron | Track positions -> ensureUrl for each result -> auto-classify | D1 (serp_positions) + ClickHouse |
| Domain Enrichment | On-demand API | 3 parallel jobs: keywords, summary, backlinks | D1 (domains, keywords, backlinks) |
| Domain Classification | Auto (via ensureDomain) | 7 stages: Cache -> Rules -> Categories -> Vectorize -> Low-Noise -> Instant -> LLM | D1 (domains.property_type, tier1_type) |
| URL Classification | Auto (via ensureUrl) | Rules -> Vectorize -> LLM per URL | D1 (urls.page_type, quality_tier) |
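The staged domain classification can be pictured as a cascade that stops at the first sufficiently confident stage. The sketch below is a simplification: the diagram only labels >=70% exits for the Low-Noise Crawl and Instant Pages stages, while this version applies one threshold to every non-final stage. The stage objects and their run functions are hypothetical.

```javascript
// Each stage is { name, run }, where run(input) returns
// { label, confidence } or null when the stage has no answer.
function runCascade(stages, input, threshold = 0.7) {
  for (let i = 0; i < stages.length; i++) {
    const { name, run } = stages[i];
    const result = run(input);
    if (!result) continue;
    const isLast = i === stages.length - 1;
    // The final stage (the LLM in the diagram) is always accepted.
    if (isLast || result.confidence >= threshold) {
      return { stage: name, ...result };
    }
  }
  return null;
}
```

Ordering the stages from free to paid means the paid stages only run for inputs the cheaper stages could not settle.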
Classification Triggers
| Trigger | What Happens |
|---|---|
| ensureDomain(url) | If domain not classified with >=60% confidence, queues to domain-classify |
| ensureUrl(url) | If URL not classified with >=60% confidence, queues to backlink-classify |
| POST /admin/classifier/domain | Force classify a domain (bypasses queue) |
| POST /admin/domains/enrich | Fetch stats + backlinks, which triggers URL classification |
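The >=60% check shared by ensureDomain and ensureUrl reduces to a small predicate. The function name and the confidence field (a 0 to 1 number) are assumptions for illustration.

```javascript
const MIN_CONFIDENCE = 0.6;

// True when a record has never been classified or its stored
// confidence is below the 60% cutoff from the table above.
function needsClassification(record) {
  const confidence = record?.confidence;
  return typeof confidence !== "number" || confidence < MIN_CONFIDENCE;
}
```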
Learning Thresholds
| Type | Confidence Threshold | What Gets Learned |
|---|---|---|
| Domain | >=80% | domain + page metadata -> Vectorize |
| URL | >=65% | url + domain + classification -> Vectorize |
Only LLM classifications trigger learning. Results from the Rules, Vectorize, and Content stages are not fed back, which avoids circular learning.
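The learning rule can be written as a single gate. This is a sketch: the source labels ("llm" and so on) and the function name are assumptions, while the thresholds are the ones in the table.

```javascript
// Thresholds from the Learning Thresholds table, as 0 to 1 fractions.
const LEARNING_THRESHOLDS = { domain: 0.8, url: 0.65 };

// Only LLM results at or above the per-type threshold are learned.
function shouldLearn({ type, source, confidence }) {
  if (source !== "llm") return false;
  const threshold = LEARNING_THRESHOLDS[type];
  return threshold !== undefined && confidence >= threshold;
}
```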
Cost Tiers
| Tier | Operations | Cost |
|---|---|---|
| FREE | HTML scrape, iTunes API, Rules, Vectorize, Low-Noise Crawl | $0 |
| CHEAP | Instant Pages, Domain Summary | $0.000125-0.02 |
| MODERATE | Ranked Keywords, Backlinks | $0.03-0.04 |
| EXPENSIVE | LLM calls, DataForSEO app fallback | $0.0001-0.01 |
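For budgeting, the per-operation prices shown in the diagram can be summed. The sketch below stores prices in micro-dollars to avoid floating-point drift; the operation keys and the function name are assumptions, and the LLM price uses the approximate ~$0.0001 figure.

```javascript
// Per-call prices in micro-dollars (1 USD = 1,000,000), taken from the diagram.
const COST_MICRO_USD = {
  fetch_keywords: 30000, // $0.03
  fetch_summary: 20000, // $0.02
  fetch_backlinks: 40000, // $0.04
  instant_pages: 125, // $0.000125
  llm: 100, // approximately $0.0001
};

// counts maps an operation name to how many times it runs.
function estimateCostUsd(counts) {
  let micro = 0;
  for (const [operation, n] of Object.entries(counts)) {
    if (!(operation in COST_MICRO_USD)) {
      throw new Error(`Unknown operation: ${operation}`);
    }
    micro += COST_MICRO_USD[operation] * n;
  }
  return micro / 1e6;
}
```

Running all three enrichment jobs once for a domain, estimateCostUsd({ fetch_keywords: 1, fetch_summary: 1, fetch_backlinks: 1 }), comes to 0.09, that is $0.09 per enriched domain.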
Key Optimizations
- Auto-classification via ensureUrl/ensureDomain - Every URL/domain gets classified automatically
- Cache check first - Skip classification if already done with sufficient confidence
- DataForSEO hints at Stage 1.5 - Use cached category data before expensive stages
- Low-noise crawl before Instant Pages - 70% of domains classified without paid API
- Self-learning via Vectorize - High-confidence LLM results improve future classifications
- HTML scrape before paid APIs - Apple apps fetched free first
- Similar apps with crawl_depth=0 - Prevents infinite recursion
- 3 parallel enrichment jobs - Keywords, summary, backlinks run independently
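The crawl_depth=0 guard can be illustrated as below. This is one possible reading, not the project's code: it assumes the consumer only expands similar apps for messages whose crawl_depth is greater than 0, and that chart-crawled apps arrive with a positive depth. The function and field names are hypothetical.

```javascript
// Queues similar apps with crawl_depth 0 so they are fetched
// but never expanded further.
function handleAppDetails(message, similarAppIds, queue) {
  if (message.crawl_depth > 0) {
    for (const appId of similarAppIds) {
      queue.push({ app_id: appId, crawl_depth: 0 });
    }
  }
}
```

Because every queued similar app carries depth 0, processing it adds nothing new to the queue, so the crawl stops after one hop.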