System Overview
The highest-level view of RankFabric's data pipelines.
Master System Diagram
How URLs/Domains Get Classified
Every URL that enters the system flows through ensureUrl(), which:
- Creates/updates the urls record
- Calls ensureDomain() to create/update the domain record
- Auto-queues unclassified domains to the domain-classify queue
- Auto-queues unclassified URLs to the backlink-classify queue
This happens for URLs from ALL sources:
- SERP tracking (new competitor URLs in rankings)
- App store crawls (developer URLs)
- Backlink fetches (source URLs)
- Manual additions via UI
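The flow above can be sketched as follows. This is a minimal in-memory illustration, not RankFabric's actual code: the Map-based stores, the array-backed queues, the needsClassification() and enqueueOnce() helpers, and the record fields (classified, confidence, sources) are all assumptions. Only the function names ensureUrl()/ensureDomain(), the queue names, and the 60% threshold from the Classification Triggers table come from this document.

```javascript
// Hypothetical in-memory stand-ins for the urls/domains tables and the two queues.
const urls = new Map();
const domains = new Map();
const queues = { "domain-classify": [], "backlink-classify": [] };

// From the Classification Triggers table: re-queue below 60% confidence.
const MIN_CONFIDENCE = 0.6;

// Assumed record shape: { classified: boolean, confidence: number in [0, 1] }.
function needsClassification(record) {
  return !record.classified || record.confidence < MIN_CONFIDENCE;
}

// Avoid queueing the same key twice (assumed behaviour, for illustration).
function enqueueOnce(queueName, key) {
  if (!queues[queueName].includes(key)) queues[queueName].push(key);
}

function ensureDomain(rawUrl) {
  const hostname = new URL(rawUrl).hostname;
  const record =
    domains.get(hostname) ?? { domain: hostname, classified: false, confidence: 0 };
  domains.set(hostname, record);
  if (needsClassification(record)) enqueueOnce("domain-classify", hostname);
  return record;
}

function ensureUrl(rawUrl, source) {
  const record =
    urls.get(rawUrl) ?? { url: rawUrl, classified: false, confidence: 0, sources: [] };
  if (!record.sources.includes(source)) record.sources.push(source);
  urls.set(rawUrl, record);
  // Every URL also creates/updates its domain record.
  record.domain = ensureDomain(rawUrl).domain;
  if (needsClassification(record)) enqueueOnce("backlink-classify", rawUrl);
  return record;
}
```

Because every source calls the same entry point, a SERP result and a backlink source URL on the same host produce one domain-classify entry and two backlink-classify entries in this sketch.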
Vectorize Learning (Self-Improving System)
High-confidence LLM classifications are fed back into a vector index, so later lookups can resolve similar domains and URLs without an LLM call:
LLM Classification (>=80% confidence for domains, >=65% for URLs)
|
v
Generate embedding from: domain + title + description + content
|
v
Upsert to VECTORIZE_DOMAINS index with metadata:
{ domain, property_type, channel, confidence, classified_at }
|
v
Future Stage 2 queries find similar domains/URLs
|
v
Better classification without LLM cost
Key Files:
- src/lib/domain-classifier.js - learnDomainClassification() at line 2481
- src/lib/backlink-classifier.js - learnFromClassification() at line ~90
- src/lib/classifier-vectorize.js - addClassifiedUrl() at line 405
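As an illustration of the upsert step, the sketch below builds the record that would be sent to the index. The helper name buildLearningRecord(), the input shape, and the 2,000-character cap are assumptions. The metadata keys are the ones listed in the flow above. The embedding call and the index upsert are left out because their signatures are not shown in this document.

```javascript
// Hypothetical helper: assembles embedding text and metadata for one domain.
function buildLearningRecord(page, classification, now = new Date()) {
  // Embedding input is built from domain + title + description + content,
  // skipping any part that is missing or blank.
  const text = [page.domain, page.title, page.description, page.content]
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim())
    .join("\n")
    .slice(0, 2000); // assumed cap to keep the embedding input small

  return {
    id: page.domain,
    text,
    metadata: {
      domain: page.domain,
      property_type: classification.property_type,
      channel: classification.channel,
      confidence: classification.confidence,
      classified_at: now.toISOString(),
    },
  };
}
```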
DataForSEO Classification Hints
DataForSEO returns several fields that are useful for classification. Some are already in use; others are not yet used.
Currently Used
| Endpoint | Field | How We Use It |
|---|---|---|
| ranked_keywords | website_name | Create/upgrade brand names |
| ranked_keywords | search_intent_info.main_intent | Keyword intent classification |
| backlinks | backlink_spam_score | Risk scoring |
| backlinks | domain_from_country | Risk flags (RU, CN, VN) |
| backlinks | page_from_external_links | Link farm detection (>100 = risky) |
| summary | referring_links_platform_types | Store counts (news, blogs, etc.) |
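The backlink risk signals in the table could be combined roughly as follows. The function name backlinkRiskFlags() and the flag labels are assumptions for illustration. The country list and the >100 external-links rule come from the table. The spam score is passed through unchanged because no cutoff is given in this document.

```javascript
// Countries flagged as risky in the table above.
const RISKY_COUNTRIES = new Set(["RU", "CN", "VN"]);
// More than 100 external links on the source page suggests a link farm.
const LINK_FARM_THRESHOLD = 100;

// Hypothetical helper: takes one backlink item using DataForSEO field names.
function backlinkRiskFlags(backlink) {
  const flags = [];
  const country = (backlink.domain_from_country || "").toUpperCase();
  if (RISKY_COUNTRIES.has(country)) flags.push(`country:${country}`);
  if ((backlink.page_from_external_links ?? 0) > LINK_FARM_THRESHOLD) {
    flags.push("link_farm");
  }
  // No spam-score cutoff is documented, so the raw value is returned as-is.
  return { spamScore: backlink.backlink_spam_score ?? null, flags };
}
```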
Stage 1.5: Google Ads Categories
We cache dfs_category_path from DataForSEO and use it in classification:
- /Computers & Electronics/Software -> saas_product
- /Retailers & General Merchandise -> ecommerce_store
- /News, Media & Publications/News -> news_publisher
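A minimal sketch of that lookup, assuming the cached dfs_category_path is a single slash-delimited string and that the longest matching prefix wins. The function name propertyTypeFromCategory() and the prefix-matching rule are assumptions. Only the three mappings come from this document.

```javascript
// The three mappings listed above; the real table is presumably longer.
const CATEGORY_TO_PROPERTY_TYPE = [
  ["/Computers & Electronics/Software", "saas_product"],
  ["/Retailers & General Merchandise", "ecommerce_store"],
  ["/News, Media & Publications/News", "news_publisher"],
];

// Hypothetical helper: returns a property type, or null when nothing matches.
function propertyTypeFromCategory(categoryPath) {
  if (typeof categoryPath !== "string") return null;
  let best = null;
  for (const [prefix, propertyType] of CATEGORY_TO_PROPERTY_TYPE) {
    // Match whole path segments only, so ".../Softwarehouse" does not match.
    const matches = categoryPath === prefix || categoryPath.startsWith(prefix + "/");
    if (matches && (best === null || prefix.length > best.prefix.length)) {
      best = { prefix, propertyType };
    }
  }
  return best ? best.propertyType : null;
}
```

A null result means this stage has no opinion, and classification would continue with the later stages.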
Untapped Opportunities
| Endpoint | Field | Potential Use |
|---|---|---|
| ranked_keywords | keyword_data.categories | Aggregate to determine domain industry |
| summary | referring_links_tld | TLD diversity = organic vs spam |
| summary | referring_links_semantic_locations | Footer-heavy = low quality |
| backlinks | text_pre, text_post | Editorial vs advertorial detection |
| backlinks | url_to_spam_score | Money page quality assessment |
Pipeline Summary
| Pipeline | Entry Point | Key Steps | Storage |
|---|---|---|---|
| App Store | Chart crawls, webhooks | HTML scrape -> iTunes -> ensureBrand -> ensureDomain -> CF Images | D1 (apps, brands, rankings) |
| SERP Tracking | Daily cron | Track positions -> ensureUrl for each result -> auto-classify | D1 (serp_positions) + ClickHouse |
| Domain Enrichment | On-demand API | 3 parallel jobs: keywords, summary, backlinks | D1 (domains, keywords, backlinks) |
| Domain Classification | Auto (via ensureDomain) | 7 stages: Cache -> Rules -> Categories -> Vectorize -> Low-Noise -> Instant -> LLM | D1 (domains.property_type, tier1_type) |
| URL Classification | Auto (via ensureUrl) | Rules -> Vectorize -> LLM per URL | D1 (urls.page_type, quality_tier) |
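The staged approach in the Domain Classification and URL Classification rows amounts to a short-circuiting cascade: cheaper stages run first, and the first stage that returns a result ends the run. The sketch below shows only that control flow. The stage functions are hypothetical stubs, the "shop." rule is invented for the example, and the assumption that any non-null result stops the cascade is a simplification, since the real stages may apply their own confidence checks.

```javascript
// Runs stages in order and stops at the first one that returns a result.
function classifyWithCascade(domain, stages) {
  const tried = [];
  for (const stage of stages) {
    tried.push(stage.name);
    const result = stage.run(domain);
    if (result) return { ...result, stage: stage.name, tried };
  }
  return { property_type: null, stage: null, tried };
}

// Hypothetical stubs, ordered cheapest first with the LLM last.
const exampleStages = [
  { name: "cache", run: () => null },
  {
    name: "rules",
    // Invented rule, for illustration only.
    run: (d) =>
      d.startsWith("shop.") ? { property_type: "ecommerce_store", confidence: 0.9 } : null,
  },
  { name: "llm", run: () => ({ property_type: "news_publisher", confidence: 0.7 }) },
];
```

The tried array records which stages ran, which makes it easy to check how often the paid stages are actually reached.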
Classification Triggers
| Trigger | What Happens |
|---|---|
| ensureDomain(url) | If domain not classified with >=60% confidence, queues to domain-classify |
| ensureUrl(url) | If URL not classified with >=60% confidence, queues to backlink-classify |
| POST /admin/classifier/domain | Force classify a domain (bypasses queue) |
| POST /admin/domains/enrich | Fetch stats + backlinks, which triggers URL classification |
Learning Thresholds
| Type | Confidence Threshold | What Gets Learned |
|---|---|---|
| Domain | >=80% | domain + page metadata -> Vectorize |
| URL | >=65% | url + domain + classification -> Vectorize |
Only LLM classifications trigger learning. Rules, Vectorize, and Content results are NOT fed back, which avoids circular learning.
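The gating rule can be expressed as a small predicate. The function name shouldLearn() and the kind/source field values are assumptions. The thresholds and the LLM-only rule are taken from the table and note above.

```javascript
// Learning thresholds from the table: domains >= 80%, URLs >= 65%.
const LEARNING_THRESHOLDS = new Map([
  ["domain", 0.8],
  ["url", 0.65],
]);

// Hypothetical predicate: result is assumed to look like { source, confidence }.
function shouldLearn(kind, result) {
  const threshold = LEARNING_THRESHOLDS.get(kind);
  if (threshold === undefined) return false;
  // Only LLM results are fed back; learning from other stages would be circular.
  if (result.source !== "llm") return false;
  return result.confidence >= threshold;
}
```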
Cost Tiers
| Tier | Operations | Cost |
|---|---|---|
| FREE | HTML scrape, iTunes API, Rules, Vectorize, Low-Noise Crawl | $0 |
| CHEAP | Instant Pages, Domain Summary | $0.000125-0.02 |
| MODERATE | Ranked Keywords, Backlinks | $0.03-0.04 |
| EXPENSIVE | LLM calls, DataForSEO app fallback | $0.0001-0.01 |
Key Optimizations
- Auto-classification via ensureUrl/ensureDomain - Every URL/domain gets classified automatically
- Cache check first - Skip classification if already done with sufficient confidence
- DataForSEO hints at Stage 1.5 - Use cached category data before expensive stages
- Low-noise crawl before Instant Pages - 70% of domains classified without paid API
- Self-learning via Vectorize - High-confidence LLM results improve future classifications
- HTML scrape before paid APIs - Apple apps fetched free first
- Similar apps with crawl_depth=0 - Prevents infinite recursion
- 3 parallel enrichment jobs - Keywords, summary, backlinks run independently
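The last point, running the three enrichment jobs independently, could look like the sketch below. Promise.allSettled is used so that one failing job does not discard the other two results. The enrichDomain() name, the report shape, and the job functions passed in by the caller are hypothetical placeholders, not RankFabric's actual fetchers.

```javascript
// Runs all jobs concurrently and collects successes and failures separately.
async function enrichDomain(domain, jobs) {
  const names = Object.keys(jobs);
  const settled = await Promise.allSettled(
    // Promise.resolve().then(...) also captures synchronous throws from a job.
    names.map((name) => Promise.resolve().then(() => jobs[name](domain)))
  );

  const report = { domain, results: {}, errors: {} };
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      report.results[names[i]] = outcome.value;
    } else {
      const reason = outcome.reason;
      report.errors[names[i]] = reason instanceof Error ? reason.message : String(reason);
    }
  });
  return report;
}
```

A caller would pass one function per job, for example { keywords, summary, backlinks }, and could then retry only the entries that appear under errors.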