System Overview

A high-level view of RankFabric's data pipelines: how URLs and domains enter the system, how they are classified, and what each step costs.

Master System Diagram

How URLs/Domains Get Classified

Every URL that enters the system flows through ensureUrl(), which:

Creates/updates the urls record
Calls ensureDomain() to create/update the domain record
Auto-queues unclassified domains to the domain-classify queue
Auto-queues unclassified URLs to the backlink-classify queue

This happens for URLs from ALL sources:

SERP tracking (new competitor URLs in rankings)
App store crawls (developer URLs)
Backlink fetches (source URLs)
Manual additions via UI
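
The flow above can be sketched as follows. This is a minimal illustration, not the actual implementation: the in-memory `db` and `queues` objects, the `queued` flag used to avoid duplicate queue entries, and the record shapes are all assumptions made so the example runs on its own. The 60% cutoff is the one listed under Classification Triggers.

```javascript
// Hypothetical in-memory stand-ins for the urls/domains tables and the two queues.
const db = { urls: new Map(), domains: new Map() };
const queues = { 'domain-classify': [], 'backlink-classify': [] };

// A record counts as classified at >= 60% confidence.
const MIN_CONFIDENCE = 0.6;

function isClassified(record) {
  return record.confidence !== null && record.confidence >= MIN_CONFIDENCE;
}

function ensureDomain(rawUrl) {
  const domain = new URL(rawUrl).hostname;
  let record = db.domains.get(domain);
  if (!record) {
    record = { domain, confidence: null, queued: false };
    db.domains.set(domain, record);
  }
  if (!isClassified(record) && !record.queued) {
    queues['domain-classify'].push({ domain });
    record.queued = true; // assumption: never queue the same domain twice
  }
  return record;
}

function ensureUrl(rawUrl, source) {
  let record = db.urls.get(rawUrl);
  if (!record) {
    record = { url: rawUrl, source, confidence: null, queued: false };
    db.urls.set(rawUrl, record);
  }
  // Every URL also creates or updates its domain record.
  record.domain = ensureDomain(rawUrl).domain;
  if (!isClassified(record) && !record.queued) {
    queues['backlink-classify'].push({ url: rawUrl });
    record.queued = true;
  }
  return record;
}

ensureUrl('https://example.com/pricing', 'serp');
ensureUrl('https://example.com/blog/post', 'backlink');
console.log(queues['domain-classify'].length);   // 1 (same domain, queued once)
console.log(queues['backlink-classify'].length); // 2 (one entry per URL)
```

Because the domain step runs inside the URL step, callers only need the single `ensureUrl()` entry point regardless of where the URL came from.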

Vectorize Learning (Self-Improving System)

The classification system improves over time because high-confidence LLM results are stored in Vectorize and reused by later lookups:

LLM Classification (>=80% confidence for domains, >=65% for URLs)
    |
    v
Generate embedding from: domain + title + description + content
    |
    v
Upsert to VECTORIZE_DOMAINS index with metadata:
  { domain, property_type, channel, confidence, classified_at }
    |
    v
Future Stage 2 queries find similar domains/URLs
    |
    v
Better classification without LLM cost
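
A rough sketch of the embed-and-upsert step in the flow above. The `embed` function and the `index` object are injected stubs, not the real bindings; using the domain as the vector id, the ISO timestamp format, and the `channel` value are assumptions for illustration. The confidence check is assumed to have happened before this function is called.

```javascript
// Joins the fields named in the flow into one embedding input string,
// skipping any that are missing or empty.
function buildEmbeddingText({ domain, title, description, content }) {
  return [domain, title, description, content]
    .filter((part) => typeof part === 'string' && part.trim() !== '')
    .join('\n');
}

// `embed` and `index` are passed in so the sketch runs without external services.
async function learnDomain(page, classification, { embed, index }) {
  const values = await embed(buildEmbeddingText(page));
  const vector = {
    id: page.domain, // assumption: one vector per domain
    values,
    metadata: {
      domain: page.domain,
      property_type: classification.property_type,
      channel: classification.channel,
      confidence: classification.confidence,
      classified_at: new Date().toISOString(),
    },
  };
  await index.upsert([vector]);
  return vector;
}

// Stubs: a fake embedding and an in-memory index.
const stored = [];
const fakeEmbed = async (text) => [text.length, 0, 0];
const fakeIndex = { upsert: async (vectors) => { stored.push(...vectors); } };

learnDomain(
  { domain: 'example.com', title: 'Example', description: '', content: 'Hello' },
  { property_type: 'saas_product', channel: 'web', confidence: 0.92 }, // placeholder values
  { embed: fakeEmbed, index: fakeIndex },
).then((vector) => console.log(vector.metadata));
```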

Key Files:

src/lib/domain-classifier.js - learnDomainClassification() at line 2481
src/lib/backlink-classifier.js - learnFromClassification() at line ~90
src/lib/classifier-vectorize.js - addClassifiedUrl() at line 405

DataForSEO Classification Hints

DataForSEO responses include several fields that help with classification. Some are already used; others are not yet:

Currently Used

Endpoint	Field	How We Use It
ranked_keywords	`website_name`	Create/upgrade brand names
ranked_keywords	`search_intent_info.main_intent`	Keyword intent classification
backlinks	`backlink_spam_score`	Risk scoring
backlinks	`domain_from_country`	Risk flags (RU, CN, VN)
backlinks	`page_from_external_links`	Link farm detection (>100 = risky)
summary	`referring_links_platform_types`	Store counts (news, blogs, etc.)
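
As an illustration of how the three backlink fields in the table could feed risk flags, here is a small sketch. The field names, the country list, and the >100 cutoff come from the table; the flag names, the returned shape, and the upper-casing of the country code are assumptions.

```javascript
const RISKY_COUNTRIES = new Set(['RU', 'CN', 'VN']);
const LINK_FARM_THRESHOLD = 100; // more than 100 external links on the page = risky

function backlinkRiskFlags(backlink) {
  const flags = [];
  const country = String(backlink.domain_from_country ?? '').toUpperCase();
  if (RISKY_COUNTRIES.has(country)) {
    flags.push('risky_country'); // hypothetical flag name
  }
  if ((backlink.page_from_external_links ?? 0) > LINK_FARM_THRESHOLD) {
    flags.push('possible_link_farm'); // hypothetical flag name
  }
  // The spam score is passed through unchanged for downstream risk scoring.
  return { spamScore: backlink.backlink_spam_score ?? null, flags };
}

console.log(backlinkRiskFlags({
  backlink_spam_score: 45,
  domain_from_country: 'RU',
  page_from_external_links: 250,
}));
```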

Stage 1.5: Google Ads Categories

We cache dfs_category_path from DataForSEO and use it in classification:

/Computers & Electronics/Software -> saas_product
/Retailers & General Merchandise -> ecommerce_store
/News, Media & Publications/News -> news_publisher
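
A minimal sketch of such a lookup, using only the three mappings listed above. Matching on path prefix (so that subcategories inherit the parent's type) is an assumption, as is the function name.

```javascript
// Category path prefixes and the property type each one maps to.
const CATEGORY_RULES = [
  ['/Computers & Electronics/Software', 'saas_product'],
  ['/Retailers & General Merchandise', 'ecommerce_store'],
  ['/News, Media & Publications/News', 'news_publisher'],
];

function propertyTypeFromCategory(path) {
  if (typeof path !== 'string') return null;
  for (const [prefix, type] of CATEGORY_RULES) {
    // Match the category itself or any subcategory beneath it.
    if (path === prefix || path.startsWith(prefix + '/')) return type;
  }
  return null; // no hint: fall through to later stages
}

console.log(propertyTypeFromCategory('/Computers & Electronics/Software/Business Software'));
// saas_product
console.log(propertyTypeFromCategory('/Sports/Team Sports'));
// null
```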

Untapped Opportunities

Endpoint	Field	Potential Use
ranked_keywords	`keyword_data.categories`	Aggregate to determine domain industry
summary	`referring_links_tld`	TLD diversity = organic vs spam
summary	`referring_links_semantic_locations`	Footer-heavy = low quality
backlinks	`text_pre`, `text_post`	Editorial vs advertorial detection
backlinks	`url_to_spam_score`	Money page quality assessment

Pipeline Summary

Pipeline	Entry Point	Key Steps	Storage
App Store	Chart crawls, webhooks	HTML scrape -> iTunes -> ensureBrand -> ensureDomain -> CF Images	D1 (apps, brands, rankings)
SERP Tracking	Daily cron	Track positions -> ensureUrl for each result -> auto-classify	D1 (serp_positions) + ClickHouse
Domain Enrichment	On-demand API	3 parallel jobs: keywords, summary, backlinks	D1 (domains, keywords, backlinks)
Domain Classification	Auto (via ensureDomain)	7 stages: Cache -> Rules -> Categories -> Vectorize -> Low-Noise -> Instant -> LLM	D1 (domains.property_type, tier1_type)
URL Classification	Auto (via ensureUrl)	Rules -> Vectorize -> LLM per URL	D1 (urls.page_type, quality_tier)
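
The staged domain classification in the table can be read as a cascade: try the cheapest stage first and stop at the first confident answer. The sketch below shows that pattern only. The stage bodies are stubs, the `{ name, run }` stage shape is hypothetical, and applying a single 60% cutoff to every stage is an assumption.

```javascript
// Runs stages in order and returns the first result that meets the threshold.
async function classifyWithStages(domain, stages, minConfidence = 0.6) {
  for (const stage of stages) {
    const result = await stage.run(domain);
    if (result && result.confidence >= minConfidence) {
      return { ...result, stage: stage.name };
    }
  }
  return null; // no stage was confident enough
}

// Stub stages that record which ones actually ran.
const calls = [];
const stub = (name, fn) => ({
  name,
  run: async (domain) => { calls.push(name); return fn(domain); },
});

const stages = [
  stub('cache', () => null),
  stub('rules', () => null),
  stub('categories', (d) =>
    d === 'shop.example' ? { property_type: 'ecommerce_store', confidence: 0.85 } : null),
  stub('vectorize', () => null),
  stub('low-noise', () => null),
  stub('instant', () => null),
  stub('llm', () => ({ property_type: 'saas_product', confidence: 0.7 })),
];

classifyWithStages('shop.example', stages).then((result) => {
  console.log(result.stage); // categories
  console.log(calls);        // [ 'cache', 'rules', 'categories' ]
});
```

Ordering the stages from free to paid means the paid stages only run for domains the earlier ones could not resolve.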

Classification Triggers

Trigger	What Happens
`ensureDomain(url)`	If domain not classified with >=60% confidence, queues to `domain-classify`
`ensureUrl(url)`	If URL not classified with >=60% confidence, queues to `backlink-classify`
`POST /admin/classifier/domain`	Force classify a domain (bypasses queue)
`POST /admin/domains/enrich`	Fetch stats + backlinks, which triggers URL classification

Learning Thresholds

Type	Confidence Threshold	What Gets Learned
Domain	>=80%	domain + page metadata -> Vectorize
URL	>=65%	url + domain + classification -> Vectorize

Only LLM classifications trigger learning. Results from the Rules, Vectorize, and Content stages are not fed back into Vectorize, which avoids circular learning.
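
The gate described above can be expressed as a small predicate. The thresholds are the ones in the table; the function name, the `kind` and `source` field names, and the `'llm'` source label are assumptions for illustration.

```javascript
// Minimum confidence required before a classification is written to Vectorize.
const LEARNING_THRESHOLDS = { domain: 0.8, url: 0.65 };

function shouldLearn({ kind, source, confidence }) {
  // Only LLM results are learned; other stages are excluded so the index
  // never trains on its own output.
  if (source !== 'llm') return false;
  const threshold = LEARNING_THRESHOLDS[kind];
  return threshold !== undefined && confidence >= threshold;
}

console.log(shouldLearn({ kind: 'domain', source: 'llm', confidence: 0.82 }));       // true
console.log(shouldLearn({ kind: 'domain', source: 'llm', confidence: 0.7 }));        // false
console.log(shouldLearn({ kind: 'url', source: 'llm', confidence: 0.7 }));           // true
console.log(shouldLearn({ kind: 'domain', source: 'vectorize', confidence: 0.99 })); // false
```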

Cost Tiers

Tier	Operations	Cost
FREE	HTML scrape, iTunes API, Rules, Vectorize, Low-Noise Crawl	$0
CHEAP	Instant Pages, Domain Summary	$0.000125-0.02
MODERATE	Ranked Keywords, Backlinks	$0.03-0.04
EXPENSIVE	LLM calls, DataForSEO app fallback	$0.0001-0.01

Key Optimizations

Auto-classification via ensureUrl/ensureDomain - Every URL/domain gets classified automatically
Cache check first - Skip classification if already done with sufficient confidence
DataForSEO hints at Stage 1.5 - Use cached category data before expensive stages
Low-noise crawl before Instant Pages - 70% of domains classified without paid API
Self-learning via Vectorize - High-confidence LLM results improve future classifications
HTML scrape before paid APIs - Apple apps fetched free first
Similar apps with crawl_depth=0 - Prevents infinite recursion
3 parallel enrichment jobs - Keywords, summary, backlinks run independently
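
To illustrate the last point, independent enrichment jobs can be run with `Promise.allSettled` so that one failure does not discard the other results. This is a sketch of the pattern only: the job functions are stubs, and `enrichDomain` and the report shape are hypothetical names, not the actual enrichment code.

```javascript
// Runs every job concurrently and reports each outcome separately.
async function enrichDomain(domain, jobs) {
  const names = Object.keys(jobs);
  const settled = await Promise.allSettled(names.map((name) => jobs[name](domain)));
  const report = {};
  settled.forEach((outcome, i) => {
    report[names[i]] = outcome.status === 'fulfilled'
      ? { ok: true, value: outcome.value }
      : { ok: false, error: String(outcome.reason?.message ?? outcome.reason) };
  });
  return report;
}

// Stub jobs: the summary job fails, the other two succeed.
const jobs = {
  keywords: async (d) => ({ domain: d, count: 12 }),
  summary: async () => { throw new Error('rate limited'); },
  backlinks: async (d) => ({ domain: d, count: 3 }),
};

enrichDomain('example.com', jobs).then((report) => console.log(report));
```

With `Promise.all` the failed summary job would reject the whole call; `Promise.allSettled` keeps the keyword and backlink results available.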
