System Overview

The highest-level view of RankFabric's data pipelines.

Master System Diagram

How URLs/Domains Get Classified

Every URL that enters the system flows through ensureUrl(), which:

  1. Creates or updates the urls record
  2. Calls ensureDomain() to create or update the domain record
  3. Auto-queues unclassified domains to the domain-classify queue
  4. Auto-queues unclassified URLs to the backlink-classify queue

This happens for URLs from ALL sources:

  • SERP tracking (new competitor URLs in rankings)
  • App store crawls (developer URLs)
  • Backlink fetches (source URLs)
  • Manual additions via UI
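The four steps above can be sketched in memory as follows. This is a minimal illustration, not the project's code: the Maps stand in for the urls and domains tables, the arrays stand in for the two queues, and `createStore`, the `classified` field, and the function signatures are all assumptions.

```javascript
// In-memory stand-ins for the D1 tables and the two queues (assumptions for illustration).
function createStore() {
  return {
    urls: new Map(),
    domains: new Map(),
    queues: { "domain-classify": [], "backlink-classify": [] },
  };
}

// Step 2 + 3: create/update the domain record, queue it if still unclassified.
function ensureDomain(store, rawUrl) {
  const domain = new URL(rawUrl).hostname;
  const record = store.domains.get(domain) ?? { domain, classified: false };
  record.lastSeen = Date.now();
  store.domains.set(domain, record);
  const queue = store.queues["domain-classify"];
  // De-duplicating the queue here is a simplification of this sketch.
  if (!record.classified && !queue.includes(domain)) queue.push(domain);
  return record;
}

// Step 1 + 4: create/update the URL record, then queue it if still unclassified.
function ensureUrl(store, rawUrl, source) {
  const record = store.urls.get(rawUrl) ?? { url: rawUrl, classified: false };
  record.source = source;
  store.urls.set(rawUrl, record);
  record.domain = ensureDomain(store, rawUrl).domain;
  const queue = store.queues["backlink-classify"];
  if (!record.classified && !queue.includes(rawUrl)) queue.push(rawUrl);
  return record;
}

// Usage: two URLs on one domain queue the domain once and each URL once.
const demoStore = createStore();
ensureUrl(demoStore, "https://example.com/pricing", "serp");
ensureUrl(demoStore, "https://example.com/blog", "backlink");
```

Because every source calls the same entry point, the queueing logic lives in one place regardless of where a URL came from.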

Vectorize Learning (Self-Improving System)

The classification system improves over time as high-confidence LLM results are stored and reused:

LLM Classification (>=80% confidence for domains, >=65% for URLs)
|
v
Generate embedding from: domain + title + description + content
|
v
Upsert to VECTORIZE_DOMAINS index with metadata:
{ domain, property_type, channel, confidence, classified_at }
|
v
Future Stage 2 queries find similar domains/URLs
|
v
Better classification without LLM cost

Key Files:

  • src/lib/domain-classifier.js - learnDomainClassification() at line 2481
  • src/lib/backlink-classifier.js - learnFromClassification() at line ~90
  • src/lib/classifier-vectorize.js - addClassifiedUrl() at line 405
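As a rough illustration of the embed-and-upsert step in the flow above, the sketch below builds the text to embed and the record to upsert. The helper names, the use of the domain as the vector id, and the 2000-character content cap are assumptions; the actual functions in the files listed above may be structured differently.

```javascript
// Join the fields named in the flow (domain + title + description + content).
// The 2000-character content cap is an assumption of this sketch.
function buildEmbeddingText({ domain, title, description, content }) {
  return [domain, title, description, (content || "").slice(0, 2000)]
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .join("\n");
}

// Build a record carrying the metadata fields listed in the flow.
// Using the domain as the vector id is an assumption.
function buildVectorRecord(classification, embedding, now = new Date()) {
  return {
    id: classification.domain,
    values: embedding,
    metadata: {
      domain: classification.domain,
      property_type: classification.property_type,
      channel: classification.channel,
      confidence: classification.confidence,
      classified_at: now.toISOString(),
    },
  };
}

// In a Worker, the record would then be passed to the index's upsert call,
// e.g. something like env.VECTORIZE_DOMAINS.upsert([record]) (binding name assumed).
```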

DataForSEO Classification Hints

DataForSEO responses include fields that help with classification. Some are already used; others are not yet.

Currently Used

| Endpoint | Field | How We Use It |
| --- | --- | --- |
| ranked_keywords | website_name | Create/upgrade brand names |
| ranked_keywords | search_intent_info.main_intent | Keyword intent classification |
| backlinks | backlink_spam_score | Risk scoring |
| backlinks | domain_from_country | Risk flags (RU, CN, VN) |
| backlinks | page_from_external_links | Link farm detection (>100 = risky) |
| summary | referring_links_platform_types | Store counts (news, blogs, etc.) |
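The three backlinks rows above could be combined into a small risk-signal helper like the one below. The flag names and the function itself are hypothetical; only the country list and the >100 external-links rule come from the table, and no spam-score cutoff is applied because none is stated.

```javascript
const RISKY_COUNTRIES = new Set(["RU", "CN", "VN"]); // from the table above
const LINK_FARM_THRESHOLD = 100; // pages with more than 100 external links are treated as risky

// Returns the raw spam score plus any risk flags for one backlink row.
function backlinkRiskSignals(backlink) {
  const flags = [];
  const country = (backlink.domain_from_country || "").toUpperCase();
  if (RISKY_COUNTRIES.has(country)) flags.push("risky_country");
  if ((backlink.page_from_external_links ?? 0) > LINK_FARM_THRESHOLD) {
    flags.push("link_farm");
  }
  return { spam_score: backlink.backlink_spam_score ?? null, flags };
}

// Usage: a Russian page with 250 outbound links gets both flags.
const demoSignals = backlinkRiskSignals({
  backlink_spam_score: 42,
  domain_from_country: "ru",
  page_from_external_links: 250,
});
```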

Stage 1.5: Google Ads Categories

We cache dfs_category_path from DataForSEO and use it in classification:

  • /Computers & Electronics/Software -> saas_product
  • /Retailers & General Merchandise -> ecommerce_store
  • /News, Media & Publications/News -> news_publisher
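A category hint like this amounts to a lookup on the cached path. The sketch below uses only the three mappings listed above; the prefix-matching behavior and the function name are assumptions, and the real mapping table is presumably larger.

```javascript
// Prefix -> property_type pairs taken from the three examples above.
const CATEGORY_RULES = [
  ["/Computers & Electronics/Software", "saas_product"],
  ["/Retailers & General Merchandise", "ecommerce_store"],
  ["/News, Media & Publications/News", "news_publisher"],
];

// Returns a property_type hint, or null when no rule matches.
// Matching whole path segments (exact match or prefix + "/") is an assumption.
function propertyTypeFromCategory(dfsCategoryPath) {
  if (!dfsCategoryPath) return null;
  for (const [prefix, type] of CATEGORY_RULES) {
    if (dfsCategoryPath === prefix || dfsCategoryPath.startsWith(prefix + "/")) {
      return type;
    }
  }
  return null;
}
```

Matching on whole segments keeps a path such as /News, Media & Publications/Newsletters from being mistaken for the News category.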

Untapped Opportunities

| Endpoint | Field | Potential Use |
| --- | --- | --- |
| ranked_keywords | keyword_data.categories | Aggregate to determine domain industry |
| summary | referring_links_tld | TLD diversity = organic vs spam |
| summary | referring_links_semantic_locations | Footer-heavy = low quality |
| backlinks | text_pre, text_post | Editorial vs advertorial detection |
| backlinks | url_to_spam_score | Money page quality assessment |

Pipeline Summary

| Pipeline | Entry Point | Key Steps | Storage |
| --- | --- | --- | --- |
| App Store | Chart crawls, webhooks | HTML scrape -> iTunes -> ensureBrand -> ensureDomain -> CF Images | D1 (apps, brands, rankings) |
| SERP Tracking | Daily cron | Track positions -> ensureUrl for each result -> auto-classify | D1 (serp_positions) + ClickHouse |
| Domain Enrichment | On-demand API | 3 parallel jobs: keywords, summary, backlinks | D1 (domains, keywords, backlinks) |
| Domain Classification | Auto (via ensureDomain) | 7 stages: Cache -> Rules -> Categories -> Vectorize -> Low-Noise -> Instant -> LLM | D1 (domains.property_type, tier1_type) |
| URL Classification | Auto (via ensureUrl) | Rules -> Vectorize -> LLM per URL | D1 (urls.page_type, quality_tier) |
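The staged classification rows describe a cheapest-first cascade. A minimal sketch of that control flow is below; the stage interface (`{ name, run }`), the result shape, and the 0.6 default cutoff for accepting a stage's answer are assumptions, not the project's implementation.

```javascript
// Run stages in order (cheapest first) and stop at the first confident result.
// Each stage is { name, run }, where run(input) resolves to
// { property_type, confidence } or null when the stage has no answer.
async function classifyInStages(input, stages, minConfidence = 0.6) {
  const attempted = [];
  for (const { name, run } of stages) {
    attempted.push(name);
    const result = await run(input);
    if (result && result.confidence >= minConfidence) {
      return { ...result, stage: name, attempted };
    }
  }
  // No stage was confident enough.
  return { property_type: null, confidence: 0, stage: null, attempted };
}
```

With stages ordered Cache, Rules, Categories, Vectorize, Low-Noise, Instant, LLM, the paid stages run only when every cheaper stage has declined or returned a low-confidence answer.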

Classification Triggers

| Trigger | What Happens |
| --- | --- |
| ensureDomain(url) | If domain not classified with >=60% confidence, queues to domain-classify |
| ensureUrl(url) | If URL not classified with >=60% confidence, queues to backlink-classify |
| POST /admin/classifier/domain | Force classify a domain (bypasses queue) |
| POST /admin/domains/enrich | Fetch stats + backlinks, which triggers URL classification |
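The >=60% check used by both ensure functions can be expressed as a single predicate. In this sketch the record fields (`label`, `confidence`) and the 0-1 confidence scale are assumptions.

```javascript
const CLASSIFIED_MIN_CONFIDENCE = 0.6; // the >=60% rule from the table above

// True when a record should be queued for classification.
function needsClassification(record) {
  if (!record) return true; // never seen before
  const hasLabel = record.label != null;
  return !(hasLabel && record.confidence >= CLASSIFIED_MIN_CONFIDENCE);
}
```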

Learning Thresholds

| Type | Confidence Threshold | What Gets Learned |
| --- | --- | --- |
| Domain | >=80% | domain + page metadata -> Vectorize |
| URL | >=65% | url + domain + classification -> Vectorize |

Only LLM classifications trigger learning. Results from the Rules, Vectorize, and Content stages are NOT fed back, which avoids circular learning.
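The learning rule can be written as one gate that checks the result's source and the per-type threshold. This is a sketch under assumptions: the `kind` and `source` field names, the "llm" source value, and the 0-1 confidence scale are hypothetical.

```javascript
// Thresholds from the table above, expressed as fractions.
const LEARNING_THRESHOLDS = { domain: 0.8, url: 0.65 };

// True when a classification result should be written to Vectorize.
function shouldLearn({ kind, source, confidence }) {
  // Only LLM output is learned; feeding back Rules/Vectorize/Content
  // results would let the index reinforce its own guesses.
  if (source !== "llm") return false;
  const threshold = LEARNING_THRESHOLDS[kind];
  return threshold !== undefined && confidence >= threshold;
}
```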

Cost Tiers

| Tier | Operations | Cost |
| --- | --- | --- |
| FREE | HTML scrape, iTunes API, Rules, Vectorize, Low-Noise Crawl | $0 |
| CHEAP | Instant Pages, Domain Summary | $0.000125-0.02 |
| MODERATE | Ranked Keywords, Backlinks | $0.03-0.04 |
| EXPENSIVE | LLM calls, DataForSEO app fallback | $0.0001-0.01 |

Key Optimizations

  1. Auto-classification via ensureUrl/ensureDomain - Every URL/domain gets classified automatically
  2. Cache check first - Skip classification if already done with sufficient confidence
  3. DataForSEO hints at Stage 1.5 - Use cached category data before expensive stages
  4. Low-noise crawl before Instant Pages - 70% of domains classified without paid API
  5. Self-learning via Vectorize - High-confidence LLM results improve future classifications
  6. HTML scrape before paid APIs - Apple apps fetched free first
  7. Similar apps with crawl_depth=0 - Prevents infinite recursion
  8. 3 parallel enrichment jobs - Keywords, summary, backlinks run independently
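Item 8 says the three enrichment jobs run independently. One way to get that behavior is `Promise.allSettled`, so a failure in one job does not discard the others. Whether the real pipeline uses this call or separate queue jobs is not stated here; `enrichDomain` and the `jobs` map are hypothetical.

```javascript
// Run every job for a domain concurrently and report each outcome separately.
// `jobs` maps a job name (e.g. keywords, summary, backlinks) to a function
// that takes the domain and returns a value or a promise.
async function enrichDomain(domain, jobs) {
  const names = Object.keys(jobs);
  const settled = await Promise.allSettled(
    // The async wrapper turns synchronous throws into rejections too.
    names.map(async (name) => jobs[name](domain))
  );
  const report = {};
  settled.forEach((outcome, i) => {
    report[names[i]] =
      outcome.status === "fulfilled"
        ? { ok: true, data: outcome.value }
        : { ok: false, error: String(outcome.reason?.message ?? outcome.reason) };
  });
  return report;
}
```

A caller can then persist whichever results succeeded and retry only the failed job names.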