Architecture & Flows

High-level overview of how the RankFabric edge worker coordinates external systems. Detailed flowcharts live in docs/reference/flows.md.

System Components

| Component | Role | Key Files |
| --- | --- | --- |
| Cloudflare Worker | Orchestrates /run, /confirm-categories, queue consumers, diagnostics. Stores transient state in KV/R2. | src/index.js, src/lib/enrich.js, src/lib/harvest.js |
| React Client | Collects user input, polls status, renders harvest results, performs Base44 CRUD via API. | External app (see docs/react.md) |
| Base44 | Canonical entities: projects, business types, categories, keywords, relationships. | src/lib/base44-client.js |
| ClickHouse | Append-only analytics store for keyword metrics and trends. | src/lib/clickhouse.js |
| DataForSEO | Provides HTML snapshots (Instant Pages), domain keyword seeds (Labs), Google Ads categories. App store data as fallback. | src/lib/harvest.js, src/lib/dataforseo-api.js |
| Cloudflare Images | Stores app icons to avoid hotlinking Apple/Google CDN URLs. Returns delivery URLs. | src/lib/cloudflare-images.js |
| iTunes API | Authoritative source for Apple app metadata (rating, release_date, version, size). | src/lib/itunes-api.js |
| KV / R2 / Queues | KV for budgets + run state, R2 for raw HTML, Queues for harvest processing. | Configured via wrangler.toml |
| Low-Noise Crawler | FREE homepage metadata extraction using HEAD + partial GET. Runs before paid APIs. | src/lib/low-noise-crawler.js |
| Domain Classifier | 7-stage cost-optimized pipeline for domain classification. | src/lib/domain-classifier.js |

Request Flow (Summary)

  1. POST /run
    • Check quota & idempotency in KV.
    • Fetch and store HTML in R2.
    • Run enrichment LLM to classify business type and categories.
    • Return awaiting_category_confirmation.
  2. POST /run/ID/confirm-categories
    • Merge user input, validate by business type.
    • Queue harvest_keywords.
  3. harvest_keywords consumer
    • Fetch enrichment + customization from KV.
    • Merge DataForSEO domain keywords with AI-generated category keywords.
    • Persist to Base44 and ClickHouse.
  4. GET /run/ID/status
    • Exposes run metadata for the React client until the harvest completes.
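
The quota and idempotency check in step 1 might look roughly like the following. This is a minimal sketch, not the worker's actual code: the key prefixes (`idem:`, `quota:`), the daily limit, the run ID format, and the `checkRunAllowed` helper are all hypothetical. Only `get` and `put` with `expirationTtl` mirror the shape of a Workers KV namespace binding.

```javascript
// Hypothetical helper: decides whether a POST /run may proceed.
// `kv` is anything with async get(key) / put(key, value, options),
// which matches the shape of a Workers KV namespace binding.
async function checkRunAllowed(kv, { projectId, idempotencyKey, dailyLimit = 50 }) {
  // 1. Idempotency: a repeated key returns the run that already exists.
  const existingRunId = await kv.get(`idem:${idempotencyKey}`);
  if (existingRunId) {
    return { allowed: true, runId: existingRunId, reused: true };
  }

  // 2. Quota: count runs for this project (assumed key layout).
  const quotaKey = `quota:${projectId}`;
  const used = Number((await kv.get(quotaKey)) ?? 0);
  if (used >= dailyLimit) {
    return { allowed: false, reason: "quota_exceeded" };
  }

  // 3. Record the new run. KV is eventually consistent, so this is a
  //    best-effort guard rather than a strict lock.
  const runId = `run_${Date.now().toString(36)}_${Math.random().toString(36).slice(2, 10)}`;
  await kv.put(quotaKey, String(used + 1), { expirationTtl: 86400 });
  await kv.put(`idem:${idempotencyKey}`, runId, { expirationTtl: 86400 });
  return { allowed: true, runId, reused: false };
}
```

Checking idempotency before quota means a retried request does not consume budget a second time.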

Data Ownership

  • React/Base44 owns entity lifecycles.
  • Worker owns operational pipelines and metrics insertion.
  • ClickHouse stores facts only (snapshots/trends); see docs/data-architecture.md.

Developer Touchpoints

  • Run diagnostics and health checks via /diagnostics/* and /test/clickhouse.
  • Bindings are defined in wrangler.toml; secrets are set with wrangler secret put.
  • Use docs/backend.md for deeper backend behavior and docs/react.md for UI expectations.

Apple App Store Scraping Strategy

The worker uses a tiered approach for fetching Apple app data, optimized for cost and reliability:

Data Source Priority (Apple)

  1. HTML Scrape (FREE) - Direct fetch to apps.apple.com using Desktop Safari UA

    • Provides: similar_apps, more_apps_by_developer, primary_category, support_url, privacy_url
    • Rate limited: 300-500ms delays with jitter, 10s backoff on 403/429
  2. iTunes API (FREE) - Always runs for authoritative metadata

    • Provides: rating, rating_count, release_date, version, size_bytes, description
    • Falls back to ZenRows proxy if direct calls are blocked
    • Rate limited: 5 req/sec max, 200ms minimum delay, 3s initial backoff
  3. DataForSEO (PAID, ~$0.0012/app) - Last resort fallback

    • Only used if both HTML scrape and iTunes API fail completely
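
The tier ordering above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the code in src/lib: `fetchAppleApp` and the three injected fetchers (`scrapeHtml`, `fetchItunes`, `fetchDataForSeo`) are hypothetical names, and the ZenRows proxy fallback inside the iTunes tier is not modeled.

```javascript
// Hypothetical orchestration of the three tiers. The fetchers are injected
// so the sketch stays independent of the real modules.
async function fetchAppleApp(appId, { scrapeHtml, fetchItunes, fetchDataForSeo }) {
  const sources = [];
  let merged = {};

  // Tier 1: free HTML scrape (similar apps, category, support/privacy URLs).
  try {
    merged = { ...merged, ...(await scrapeHtml(appId)) };
    sources.push("html");
  } catch (_) {
    // Ignore and fall through to the next tier.
  }

  // Tier 2: the iTunes API always runs. Its fields win on conflict because
  // it is the authoritative source for rating, version, size, and so on.
  try {
    merged = { ...merged, ...(await fetchItunes(appId)) };
    sources.push("itunes");
  } catch (_) {
    // Ignore; the paid tier below covers a total failure.
  }

  // Tier 3: paid fallback only when both free tiers failed completely.
  if (sources.length === 0) {
    merged = await fetchDataForSeo(appId);
    sources.push("dataforseo");
  }
  return { appId, sources, data: merged };
}
```

Returning the list of `sources` alongside the data makes it easy to audit how often the paid tier is actually reached.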

Critical Implementation Notes

  • Desktop Safari UAs required - Mobile UAs trigger itms-appss:// redirect loops
  • Randomized headers - Header order shuffled to avoid fingerprinting
  • Jittered delays - All requests use randomized delays to appear human
  • Connection limits - Cloudflare Workers allow only about 6 simultaneous open connections per invocation, so apps are processed sequentially
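
A sequential fetch loop with jittered delays and a backoff on 403/429 could look like the sketch below. It uses the HTML scrape numbers quoted earlier (300-500ms jitter, 10s backoff). The function names, the single-retry policy, and the injectable `fetchFn`/`wait` parameters are assumptions made for illustration.

```javascript
// Returns a delay in ms drawn uniformly from [minMs, maxMs].
// `rand` is injectable so the function can be tested deterministically.
function jitteredDelayMs(minMs = 300, maxMs = 500, rand = Math.random) {
  return Math.round(minMs + rand() * (maxMs - minMs));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetches URLs one at a time, pausing between requests and backing off
// when the server answers 403 or 429. `fetchFn` defaults to global fetch.
async function fetchSequentially(urls, { fetchFn = fetch, backoffMs = 10000, wait = sleep } = {}) {
  const results = [];
  for (const url of urls) {
    let res = await fetchFn(url);
    if (res.status === 403 || res.status === 429) {
      await wait(backoffMs);    // fixed backoff before retrying
      res = await fetchFn(url); // single retry in this sketch
    }
    results.push({ url, status: res.status });
    await wait(jitteredDelayMs()); // randomized pause before the next URL
  }
  return results;
}
```

Processing one URL at a time keeps the worker well under the connection limit, at the cost of longer wall-clock time per batch.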

Google Play Strategy

  • DataForSEO is the primary source (no free scraping alternative)
  • Uses postback webhooks for async delivery

Cloudflare Images Integration

App icons are uploaded to Cloudflare Images to avoid hotlinking Apple/Google CDNs:

  • Upload: src/lib/cloudflare-images.js uploads icons during app enrichment
  • Storage: Icons stored with ID format {platform}_{app_id} for deduplication
  • Delivery: URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public
  • Denormalization: icon_url stored in both apps and app_category_rankings tables
  • Fallback: Original URL used if CF Images upload fails
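
The ID format, delivery URL, and fallback rule above fit in a few lines. The helper names below (`iconImageId`, `iconDeliveryUrl`, `resolveIconUrl`) and the injected `upload` function are hypothetical; the real uploader lives in src/lib/cloudflare-images.js and may be structured differently.

```javascript
// Image ID in the {platform}_{app_id} format described above.
function iconImageId(platform, appId) {
  return `${platform}_${appId}`;
}

// Delivery URL for the "public" variant.
function iconDeliveryUrl(accountHash, imageId) {
  return `https://imagedelivery.net/${accountHash}/${imageId}/public`;
}

// Resolves the icon URL to store, falling back to the original on failure.
// `upload` is a hypothetical function standing in for the real uploader.
async function resolveIconUrl({ platform, appId, originalUrl, accountHash }, upload) {
  const imageId = iconImageId(platform, appId);
  try {
    await upload(imageId, originalUrl);
    return iconDeliveryUrl(accountHash, imageId);
  } catch (_) {
    return originalUrl; // upload failed: keep the original CDN URL
  }
}
```

Because the ID is derived only from platform and app ID, re-enriching the same app maps to the same image rather than creating a duplicate.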

Domain Classification System

Cost-optimized 7-stage pipeline for classifying domains. FREE stages run first; PAID stages run only when the free stages leave confidence below 70%.

flowchart LR
subgraph FREE["FREE Stages"]
S0[0. Cache] --> S1[1. Rules]
S1 --> S1_5[1.5 Google Ads<br/>Categories]
S1_5 --> S2[2. Vectorize]
S2 --> S3[3. Low-Noise<br/>Crawl]
end

subgraph PAID["PAID (if needed)"]
S3 --> S4[4. Instant Pages<br/>$0.000125]
S4 --> S4_5[4.5 Domain<br/>Patterns]
S4_5 --> S5[5. LLM<br/>~$0.0001]
end

S3 -->|"≥70%"| DONE[Done]
S4 -->|"≥70%"| DONE
S5 --> DONE

style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2

Stage Details

| Stage | Name | Cost | Description |
| --- | --- | --- | --- |
| 0 | Cache | FREE | Check D1 for existing classification |
| 1 | Rules | FREE | TLDs (.gov/.edu), known domains, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data → tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to labeled domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), CMS/og:type detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
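
The "cheapest stage first, stop when confident" idea can be sketched as a loop over stage objects. This is a simplified illustration, not the logic in src/lib/domain-classifier.js: the `{ name, cost, run }` stage shape and `classifyDomain` are hypothetical, and the sketch checks the 70% threshold after every stage, whereas the flowchart only shows early exits after stages 3 and 4.

```javascript
const CONFIDENCE_THRESHOLD = 0.7;

// Runs stages in order and stops at the first result that meets the threshold.
// Each stage is { name, cost, run(domain, best) }, where run resolves to
// { property_type, confidence } or null when the stage has no opinion.
async function classifyDomain(domain, stages, threshold = CONFIDENCE_THRESHOLD) {
  let best = null;
  let spent = 0;
  for (const stage of stages) {
    spent += stage.cost; // a stage is paid for as soon as it runs
    const result = await stage.run(domain, best);
    if (result && (!best || result.confidence > best.confidence)) {
      best = { ...result, stage: stage.name };
    }
    if (best && best.confidence >= threshold) break;
  }
  return { domain, ...best, spent };
}
```

Ordering the `stages` array by cost is what keeps the paid stages rare; the loop itself has no notion of free versus paid.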

Low-Noise Crawler

The low-noise crawler (src/lib/low-noise-crawler.js) is a FREE alternative to DataForSEO Instant Pages:

Phase 1: DNS Resolution
└─> Check root vs www, determine canonical host

Phase 2: HEAD Request
└─> Follow redirects, capture server headers

Phase 3: Partial GET (Range: 0-8KB)
└─> Extract <head> only: title, description, canonical, og:*, generator
└─> CMS detection from generator meta tag
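
Phase 3 might be implemented along these lines. The sketch assumes the 8KB window means bytes 0 through 8191 and uses a hypothetical `fetchHead` helper with an injectable `fetchFn`. It truncates by characters rather than bytes, which is close enough for illustration only.

```javascript
// Fetches at most the first 8 KB of a page and returns the markup
// up to (but not including) the closing </head> tag.
async function fetchHead(url, fetchFn = fetch) {
  const res = await fetchFn(url, {
    headers: { Range: "bytes=0-8191" }, // request the first 8192 bytes
    redirect: "follow",
  });
  // Servers may ignore Range and reply 200 with the full body, so truncate
  // defensively. A production version would stream and abort early.
  const text = (await res.text()).slice(0, 8192);
  const end = text.toLowerCase().indexOf("</head>");
  return end === -1 ? text : text.slice(0, end);
}
```

If `</head>` does not appear within the window, the function returns the whole truncated chunk so later parsing can still pick up whatever tags arrived.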

Detection Capabilities:

  • CMS: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow, Next.js
  • og:type mapping: product → ecommerce, article → blog, music/video → streaming
  • Parked domains: "domain for sale", "coming soon" patterns
  • Content signals: SaaS keywords, ecommerce, news patterns
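
A rough sketch of how a few of these signals could be pulled from the `<head>` markup follows. It is regex-based and deliberately naive: `metaContent` and `detectSignals` are hypothetical helpers, the parked-domain pattern covers only the two phrases listed above, and the CMS name is simply the first word of the generator tag rather than a lookup against a known list.

```javascript
// og:type roots mapped to classification hints, following the list above.
const OG_TYPE_HINTS = new Map([
  ["product", "ecommerce"],
  ["article", "blog"],
  ["music", "streaming"],
  ["video", "streaming"],
]);

// Reads a <meta> tag's content by name or property. Only handles the common
// attribute order (name/property before content).
function metaContent(html, key) {
  const re = new RegExp(
    `<meta[^>]+(?:name|property)=["']${key}["'][^>]*content=["']([^"']*)["']`,
    "i"
  );
  const m = html.match(re);
  return m ? m[1] : null;
}

function detectSignals(headHtml) {
  const generator = metaContent(headHtml, "generator");
  const ogType = metaContent(headHtml, "og:type");
  // og:type values can be namespaced, e.g. "video.movie" or "music.song".
  const ogRoot = ogType ? ogType.split(".")[0].toLowerCase() : null;
  return {
    cms: generator ? generator.split(/\s+/)[0] : null,
    hint: ogRoot ? (OG_TYPE_HINTS.get(ogRoot) ?? null) : null,
    parked: /domain (is )?for sale|coming soon/i.test(headHtml),
  };
}
```

For example, a head containing `<meta name="generator" content="WordPress 6.4">` and `<meta property="og:type" content="article">` yields `cms: "WordPress"` and `hint: "blog"`.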

Why Low-Noise First:

  • FREE - No API costs
  • Fast - Only 8KB vs full page
  • Stealthy - HEAD + Range header mimics browser prefetch
  • Effective - Handles ~70% of domains without Instant Pages

Classification Output

Each domain gets classified with:

  • property_type - Specific type (saas_product, ecommerce_store, news_publisher, etc.)
  • tier1_type - High-level archetype (platform, commerce, service, information, etc.)
  • channel - Marketing channel bucket
  • media_type - PESO model (paid, earned, shared, owned)

See docs/domain-onboarding-flow.md for the full domain onboarding flow. See docs/backlink-intelligence.md for backlink classification details.