Architecture & Flows

High-level overview of how the RankFabric edge worker coordinates external systems. Detailed flowcharts live in docs/reference/flows.md.

System Components

Component	Role	Key Files
Cloudflare Worker	Orchestrates `/run`, `/confirm-categories`, queue consumers, diagnostics. Stores transient state in KV/R2.	`src/index.js`, `src/lib/enrich.js`, `src/lib/harvest.js`
React Client	Collects user input, polls status, renders harvest results, performs Base44 CRUD via API.	External app (see `docs/react.md`)
Base44	Canonical entities: projects, business types, categories, keywords, relationships.	`src/lib/base44-client.js`
ClickHouse	Append-only analytics store for keyword metrics and trends.	`src/lib/clickhouse.js`
DataForSEO	Provides HTML snapshots (Instant Pages), domain keyword seeds (Labs), and Google Ads categories. Also supplies app store data (last-resort fallback for Apple, primary source for Google Play).	`src/lib/harvest.js`, `src/lib/dataforseo-api.js`
Cloudflare Images	Stores app icons to avoid hotlinking Apple/Google CDN URLs. Returns delivery URLs.	`src/lib/cloudflare-images.js`
iTunes API	Authoritative source for Apple app metadata (rating, release_date, version, size).	`src/lib/itunes-api.js`
KV / R2 / Queues	KV for budgets + run state, R2 for raw HTML, Queues for harvest processing.	Configured via `wrangler.toml`
Low-Noise Crawler	FREE homepage metadata extraction using HEAD + partial GET. Runs before paid APIs.	`src/lib/low-noise-crawler.js`
Domain Classifier	7-stage cost-optimized pipeline for domain classification.	`src/lib/domain-classifier.js`

Request Flow (Summary)

POST /run
- Check quota & idempotency in KV.
- Fetch and store HTML in R2.
- Run enrichment LLM to classify business type and categories.
- Return awaiting_category_confirmation.
POST /run/ID/confirm-categories
- Merge user input, validate by business type.
- Queue harvest_keywords.
harvest_keywords consumer
- Fetch enrichment + customization from KV.
- Merge DataForSEO domain keywords with AI-generated category keywords.
- Persist to Base44 and ClickHouse.
GET /run/ID/status
- Exposes run metadata for the React client until the harvest completes.
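
The flow above can be read as a small state machine over the run's status. The following is a minimal sketch under stated assumptions, not the worker's actual code: only awaiting_category_confirmation is named in this document, while the other status names (new, harvest_queued, complete) and all event names are hypothetical.

```javascript
// Hypothetical run-status transitions for the request flow.
// Only "awaiting_category_confirmation" comes from this document;
// every other status and event name is an assumption for illustration.
const TRANSITIONS = {
  new: { run_accepted: "awaiting_category_confirmation" },
  awaiting_category_confirmation: { categories_confirmed: "harvest_queued" },
  harvest_queued: { harvest_finished: "complete" },
  complete: {},
};

// Returns the next status, or throws if the event is not valid in that state.
function nextStatus(current, event) {
  const allowed = TRANSITIONS[current];
  if (!allowed) throw new Error(`Unknown status: ${current}`);
  const next = allowed[event];
  if (!next) throw new Error(`Event "${event}" is not allowed in "${current}"`);
  return next;
}

// A client polling GET /run/ID/status could stop once this returns true.
function isTerminal(status) {
  return status === "complete";
}
```

Keeping the transitions in one table makes it easy to reject out-of-order calls, such as confirming categories on a run that has already been harvested.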

Data Ownership

React/Base44 owns entity lifecycles.
Worker owns operational pipelines and metrics insertion.
ClickHouse stores facts only (snapshots/trends); see docs/data-architecture.md.

Developer Touchpoints

Run diagnostics and health checks via /diagnostics/* and /test/clickhouse.
Bindings are defined in wrangler.toml; secrets are set separately via wrangler secret put.
Use docs/backend.md for deeper backend behavior and docs/react.md for UI expectations.

Apple App Store Scraping Strategy

The worker uses a tiered approach for fetching Apple app data, optimized for cost and reliability:

Data Source Priority (Apple)

HTML Scrape (FREE) - Direct fetch to apps.apple.com using Desktop Safari UA
- Provides: similar_apps, more_apps_by_developer, primary_category, support_url, privacy_url
- Rate limited: 300-500ms delays with jitter, 10s backoff on 403/429
iTunes API (FREE) - Always runs for authoritative metadata
- Provides: rating, rating_count, release_date, version, size_bytes, description
- Falls back to ZenRows proxy if direct calls are blocked
- Rate limited: 5 req/sec max, 200ms minimum delay, 3s initial backoff
DataForSEO (PAID, ~$0.0012/app) - Last resort fallback
- Only used if both HTML scrape and iTunes API fail completely
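
The priority rule above can be sketched as a merge of the two free sources, with the paid fallback flagged only when both fail. This is an illustrative sketch, not the code in src/lib: the function name mergeAppleSources and the convention that a failed source is represented as null are assumptions.

```javascript
// Merge the two free sources and flag the paid DataForSEO fallback
// only when both of them failed. A failed source is passed as null
// (an assumption of this sketch).
function mergeAppleSources(htmlData, itunesData) {
  if (htmlData === null && itunesData === null) {
    return { data: null, needsPaidFallback: true };
  }
  // The iTunes API is described as authoritative, so its fields are
  // spread last and win when both sources provide the same key.
  const data = { ...(htmlData ?? {}), ...(itunesData ?? {}) };
  return { data, needsPaidFallback: false };
}

// Example: the scrape contributes primary_category, iTunes contributes
// rating, and iTunes overrides the scraped version.
const merged = mergeAppleSources(
  { primary_category: "Games", version: "1.0" },
  { version: "2.0", rating: 4.5 }
);
```

Letting iTunes fields override scraped ones keeps the authoritative metadata intact while still picking up fields that only the HTML page exposes.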

Critical Implementation Notes

Desktop Safari UAs required - Mobile UAs trigger itms-appss:// redirect loops
Randomized headers - Header order shuffled to avoid fingerprinting
Jittered delays - All requests use randomized delays to appear human
Connection limits - Cloudflare Workers allow roughly 6 simultaneous open connections per invocation, so apps are processed sequentially
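
The jitter, header shuffling, and sequential processing described above might look roughly like the sketch below. All function names are hypothetical, and the 300-500ms window is taken from the HTML scrape limits in this document. Whether the runtime preserves the shuffled header order on the wire is not something this sketch verifies.

```javascript
// Returns a delay in milliseconds drawn uniformly from [minMs, maxMs).
// `random` is injectable so the function can be tested deterministically.
function jitteredDelayMs(minMs = 300, maxMs = 500, random = Math.random) {
  return minMs + Math.floor(random() * (maxMs - minMs));
}

// Fisher-Yates shuffle of header entries; returns a new array of
// [name, value] pairs in randomized order.
function shuffleHeaders(headers, random = Math.random) {
  const entries = Object.entries(headers);
  for (let i = entries.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [entries[i], entries[j]] = [entries[j], entries[i]];
  }
  return entries;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Processes items one at a time, pausing a jittered interval between
// them, so only one outbound connection is open at any moment.
async function processSequentially(items, worker) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await worker(items[i]));
    if (i < items.length - 1) await sleep(jitteredDelayMs());
  }
  return results;
}
```

Processing one item at a time trades throughput for staying well under the connection limit and keeping the request pattern irregular.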

Google Play Strategy

DataForSEO is the primary source (no free scraping alternative)
Uses postback webhooks for async delivery

Cloudflare Images Integration

App icons are uploaded to Cloudflare Images to avoid hotlinking Apple/Google CDNs:

Upload: src/lib/cloudflare-images.js uploads icons during app enrichment
Storage: Icons stored with ID format {platform}_{app_id} for deduplication
Delivery: URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public
Denormalization: icon_url stored in both apps and app_category_rankings tables
Fallback: Original URL used if CF Images upload fails
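
The ID and URL formats above, together with the fallback rule, can be sketched as three small helpers. The function names and the uploadOk flag are hypothetical stand-ins; the actual upload logic lives in src/lib/cloudflare-images.js and is not reproduced here.

```javascript
// Builds the image ID in the {platform}_{app_id} format, so re-uploading
// the same app's icon maps to the same ID.
function buildImageId(platform, appId) {
  return `${platform}_${appId}`;
}

// Builds the delivery URL for the "public" variant.
function buildDeliveryUrl(accountHash, imageId) {
  return `https://imagedelivery.net/${accountHash}/${imageId}/public`;
}

// Picks the icon_url to store: the delivery URL when the upload succeeded,
// otherwise the original CDN URL. `uploadOk` stands in for the result of
// the real upload call (an assumption of this sketch).
function resolveIconUrl({ platform, appId, originalUrl, accountHash, uploadOk }) {
  if (!uploadOk) return originalUrl;
  return buildDeliveryUrl(accountHash, buildImageId(platform, appId));
}
```

Because the ID is derived only from the platform and app ID, the same value can be written to both the apps and app_category_rankings tables without a lookup.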

Domain Classification System

A cost-optimized 7-stage pipeline classifies domains. FREE stages run first; PAID stages run only when confidence is still below the 70% threshold.

flowchart LR
    subgraph FREE["FREE Stages"]
        S0[0. Cache] --> S1[1. Rules]
        S1 --> S1_5[1.5 Google Ads<br/>Categories]
        S1_5 --> S2[2. Vectorize]
        S2 --> S3[3. Low-Noise<br/>Crawl]
    end
    
    subgraph PAID["PAID (if needed)"]
        S3 --> S4[4. Instant Pages<br/>$0.000125]
        S4 --> S4_5[4.5 Domain<br/>Patterns]
        S4_5 --> S5[5. LLM<br/>~$0.0001]
    end
    
    S3 -->|"≥70%"| DONE[Done]
    S4 -->|"≥70%"| DONE
    S5 --> DONE
    
    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2

Stage Details

Stage	Name	Cost	Description
0	Cache	FREE	Check D1 for existing classification
1	Rules	FREE	TLDs (.gov/.edu), known domains, platform patterns
1.5	Google Ads Categories	FREE	Use cached DFS category data → tier1_type hint
2	Vectorize	FREE	Semantic similarity to labeled domains
3	Low-Noise Crawl	FREE	HEAD + partial GET (8KB), CMS/og:type detection
4	Instant Pages	$0.000125	DataForSEO full page fetch
4.5	Domain Patterns	FREE	Fallback rules for placeholder pages
5	LLM	~$0.0001	Workers AI for ambiguous cases
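
The early-exit behavior of the pipeline can be sketched as a loop over ordered stages that stops once confidence reaches the threshold. This is a simplified illustration, not the code in src/lib/domain-classifier.js: the stage object shape ({ name, cost, run }) is hypothetical, and the sketch checks the threshold after every stage, whereas the flowchart only shows explicit exits after stages 3, 4, and 5.

```javascript
const CONFIDENCE_THRESHOLD = 0.7; // the "≥70%" exit shown in the flowchart

// Each stage is { name, cost, run(domain) }, where run returns
// { label, confidence } or null when the stage has no opinion.
function classifyDomain(domain, stages) {
  let best = null;
  let spent = 0;
  const visited = [];
  for (const stage of stages) {
    visited.push(stage.name);
    spent += stage.cost; // a paid stage is charged even if it returns null
    const result = stage.run(domain);
    if (result && (!best || result.confidence > best.confidence)) {
      best = { ...result, stage: stage.name };
    }
    // Stop before reaching later (possibly paid) stages once confident.
    if (best && best.confidence >= CONFIDENCE_THRESHOLD) break;
  }
  const outcome = best ?? { label: null, confidence: 0, stage: null };
  return { ...outcome, spent, visited };
}
```

Ordering the stages array from cheapest to most expensive is what makes the loop cost-optimized: a confident free result means the paid stages are never called.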

Low-Noise Crawler

The low-noise crawler (src/lib/low-noise-crawler.js) is a FREE alternative to DataForSEO Instant Pages:

Phase 1: DNS Resolution
  └─> Check root vs www, determine canonical host

Phase 2: HEAD Request  
  └─> Follow redirects, capture server headers

Phase 3: Partial GET (Range: 0-8KB)
  └─> Extract <head> only: title, description, canonical, og:*, generator
  └─> CMS detection from generator meta tag

Detection Capabilities:

CMS: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow, Next.js
og:type mapping: product → ecommerce, article → blog, music/video → streaming
Parked domains: "domain for sale", "coming soon" patterns
Content signals: SaaS keywords, ecommerce, news patterns
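
A rough sketch of how the partial GET and the detections above could fit together is shown below. It is not the code in src/lib/low-noise-crawler.js: the helper names are hypothetical, the regex-based parsing is a deliberate simplification, and the parked-domain patterns cover only the two phrases quoted above.

```javascript
// Request only the first 8 KB (bytes 0 through 8191) of the page.
const PARTIAL_GET_HEADERS = { Range: "bytes=0-8191" };

// og:type to site type mapping, as listed above.
const OG_TYPE_MAP = new Map([
  ["product", "ecommerce"],
  ["article", "blog"],
  ["music", "streaming"],
  ["video", "streaming"],
]);

// Reads the content attribute of a <meta> tag identified by attr="key".
// Simplification: assumes the identifying attribute appears before
// content and that both values are double-quoted.
function metaContent(html, attr, key) {
  const re = new RegExp(`<meta[^>]*${attr}="${key}"[^>]*content="([^"]*)"`, "i");
  const match = html.match(re);
  return match ? match[1] : null;
}

function detectFromHead(html) {
  const generator = metaContent(html, "name", "generator");
  const ogType = metaContent(html, "property", "og:type");
  // og:type values such as "video.movie" are reduced to their prefix.
  const ogPrefix = ogType ? ogType.split(".")[0].toLowerCase() : null;
  return {
    // Generator strings usually start with the product name, e.g. "WordPress 6.4".
    cms: generator ? generator.split(" ")[0] : null,
    siteType: ogPrefix ? OG_TYPE_MAP.get(ogPrefix) ?? null : null,
    parked: /domain (is )?for sale|coming soon/i.test(html),
  };
}
```

Passing PARTIAL_GET_HEADERS to fetch asks the server for the first 8 KB only; servers that ignore Range return the full body, so a real crawler would still need to cap how much it reads.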

Why Low-Noise First:

FREE - No API costs
Fast - Only 8KB vs full page
Stealthy - HEAD + Range header mimics browser prefetch
Effective - Handles ~70% of domains without Instant Pages

Classification Output

Each domain gets classified with:

property_type - Specific type (saas_product, ecommerce_store, news_publisher, etc.)
tier1_type - High-level archetype (platform, commerce, service, information, etc.)
channel - Marketing channel bucket
media_type - PESO model (paid, earned, shared, owned)
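
As an illustration of the output shape, the sketch below checks that a classification record carries all four fields and a valid PESO media_type. The validator itself is hypothetical, and the channel value in the example ("organic_search") is a made-up placeholder, since this document does not list the channel buckets.

```javascript
const MEDIA_TYPES = new Set(["paid", "earned", "shared", "owned"]); // PESO model
const REQUIRED_FIELDS = ["property_type", "tier1_type", "channel", "media_type"];

// Returns a list of problems; an empty list means the record looks well-formed.
function validateClassification(record) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof record[field] !== "string" || record[field] === "") {
      problems.push(`missing ${field}`);
    }
  }
  if (record.media_type && !MEDIA_TYPES.has(record.media_type)) {
    problems.push(`invalid media_type: ${record.media_type}`);
  }
  return problems;
}

// Example record; the channel value is a placeholder, not a documented bucket.
const example = {
  property_type: "saas_product",
  tier1_type: "platform",
  channel: "organic_search",
  media_type: "owned",
};
```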

See docs/domain-onboarding-flow.md for the full domain onboarding flow. See docs/backlink-intelligence.md for backlink classification details.
