Architecture & Flows
High-level overview of how the RankFabric edge worker coordinates external systems. Detailed flowcharts live in docs/reference/flows.md.
System Components
| Component | Role | Key Files |
|---|---|---|
| Cloudflare Worker | Orchestrates /run, /confirm-categories, queue consumers, diagnostics. Stores transient state in KV/R2. | src/index.js, src/lib/enrich.js, src/lib/harvest.js |
| React Client | Collects user input, polls status, renders harvest results, performs Base44 CRUD via API. | External app (see docs/react.md) |
| Base44 | Canonical entities: projects, business types, categories, keywords, relationships. | src/lib/base44-client.js |
| ClickHouse | Append-only analytics store for keyword metrics and trends. | src/lib/clickhouse.js |
| DataForSEO | Provides HTML snapshots (Instant Pages), domain keyword seeds (Labs), Google Ads categories. App store data as fallback. | src/lib/harvest.js, src/lib/dataforseo-api.js |
| Cloudflare Images | Stores app icons to avoid hotlinking Apple/Google CDN URLs. Returns delivery URLs. | src/lib/cloudflare-images.js |
| iTunes API | Authoritative source for Apple app metadata (rating, release_date, version, size). | src/lib/itunes-api.js |
| KV / R2 / Queues | KV for budgets + run state, R2 for raw HTML, Queues for harvest processing. | Configured via wrangler.toml |
| Low-Noise Crawler | FREE homepage metadata extraction using HEAD + partial GET. Runs before paid APIs. | src/lib/low-noise-crawler.js |
| Domain Classifier | 7-stage cost-optimized pipeline for domain classification. | src/lib/domain-classifier.js |
Request Flow (Summary)
- POST /run
  - Check quota & idempotency in KV.
  - Fetch and store HTML in R2.
  - Run enrichment LLM to classify business type and categories.
  - Return awaiting_category_confirmation.
- POST /run/ID/confirm-categories
  - Merge user input, validate by business type.
  - Queue harvest_keywords.
- harvest_keywords consumer
  - Fetch enrichment + customization from KV.
  - Merge DataForSEO domain keywords with AI-generated category keywords.
  - Persist to Base44 and ClickHouse.
- GET /run/ID/status
  - Exposes run metadata for the React client until the harvest completes.
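
The run lifecycle above can be sketched in memory. This is a minimal illustration, not the worker's actual code: a Map stands in for KV, an array stands in for the queue, the quota check and R2/LLM steps are omitted, and the function names and the harvest_queued status are assumptions. Only awaiting_category_confirmation and harvest_keywords come from the flow above.

```javascript
// In-memory stand-ins for KV and the harvest queue (assumptions for illustration).
function createRunStore() {
  return { kv: new Map(), queue: [] };
}

// POST /run: idempotent on runId, ends in awaiting_category_confirmation.
function startRun(store, runId, enrichment) {
  const existing = store.kv.get(runId);
  if (existing) return existing; // replaying the same run returns the stored state
  const run = { id: runId, status: "awaiting_category_confirmation", enrichment };
  store.kv.set(runId, run);
  return run;
}

// POST /run/ID/confirm-categories: merge user input, then queue harvest_keywords.
function confirmCategories(store, runId, userCategories) {
  const run = store.kv.get(runId);
  if (!run || run.status !== "awaiting_category_confirmation") {
    throw new Error(`run ${runId} is not awaiting confirmation`);
  }
  // Union of enrichment categories and user-supplied ones, without duplicates.
  const categories = [...new Set([...run.enrichment.categories, ...userCategories])];
  const updated = { ...run, categories, status: "harvest_queued" }; // hypothetical status name
  store.kv.set(runId, updated);
  store.queue.push({ type: "harvest_keywords", runId });
  return updated;
}

// GET /run/ID/status: expose run metadata for polling clients.
function getStatus(store, runId) {
  const run = store.kv.get(runId);
  return run ? { id: run.id, status: run.status } : null;
}
```

Keeping the state transition in one place makes it easy to reject a confirm call for a run that is missing or already past confirmation.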
Data Ownership
- React/Base44 owns entity lifecycles.
- Worker owns operational pipelines and metrics insertion.
- ClickHouse stores facts only (snapshots/trends); see docs/data-architecture.md.
Developer Touchpoints
- Run diagnostics and health checks via /diagnostics/* and /test/clickhouse.
- Secrets and bindings defined in wrangler.toml; update via wrangler secret put.
- Use docs/backend.md for deeper backend behavior and docs/react.md for UI expectations.
Apple App Store Scraping Strategy
The worker uses a tiered approach for fetching Apple app data, optimized for cost and reliability:
Data Source Priority (Apple)
- HTML Scrape (FREE) - Direct fetch to apps.apple.com using Desktop Safari UA
  - Provides: similar_apps, more_apps_by_developer, primary_category, support_url, privacy_url
  - Rate limited: 300-500ms delays with jitter, 10s backoff on 403/429
- iTunes API (FREE) - Always runs for authoritative metadata
  - Provides: rating, rating_count, release_date, version, size_bytes, description
  - Falls back to ZenRows proxy if direct calls are blocked
  - Rate limited: 5 req/sec max, 200ms minimum delay, 3s initial backoff
- DataForSEO (PAID, ~$0.0012/app) - Last resort fallback
  - Only used if both HTML scrape and iTunes API fail completely
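
The priority order above could be expressed as a small resolver. The sketch below assumes each free source has already been tried and yields either a plain object or null on failure. The function name, the fetchPaid callback, and the rule that iTunes fields win on conflict are illustrative assumptions, not the worker's confirmed behavior.

```javascript
// Merge the two free sources; call the paid fallback only when both failed.
// htmlResult / itunesResult: object on success, null on failure.
// fetchPaid: hypothetical callback standing in for the DataForSEO lookup.
function resolveAppleApp(htmlResult, itunesResult, fetchPaid) {
  if (htmlResult === null && itunesResult === null) {
    return { source: "dataforseo", data: fetchPaid() };
  }
  // Spread order means iTunes values override HTML values for shared keys,
  // on the assumption that iTunes is the authoritative metadata source.
  const data = { ...(htmlResult ?? {}), ...(itunesResult ?? {}) };
  const source = [htmlResult && "html", itunesResult && "itunes"]
    .filter(Boolean)
    .join("+");
  return { source, data };
}
```

Passing the paid lookup as a callback keeps the cost visible at the call site: it is never invoked unless both free tiers returned nothing.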
Critical Implementation Notes
- Desktop Safari UAs required - Mobile UAs trigger itms-appss:// redirect loops
- Randomized headers - Header order shuffled to avoid fingerprinting
- Jittered delays - All requests use randomized delays to appear human
- Connection limits - CF Workers have ~6 concurrent connections; process sequentially
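
A minimal sketch of the jitter, backoff, and header-shuffling ideas above. The helper names are hypothetical. The 300-500ms window and the 10s backoff on 403/429 are taken from the HTML scrape limits listed earlier. Whether the runtime preserves header insertion order on the wire is not confirmed by this document, so treat the shuffle as illustrative.

```javascript
// Uniform integer delay in [minMs, maxMs]. rng is injectable for deterministic tests.
function jitteredDelayMs(minMs = 300, maxMs = 500, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs + 1));
}

// Backoff for blocked responses: 10s on 403/429, otherwise none.
function backoffMsForStatus(status) {
  return status === 403 || status === 429 ? 10_000 : 0;
}

// Fisher-Yates shuffle over header entries. Returns a new object whose
// insertion order is randomized; the input object is left untouched.
function shuffleHeaders(headers, rng = Math.random) {
  const entries = Object.entries(headers);
  for (let i = entries.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [entries[i], entries[j]] = [entries[j], entries[i]];
  }
  return Object.fromEntries(entries);
}
```

Because of the connection limit noted above, a caller would typically await one request, sleep for jitteredDelayMs(), and only then start the next.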
Google Play Strategy
- DataForSEO is the primary source (no free scraping alternative)
- Uses postback webhooks for async delivery
Cloudflare Images Integration
App icons are uploaded to Cloudflare Images to avoid hotlinking Apple/Google CDNs:
- Upload: src/lib/cloudflare-images.js uploads icons during app enrichment
- Storage: Icons stored with ID format {platform}_{app_id} for deduplication
- Delivery: URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public
- Denormalization: icon_url stored in both apps and app_category_rankings tables
- Fallback: Original URL used if CF Images upload fails
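
The ID and delivery URL formats above can be built with two small helpers. This is a sketch only: the function names and the shape of the options object are assumptions, and the upload call itself is left out.

```javascript
// Deterministic image ID, so re-uploading the same app's icon deduplicates.
function buildImageId(platform, appId) {
  return `${platform}_${appId}`;
}

// Delivery URL in the format documented above, using the "public" variant.
function buildDeliveryUrl(accountHash, imageId) {
  return `https://imagedelivery.net/${accountHash}/${imageId}/public`;
}

// Value to store as icon_url: the delivery URL when the upload succeeded,
// otherwise the original CDN URL as a fallback.
function resolveIconUrl({ platform, appId, originalUrl, accountHash, uploaded }) {
  return uploaded
    ? buildDeliveryUrl(accountHash, buildImageId(platform, appId))
    : originalUrl;
}
```

For example, resolveIconUrl({ platform: "apple", appId: "123", originalUrl: "https://example.com/icon.png", accountHash: "abc", uploaded: true }) yields a URL ending in /apple_123/public.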
Domain Classification System
Cost-optimized 7-stage pipeline for classifying domains. FREE stages run first, PAID stages only when confidence is insufficient.
flowchart LR
subgraph FREE["FREE Stages"]
S0[0. Cache] --> S1[1. Rules]
S1 --> S1_5[1.5 Google Ads<br/>Categories]
S1_5 --> S2[2. Vectorize]
S2 --> S3[3. Low-Noise<br/>Crawl]
end
subgraph PAID["PAID (if needed)"]
S3 --> S4[4. Instant Pages<br/>$0.000125]
S4 --> S4_5[4.5 Domain<br/>Patterns]
S4_5 --> S5[5. LLM<br/>~$0.0001]
end
S3 -->|"≥70%"| DONE[Done]
S4 -->|"≥70%"| DONE
S5 --> DONE
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S5 fill:#ffcdd2
Stage Details
| Stage | Name | Cost | Description |
|---|---|---|---|
| 0 | Cache | FREE | Check D1 for existing classification |
| 1 | Rules | FREE | TLDs (.gov/.edu), known domains, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data → tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to labeled domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), CMS/og:type detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
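
The cascade can be sketched as a loop over ordered stages that stops once confidence reaches the threshold. The stage object shape ({ name, cost, run }) and the result shape are assumptions for illustration. As a simplification, this sketch allows an early exit after any stage, whereas the flowchart draws the 70% exits only after stages 3 and 4.

```javascript
const CONFIDENCE_THRESHOLD = 0.7; // the "≥70%" edge in the flowchart

// stages: ordered array of { name, cost, run(domain) }, where run returns
// { type, confidence } or null. Stage implementations are hypothetical.
function classifyDomain(domain, stages, threshold = CONFIDENCE_THRESHOLD) {
  let best = null;
  let spent = 0;
  const trace = [];
  for (const stage of stages) {
    spent += stage.cost; // cost is incurred whether or not the stage helps
    trace.push(stage.name);
    const result = stage.run(domain);
    if (result && (!best || result.confidence > best.confidence)) {
      best = { ...result, stage: stage.name };
    }
    if (best && best.confidence >= threshold) break; // skip remaining (paid) stages
  }
  return { domain, result: best, spent, trace };
}
```

Ordering the free stages first means the spent total stays at zero for any domain that a rule or crawl can classify confidently.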
Low-Noise Crawler
The low-noise crawler (src/lib/low-noise-crawler.js) is a FREE alternative to DataForSEO Instant Pages:
Phase 1: DNS Resolution
└─> Check root vs www, determine canonical host
Phase 2: HEAD Request
└─> Follow redirects, capture server headers
Phase 3: Partial GET (Range: 0-8KB)
└─> Extract <head> only: title, description, canonical, og:*, generator
└─> CMS detection from generator meta tag
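
Phases 2 and 3 might translate into request descriptors like the ones below. This is a sketch under assumptions: the function names are hypothetical, "8KB" is read as the byte range 0-8191, and the local truncation is added because a server is free to ignore Range and return the full body.

```javascript
const MAX_HEAD_BYTES = 8 * 1024;

// Describe the HEAD and partial GET requests for a host without sending them.
// Each init object is shaped so it could be passed to fetch(url, init).
function buildCrawlPlan(host) {
  const url = `https://${host}/`;
  return [
    { phase: "head", url, init: { method: "HEAD", redirect: "follow" } },
    {
      phase: "partial_get",
      url,
      init: { method: "GET", headers: { Range: `bytes=0-${MAX_HEAD_BYTES - 1}` } },
    },
  ];
}

// Keep only the <head> portion of whatever came back. The cap counts string
// characters, which only approximates bytes for non-ASCII content.
function truncateToHead(body, maxChars = MAX_HEAD_BYTES) {
  const capped = body.slice(0, maxChars);
  const end = capped.toLowerCase().indexOf("</head>");
  return end === -1 ? capped : capped.slice(0, end + "</head>".length);
}
```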
Detection Capabilities:
- CMS: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow, Next.js
- og:type mapping: product → ecommerce, article → blog, music/video → streaming
- Parked domains: "domain for sale", "coming soon" patterns
- Content signals: SaaS keywords, ecommerce, news patterns
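
A simplified version of these detections, run against the extracted head markup. The regex-based parsing, the helper names, and the exact parked-domain patterns are assumptions. The real module may use a proper HTML parser and a larger rule set. The og:type mapping follows the list above, with dotted values such as video.movie matched on their prefix.

```javascript
const OG_TYPE_MAP = { product: "ecommerce", article: "blog", music: "streaming", video: "streaming" };
const PARKED_PATTERNS = [/domain (is )?for sale/i, /coming soon/i];

// Read the content attribute of the first <meta> tag whose attr equals key.
function metaContent(head, attr, key) {
  const tag = head.match(new RegExp(`<meta[^>]*${attr}=["']${key}["'][^>]*>`, "i"));
  if (!tag) return null;
  const content = tag[0].match(/content=["']([^"']*)["']/i);
  return content ? content[1] : null;
}

// Derive CMS, a type hint, and a parked-domain flag from <head> markup.
function detectSignals(head) {
  const generator = metaContent(head, "name", "generator");
  const ogType = metaContent(head, "property", "og:type");
  const title = (head.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1] || "";
  const ogPrefix = ogType ? ogType.split(".")[0].toLowerCase() : null;
  return {
    cms: generator ? generator.split(/\s+/)[0] : null, // "WordPress 6.4" -> "WordPress"
    typeHint: ogPrefix ? (OG_TYPE_MAP[ogPrefix] ?? null) : null,
    parked: PARKED_PATTERNS.some((re) => re.test(title)),
  };
}
```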
Why Low-Noise First:
- FREE - No API costs
- Fast - Only 8KB vs full page
- Stealthy - HEAD + Range header mimics browser prefetch
- Effective - Handles ~70% of domains without Instant Pages
Classification Output
Each domain gets classified with:
- property_type - Specific type (saas_product, ecommerce_store, news_publisher, etc.)
- tier1_type - High-level archetype (platform, commerce, service, information, etc.)
- channel - Marketing channel bucket
- media_type - PESO model (paid, earned, shared, owned)
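
For illustration, a classification record and a small validator might look like this. The four field names and the PESO values come from the list above. The validator itself and the sample channel value are hypothetical.

```javascript
const MEDIA_TYPES = ["paid", "earned", "shared", "owned"]; // PESO model
const REQUIRED_FIELDS = ["property_type", "tier1_type", "channel", "media_type"];

// Return a list of problems; an empty list means the record looks well formed.
function validateClassification(c) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof c[field] !== "string" || c[field] === "") errors.push(`missing ${field}`);
  }
  if (c.media_type && !MEDIA_TYPES.includes(c.media_type)) {
    errors.push(`invalid media_type: ${c.media_type}`);
  }
  return errors;
}

// Example record; "referral" is a made-up channel value.
const exampleClassification = {
  property_type: "saas_product",
  tier1_type: "platform",
  channel: "referral",
  media_type: "earned",
};
```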
See docs/domain-onboarding-flow.md for the full domain onboarding flow.
See docs/backlink-intelligence.md for backlink classification details.