# Architecture & Flows

High-level overview of how the RankDisco edge worker coordinates external systems. Detailed flowcharts live in `docs/reference/flows.md`.
## Project Structure

The `packages/api/src/` directory is organized as follows:
```
src/
├── index.js            # Main router, registers all endpoints
├── endpoints/          # HTTP API handlers (organized by feature)
│   ├── admin/          # Administrative operations
│   ├── apps/           # App Store/Google Play endpoints
│   ├── assets/         # Static asset serving
│   ├── classification/ # URL/domain classification
│   ├── crawl/          # Web crawling endpoints
│   ├── dataforseo/     # DataForSEO webhooks
│   ├── debug/          # Debugging endpoints
│   ├── domains/        # Domain management
│   ├── keywords/       # Keyword research/tracking
│   ├── misc/           # Utility endpoints
│   ├── projects/       # Project management
│   ├── social/         # Social media scraping
│   ├── test/           # Test endpoints
│   ├── tracking/       # Ranking tracking
│   └── workflows/      # Workflow triggers
├── lib/                # Shared library code
│   ├── classification/ # Classification pipeline
│   ├── crawl/          # Crawling utilities
│   ├── dataforseo/     # DataForSEO API client
│   ├── domains/        # Domain/URL utilities
│   ├── integrations/   # Third-party clients
│   ├── keywords/       # Keyword management
│   ├── parsing/        # HTML parsing
│   ├── social/         # Social scraping
│   ├── storage/        # Storage clients
│   ├── utils/          # General utilities
│   └── workflows/      # Workflow helpers
├── queue/              # Queue consumers
├── workflows/          # Cloudflare Workflows (TypeScript)
└── data/               # Static data files
```
Each folder contains an `AGENTS.md` file documenting its purpose and contents.
## System Components
| Component | Role | Key Files |
|---|---|---|
| Cloudflare Worker | Orchestrates /run, /confirm-categories, queue consumers, diagnostics. Stores transient state in KV/R2. | src/index.js, src/lib/keywords/enrich.js, src/lib/keywords/harvest.js |
| React Client | Collects user input, polls status, renders harvest results, performs Base44 CRUD via API. | External app (see docs/react.md) |
| Base44 | Canonical entities: projects, business types, categories, keywords, relationships. | src/lib/integrations/base44-client.js |
| ClickHouse | Append-only analytics store for keyword metrics and trends. | src/lib/storage/clickhouse.js |
| DataForSEO | Provides HTML snapshots (Instant Pages), domain keyword seeds (Labs), Google Ads categories. App store data as fallback. | src/lib/keywords/harvest.js, src/lib/dataforseo/dataforseo-api.js |
| Cloudflare Images | Stores app icons to avoid hotlinking Apple/Google CDN URLs. Returns delivery URLs. | src/lib/storage/cloudflare-images.js |
| iTunes API | Authoritative source for Apple app metadata (rating, release_date, version, size). | src/lib/integrations/itunes-api.js |
| KV / R2 / Queues | KV for budgets + run state, R2 for raw HTML, Queues for harvest processing. | Configured via wrangler.toml |
| Low-Noise Crawler | FREE homepage metadata extraction using HEAD + partial GET. Runs before paid APIs. | src/lib/crawl/low-noise-crawler.js |
| Domain Classifier | 7-stage cost-optimized pipeline for domain classification. | src/lib/classification/domain-classifier.js |
## Request Flow (Summary)

- `POST /run`
  - Check quota & idempotency in KV.
  - Fetch and store HTML in R2.
  - Run the enrichment LLM to classify business type and categories.
  - Return `awaiting_category_confirmation`.
- `POST /run/{id}/confirm-categories`
  - Merge user input, validate by business type.
  - Queue `harvest_keywords`.
- `harvest_keywords` consumer
  - Fetch enrichment + customization from KV.
  - Merge DataForSEO domain keywords with AI-generated category keywords.
  - Persist to Base44 and ClickHouse.
- `GET /run/{id}/status`
  - Exposes run metadata to the React client until the harvest completes.
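The quota and idempotency gate at the top of `POST /run` can be sketched as a pure decision function. This is an illustration of the pattern, not the worker's actual code; the input shape (`existingRun`, `used`, `limit`) is an assumption about what the handler reads from KV.

```javascript
// Decide whether a run may proceed, given state already loaded from KV.
// existingRun: a previously stored run for the same idempotency key, or null.
// used/limit: the caller's current quota consumption and budget.
function gateRun({ existingRun, used, limit }) {
  if (existingRun) {
    // Idempotent replay: return the stored run instead of starting a new one.
    return { action: "replay", run: existingRun };
  }
  if (used >= limit) {
    // Quota exhausted: reject before any paid work happens.
    return { action: "reject", reason: "quota_exhausted" };
  }
  return { action: "start" };
}
```

Keeping the decision pure (KV reads happen before, writes after) makes the gate trivially unit-testable.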
## Data Ownership

- React/Base44 owns entity lifecycles.
- The Worker owns operational pipelines and metrics insertion.
- ClickHouse stores facts only (snapshots/trends); see `docs/data-architecture.md`.
## Developer Touchpoints

- Run diagnostics and health checks via `/diagnostics/*` and `/test/clickhouse`.
- Secrets and bindings are defined in `wrangler.toml`; update via `wrangler secret put`.
- Use `docs/backend.md` for deeper backend behavior and `docs/react.md` for UI expectations.
## Apple App Store Scraping Strategy
The worker uses a tiered approach for fetching Apple app data, optimized for cost and reliability:
### Data Source Priority (Apple)

1. HTML scrape (FREE) - Direct fetch to `apps.apple.com` using a Desktop Safari UA.
   - Provides: `similar_apps`, `more_apps_by_developer`, `primary_category`, `support_url`, `privacy_url`
   - Rate limited: 300-500ms delays with jitter, 10s backoff on 403/429
2. iTunes API (FREE) - Always runs for authoritative metadata.
   - Provides: `rating`, `rating_count`, `release_date`, `version`, `size_bytes`, `description`
   - Falls back to the ZenRows proxy if direct calls are blocked
   - Rate limited: 5 req/sec max, 200ms minimum delay, 3s initial backoff
3. DataForSEO (PAID, ~$0.0012/app) - Last-resort fallback.
   - Only used if both the HTML scrape and the iTunes API fail completely
### Critical Implementation Notes

- Desktop Safari UAs required - mobile UAs trigger `itms-appss://` redirect loops
- Randomized headers - header order is shuffled to avoid fingerprinting
- Jittered delays - all requests use randomized delays to appear human
- Connection limits - Cloudflare Workers allow ~6 concurrent connections; process sequentially
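The jittered-delay and backoff behavior above can be sketched as a small fetch wrapper. The UA string, delay bounds, and function names here are illustrative, not copied from the worker:

```javascript
// A plausible Desktop Safari UA (illustrative; the worker may rotate several).
const DESKTOP_SAFARI_UA =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 " +
  "(KHTML, like Gecko) Version/17.4 Safari/605.1.15";

// Random delay in [min, max) ms, so request timing never looks periodic.
function jitteredDelayMs(min = 300, max = 500) {
  return min + Math.random() * (max - min);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch with a pre-request jitter and a ~10s pause on 403/429, matching the
// rate-limit strategy described above. Requests are issued one at a time by
// the caller, respecting the ~6-connection limit.
async function politeFetch(url) {
  await sleep(jitteredDelayMs());
  const res = await fetch(url, {
    headers: { "User-Agent": DESKTOP_SAFARI_UA },
  });
  if (res.status === 403 || res.status === 429) {
    await sleep(10_000); // back off before the caller retries
  }
  return res;
}
```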
## Google Play Strategy
- DataForSEO is the primary source (no free scraping alternative)
- Uses postback webhooks for async delivery
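DataForSEO's task-based endpoints accept a `postback_url` on task creation and POST the finished result to it, which is how the async delivery above works. A hedged sketch of building such a task payload; the field values and the helper itself are illustrative, not the worker's actual request:

```javascript
// Build a DataForSEO task array (their task_post endpoints take an array of
// task objects). location_code 2840 = United States in DataForSEO's scheme.
function buildPlayStoreTask(appId, webhookUrl) {
  return [
    {
      app_id: appId,
      language_code: "en",
      location_code: 2840,
      // DataForSEO POSTs the completed result to this URL instead of
      // requiring us to poll task_get.
      postback_url: webhookUrl,
    },
  ];
}
```

The webhook handler under `src/endpoints/dataforseo/` would then receive that POST and persist the result.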
## Cloudflare Images Integration

App icons are uploaded to Cloudflare Images to avoid hotlinking Apple/Google CDNs:

- Upload: `src/lib/storage/cloudflare-images.js` uploads icons during app enrichment
- Storage: icons are stored with the ID format `{platform}_{app_id}` for deduplication
- Delivery: URLs use the format `https://imagedelivery.net/{account_hash}/{image_id}/public`
- Denormalization: `icon_url` is stored in both the `apps` and `app_category_rankings` tables
- Fallback: the original URL is used if the CF Images upload fails
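The upload-with-fallback behavior can be sketched as follows, assuming Cloudflare Images' standard "upload via URL" endpoint; the `env.*` binding names are illustrative, not the worker's actual configuration:

```javascript
// Pure helpers mirroring the ID and delivery-URL formats described above.
function imageId(platform, appId) {
  return `${platform}_${appId}`; // dedupe: one image per app
}

function deliveryUrl(accountHash, id) {
  return `https://imagedelivery.net/${accountHash}/${id}/public`;
}

// Upload an icon by URL; on any failure, fall back to the original CDN URL.
async function storeIcon(env, platform, appId, iconUrl) {
  const id = imageId(platform, appId);
  const form = new FormData();
  form.append("url", iconUrl); // Cloudflare fetches the icon server-side
  form.append("id", id);
  try {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${env.CF_ACCOUNT_ID}/images/v1`,
      {
        method: "POST",
        headers: { Authorization: `Bearer ${env.CF_IMAGES_TOKEN}` },
        body: form,
      },
    );
    const body = await res.json();
    if (body.success) return deliveryUrl(env.CF_ACCOUNT_HASH, id);
  } catch {
    // Network error: fall through to the fallback below.
  }
  return iconUrl;
}
```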
## Domain Classification System
Cost-optimized 7-stage pipeline for classifying domains. FREE stages run first, PAID stages only when confidence is insufficient.
### Stage Details
| Stage | Name | Cost | Description |
|---|---|---|---|
| 0 | Cache | FREE | Check D1 for existing classification |
| 1 | Rules | FREE | TLDs (.gov/.edu), known domains, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data → tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to labeled domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), CMS/og:type detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
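The staged dispatch above can be sketched as a loop that stops at the first confident answer, so paid stages only run when the free ones fail. The stage shape and the 0.8 threshold are assumptions, not the classifier's actual values (and the real stages would be async, hitting D1, Vectorize, and external APIs):

```javascript
// Run stages in cost order; each stage returns null (no opinion) or a
// candidate { property_type, confidence }. Stop at the first confident hit.
function classifyDomain(domain, stages, threshold = 0.8) {
  for (const { name, run } of stages) {
    const result = run(domain);
    if (result && result.confidence >= threshold) {
      return { ...result, stage: name };
    }
  }
  // Nothing was confident enough: surface that rather than guessing.
  return { property_type: "unknown", confidence: 0, stage: "none" };
}
```

The key property is that a cheap early stage short-circuits everything after it, which is what makes the pipeline cost-optimized.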
### Low-Noise Crawler

The low-noise crawler (`src/lib/crawl/low-noise-crawler.js`) is a FREE alternative to DataForSEO Instant Pages:

```
Phase 1: DNS Resolution
└─> Check root vs www, determine canonical host
Phase 2: HEAD Request
└─> Follow redirects, capture server headers
Phase 3: Partial GET (Range: 0-8KB)
└─> Extract <head> only: title, description, canonical, og:*, generator
└─> CMS detection from generator meta tag
```
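Phase 3 can be sketched as a ranged fetch plus lightweight `<head>` extraction. The regexes here are deliberately simplified illustrations; the real crawler presumably handles more attribute orderings and tags:

```javascript
// Fetch only the first 8 KB of the page. Servers that honor Range return
// 206 Partial Content; servers that ignore it return 200 with the full body.
async function partialHead(url) {
  const res = await fetch(url, { headers: { Range: "bytes=0-8191" } });
  const html = await res.text();
  return extractHead(html);
}

// Pull a few signals out of the <head> markup with naive regexes.
function extractHead(html) {
  const pick = (re) => (html.match(re) || [])[1] || null;
  return {
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    description: pick(/<meta[^>]+name=["']description["'][^>]+content=["']([^"']*)/i),
    // The generator meta tag is the primary CMS signal (e.g. "WordPress 6.4").
    generator: pick(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']*)/i),
  };
}
```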
Detection Capabilities:
- CMS: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow, Next.js
- og:type mapping: product → ecommerce, article → blog, music/video → streaming
- Parked domains: "domain for sale", "coming soon" patterns
- Content signals: SaaS keywords, ecommerce, news patterns
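The signal-to-type mappings above lend themselves to a simple lookup; the table values come from the list, while the function shape and the exact `og:type` keys are assumptions:

```javascript
// og:type values mapped to property-type hints, per the list above.
const OG_TYPE_MAP = {
  product: "ecommerce",
  article: "blog",
  "music.song": "streaming",
  "video.movie": "streaming",
};

// Parked-domain phrases, checked before any og:type mapping.
const PARKED_PATTERNS = [/domain for sale/i, /coming soon/i];

function classifyFromSignals({ ogType, bodySnippet }) {
  if (PARKED_PATTERNS.some((re) => re.test(bodySnippet || ""))) return "parked";
  return OG_TYPE_MAP[ogType] || null; // null => no free signal, escalate
}
```

Returning `null` is the "no opinion" case that lets the pipeline escalate to Instant Pages or the LLM stage.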
Why Low-Noise First:
- FREE - No API costs
- Fast - Only 8KB vs full page
- Stealthy - HEAD + Range header mimics browser prefetch
- Effective - Handles ~70% of domains without Instant Pages
### Classification Output

Each domain is classified with:

- `property_type` - specific type (saas_product, ecommerce_store, news_publisher, etc.)
- `tier1_type` - high-level archetype (platform, commerce, service, information, etc.)
- `channel` - marketing channel bucket
- `media_type` - PESO model (paid, earned, shared, owned)
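A record carrying these four fields might look like the following; the concrete values (and the `channel` vocabulary in particular) are illustrative, not taken from the actual schema:

```javascript
// Hypothetical classification record for a SaaS homepage.
const exampleClassification = {
  property_type: "saas_product", // specific type
  tier1_type: "platform",        // high-level archetype
  channel: "owned_web",          // marketing channel bucket (value assumed)
  media_type: "owned",           // PESO bucket
};
```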
See `docs/domain-onboarding-flow.md` for the full domain onboarding flow.
See `docs/backlink-intelligence.md` for backlink classification details.