Architecture & Flows

High-level overview of how the RankDisco edge worker coordinates external systems. Detailed flowcharts live in docs/reference/flows.md.

Project Structure

The packages/api/src/ directory is organized as follows:

src/
├── index.js            # Main router, registers all endpoints
├── endpoints/          # HTTP API handlers (organized by feature)
│   ├── admin/          # Administrative operations
│   ├── apps/           # App Store/Google Play endpoints
│   ├── assets/         # Static asset serving
│   ├── classification/ # URL/domain classification
│   ├── crawl/          # Web crawling endpoints
│   ├── dataforseo/     # DataForSEO webhooks
│   ├── debug/          # Debugging endpoints
│   ├── domains/        # Domain management
│   ├── keywords/       # Keyword research/tracking
│   ├── misc/           # Utility endpoints
│   ├── projects/       # Project management
│   ├── social/         # Social media scraping
│   ├── test/           # Test endpoints
│   ├── tracking/       # Ranking tracking
│   └── workflows/      # Workflow triggers
├── lib/                # Shared library code
│   ├── classification/ # Classification pipeline
│   ├── crawl/          # Crawling utilities
│   ├── dataforseo/     # DataForSEO API client
│   ├── domains/        # Domain/URL utilities
│   ├── integrations/   # Third-party clients
│   ├── keywords/       # Keyword management
│   ├── parsing/        # HTML parsing
│   ├── social/         # Social scraping
│   ├── storage/        # Storage clients
│   ├── utils/          # General utilities
│   └── workflows/      # Workflow helpers
├── queue/              # Queue consumers
├── workflows/          # Cloudflare Workflows (TypeScript)
└── data/               # Static data files

Each folder contains an AGENTS.md file documenting its purpose and contents.

System Components

| Component | Role | Key Files |
| --- | --- | --- |
| Cloudflare Worker | Orchestrates /run, /confirm-categories, queue consumers, diagnostics. Stores transient state in KV/R2. | src/index.js, src/lib/keywords/enrich.js, src/lib/keywords/harvest.js |
| React Client | Collects user input, polls status, renders harvest results, performs Base44 CRUD via API. | External app (see docs/react.md) |
| Base44 | Canonical entities: projects, business types, categories, keywords, relationships. | src/lib/integrations/base44-client.js |
| ClickHouse | Append-only analytics store for keyword metrics and trends. | src/lib/storage/clickhouse.js |
| DataForSEO | Provides HTML snapshots (Instant Pages), domain keyword seeds (Labs), Google Ads categories. App store data as fallback. | src/lib/keywords/harvest.js, src/lib/dataforseo/dataforseo-api.js |
| Cloudflare Images | Stores app icons to avoid hotlinking Apple/Google CDN URLs. Returns delivery URLs. | src/lib/storage/cloudflare-images.js |
| iTunes API | Authoritative source for Apple app metadata (rating, release_date, version, size). | src/lib/integrations/itunes-api.js |
| KV / R2 / Queues | KV for budgets + run state, R2 for raw HTML, Queues for harvest processing. | Configured via wrangler.toml |
| Low-Noise Crawler | FREE homepage metadata extraction using HEAD + partial GET. Runs before paid APIs. | src/lib/crawl/low-noise-crawler.js |
| Domain Classifier | 7-stage cost-optimized pipeline for domain classification. | src/lib/classification/domain-classifier.js |

Request Flow (Summary)

  1. POST /run
    • Check quota & idempotency in KV.
    • Fetch and store HTML in R2.
    • Run enrichment LLM to classify business type and categories.
    • Return awaiting_category_confirmation.
  2. POST /run/{id}/confirm-categories
    • Merge user input, validate by business type.
    • Queue harvest_keywords.
  3. harvest_keywords consumer
    • Fetch enrichment + customization from KV.
    • Merge DataForSEO domain keywords with AI-generated category keywords.
    • Persist to Base44 and ClickHouse.
  4. GET /run/{id}/status
    • Exposes run metadata for the React client until the harvest completes.
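The quota and idempotency gate in step 1 can be sketched as follows. This is an illustrative sketch only: a plain `Map` stands in for the KV namespace, and the key names, `DAILY_QUOTA` value, and `checkRunGate` helper are all assumptions, not the worker's actual implementation.

```javascript
// A plain Map stands in for the KV binding; key formats are assumptions.
const kv = new Map();
const DAILY_QUOTA = 50; // hypothetical daily budget

function checkRunGate(store, domain, today) {
  const idemKey = `run:${domain}:${today}`;
  if (store.has(idemKey)) {
    // Same domain already submitted today: reuse the run instead of paying again.
    return { allowed: false, reason: "duplicate", runId: store.get(idemKey) };
  }
  const used = store.get(`budget:${today}`) ?? 0;
  if (used >= DAILY_QUOTA) return { allowed: false, reason: "quota_exceeded" };
  const runId = Math.random().toString(36).slice(2); // stand-in for a real run ID
  store.set(idemKey, runId);
  store.set(`budget:${today}`, used + 1);
  return { allowed: true, runId };
}
```

In the real worker the same check would read and write the KV binding (with TTLs) rather than an in-process map.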

Data Ownership

  • React/Base44 owns entity lifecycles.
  • Worker owns operational pipelines and metrics insertion.
  • ClickHouse stores facts only (snapshots/trends); see docs/data-architecture.md.

Developer Touchpoints

  • Run diagnostics and health checks via /diagnostics/* and /test/clickhouse.
  • Secrets and bindings defined in wrangler.toml; update via wrangler secret put.
  • Use docs/backend.md for deeper backend behavior and docs/react.md for UI expectations.

Apple App Store Scraping Strategy

The worker uses a tiered approach for fetching Apple app data, optimized for cost and reliability:

Data Source Priority (Apple)

  1. HTML Scrape (FREE) - Direct fetch to apps.apple.com using Desktop Safari UA

    • Provides: similar_apps, more_apps_by_developer, primary_category, support_url, privacy_url
    • Rate limited: 300-500ms delays with jitter, 10s backoff on 403/429
  2. iTunes API (FREE) - Always runs for authoritative metadata

    • Provides: rating, rating_count, release_date, version, size_bytes, description
    • Falls back to ZenRows proxy if direct calls are blocked
    • Rate limited: 5 req/sec max, 200ms minimum delay, 3s initial backoff
  3. DataForSEO (PAID, ~$0.0012/app) - Last resort fallback

    • Only used if both HTML scrape and iTunes API fail completely
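The tiering above can be sketched as a merge of the two free sources with a paid fallback. The function name, argument order, and merge precedence are illustrative assumptions; the actual logic lives across the files named earlier.

```javascript
// Both free sources run; DataForSEO is called only if both fail completely.
async function fetchAppleApp(scrapeHtml, fetchItunes, fetchDataForSeo) {
  const [html, itunes] = await Promise.allSettled([scrapeHtml(), fetchItunes()]);
  if (html.status === "rejected" && itunes.status === "rejected") {
    // Paid last resort (~$0.0012/app).
    return { source: "dataforseo", ...(await fetchDataForSeo()) };
  }
  return {
    source: "free",
    ...(html.status === "fulfilled" ? html.value : {}),
    ...(itunes.status === "fulfilled" ? itunes.value : {}), // iTunes wins on shared fields
  };
}
```

Letting the iTunes fields spread last reflects its role as the authoritative metadata source.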

Critical Implementation Notes

  • Desktop Safari UAs required - Mobile UAs trigger itms-appss:// redirect loops
  • Randomized headers - Header order shuffled to avoid fingerprinting
  • Jittered delays - All requests use randomized delays to appear human
  • Connection limits - CF Workers have ~6 concurrent connections; process sequentially
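A delay policy matching the HTML-scrape notes above (300-500ms jitter, 10s backoff on 403/429) might look like the sketch below. The exponential growth across attempts and the function name are assumptions; the section only states the initial 10s backoff.

```javascript
// Jittered base delay, with backoff when Apple signals blocking.
function nextDelayMs(status, attempt, rand = Math.random) {
  if (status === 403 || status === 429) {
    return 10_000 * 2 ** attempt; // 10s, 20s, 40s, ... (growth is an assumption)
  }
  return 300 + Math.floor(rand() * 200); // 300-500ms with jitter
}
```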

Google Play Strategy

  • DataForSEO is the primary source (no free scraping alternative)
  • Uses postback webhooks for async delivery
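A postback consumer might filter successful tasks out of the delivered payload, roughly as below. The payload shape here is a simplified assumption for illustration, not DataForSEO's exact schema (though 20000 is DataForSEO's "Ok" status code).

```javascript
// Keep only results from tasks DataForSEO reports as successful.
function extractPostbackResults(payload) {
  return (payload.tasks ?? [])
    .filter((task) => task.status_code === 20000)
    .flatMap((task) => task.result ?? []);
}
```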

Cloudflare Images Integration

App icons are uploaded to Cloudflare Images to avoid hotlinking Apple/Google CDNs:

  • Upload: src/lib/storage/cloudflare-images.js uploads icons during app enrichment
  • Storage: Icons stored with ID format {platform}_{app_id} for deduplication
  • Delivery: URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public
  • Denormalization: icon_url stored in both apps and app_category_rankings tables
  • Fallback: Original URL used if CF Images upload fails
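The ID and URL conventions above are simple enough to sketch directly; `accountHash` is a placeholder for the real Cloudflare Images account hash, and the helper names are illustrative.

```javascript
// Deterministic ID: the same app always maps to the same image, deduplicating uploads.
function iconImageId(platform, appId) {
  return `${platform}_${appId}`;
}

// Cloudflare Images delivery URL for a stored icon.
function deliveryUrl(accountHash, imageId, variant = "public") {
  return `https://imagedelivery.net/${accountHash}/${imageId}/${variant}`;
}
```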

Domain Classification System

Cost-optimized 7-stage pipeline for classifying domains. FREE stages run first; PAID stages run only when confidence is still insufficient.

Stage Details

| Stage | Name | Cost | Description |
| --- | --- | --- | --- |
| 0 | Cache | FREE | Check D1 for existing classification |
| 1 | Rules | FREE | TLDs (.gov/.edu), known domains, platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data → tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to labeled domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), CMS/og:type detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
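The stage ordering can be sketched as an early-exit loop: run each stage in order and stop as soon as one is confident enough, so the paid stages are reached only when the free ones fail. The stage-object shape and the 0.8 threshold are illustrative assumptions.

```javascript
// Run stages in cost order; stop at the first confident result.
async function classifyDomain(domain, stages, threshold = 0.8) {
  let totalCost = 0;
  for (const stage of stages) {
    const result = await stage.run(domain);
    totalCost += stage.cost ?? 0;
    if (result && result.confidence >= threshold) {
      return { ...result, stage: stage.name, totalCost };
    }
  }
  return { property_type: "unknown", stage: "none", totalCost };
}
```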

Low-Noise Crawler

The low-noise crawler (src/lib/crawl/low-noise-crawler.js) is a FREE alternative to DataForSEO Instant Pages:

Phase 1: DNS Resolution
└─> Check root vs www, determine canonical host

Phase 2: HEAD Request
└─> Follow redirects, capture server headers

Phase 3: Partial GET (Range: 0-8KB)
└─> Extract <head> only: title, description, canonical, og:*, generator
└─> CMS detection from generator meta tag
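Phase 3's extraction can be sketched with a few regexes over the fetched chunk; a full HTML parse is unnecessary for these fields. This is an illustrative sketch, and the patterns assume the `name`/`property` attribute precedes `content`, which real pages do not guarantee.

```javascript
// Pull the listed <head> signals out of the first 8KB of HTML.
function parsePartialHead(htmlChunk) {
  const pick = (re) => (htmlChunk.match(re) || [])[1] ?? null;
  return {
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    description: pick(/<meta[^>]+name=["']description["'][^>]+content=["']([^"']*)["']/i),
    generator: pick(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']*)["']/i),
    ogType: pick(/<meta[^>]+property=["']og:type["'][^>]+content=["']([^"']*)["']/i),
  };
}
```

The chunk itself would come from a ranged request, e.g. `fetch(url, { headers: { Range: "bytes=0-8191" } })`.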

Detection Capabilities:

  • CMS: WordPress, Shopify, Ghost, Hugo, Jekyll, Wix, Squarespace, Webflow, Next.js
  • og:type mapping: product → ecommerce, article → blog, music/video → streaming
  • Parked domains: "domain for sale", "coming soon" patterns
  • Content signals: SaaS keywords, ecommerce, news patterns
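The og:type mapping in the list above could look like the sketch below; the map entries beyond product/article/music/video and the family fallback are illustrative guesses.

```javascript
const OG_TYPE_MAP = {
  product: "ecommerce",
  article: "blog",
  "music.song": "streaming",
  "video.movie": "streaming",
};

// Map an og:type value to a property-type hint, or null if it says nothing useful.
function classifyFromOgType(ogType) {
  if (!ogType) return null;
  if (OG_TYPE_MAP[ogType]) return OG_TYPE_MAP[ogType];
  const family = ogType.split(".")[0]; // e.g. "music.album" -> "music"
  return family === "music" || family === "video" ? "streaming" : null;
}
```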

Why Low-Noise First:

  • FREE - No API costs
  • Fast - Only 8KB vs full page
  • Stealthy - HEAD + Range header mimics browser prefetch
  • Effective - Handles ~70% of domains without Instant Pages

Classification Output

Each domain gets classified with:

  • property_type - Specific type (saas_product, ecommerce_store, news_publisher, etc.)
  • tier1_type - High-level archetype (platform, commerce, service, information, etc.)
  • channel - Marketing channel bucket
  • media_type - PESO model (paid, earned, shared, owned)
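A hypothetical output record with the four fields might look like this; the specific values and the shape check are illustrative, not taken from real data.

```javascript
const classification = {
  property_type: "saas_product", // specific type
  tier1_type: "platform",        // high-level archetype
  channel: "owned_web",          // marketing channel bucket (illustrative value)
  media_type: "owned",           // PESO bucket
};

const PESO = new Set(["paid", "earned", "shared", "owned"]);

// Minimal shape check a consumer might apply before persisting a record.
function isValidClassification(c) {
  return Boolean(c.property_type && c.tier1_type && c.channel) && PESO.has(c.media_type);
}
```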

See docs/domain-onboarding-flow.md for the full domain onboarding flow. See docs/backlink-intelligence.md for backlink classification details.