Backend Implementation Guide

This file replaces the earlier backend handoffs (BACKEND_IMPLEMENTATION.md, REQUIREMENTS.md, SCHEMA_*). It captures how the worker orchestrates enrichment, harvest, and persistence across Base44, DataForSEO, and ClickHouse.

Summary

  • Cloudflare Worker entrypoints: POST /run, POST /run/{id}/confirm-categories, GET /run/{id}/status, diagnostics, and queue consumers.
  • State lives in KV (DFS_* namespaces) and R2 (dfs-raw-payloads).
  • Harvest processing runs in the harvest_keywords queue consumer and ultimately writes to Base44 + ClickHouse.
  • Reference diagrams live in docs/reference/flows.md; schemas in docs/reference/schema.md; endpoint details in docs/reference/endpoints.md.

Request Lifecycle

Phase 1 – Enrichment (POST /run)

  1. Check budget + idempotency in KV (DFS_BUDGETS, DFS_IDEMPOTENCY).
  2. Fetch HTML (regex scraper first, fallback to DataForSEO Instant Pages) and persist raw payload to R2.
  3. Run LLM enrichment (src/lib/enrich.js) to classify business_type and business_focus, recommend high_level_categories, and produce seed keywords plus an embedding.
  4. Persist run state in KV (DFS_RUNS) and respond with 202 { status: "awaiting_category_confirmation" }.
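
A minimal sketch of steps 1 and 4, assuming illustrative KV key names, a hypothetical budget cap, and a simplified response shape (the real handler lives in src/index.js):

```js
// Illustrative POST /run gate and acknowledgement; key names, the budget cap,
// and the stored run shape are assumptions, not the production code.
const MONTHLY_BUDGET_CAP = 100; // hypothetical per-project run cap

async function handleRun(request, env) {
  const body = await request.json();

  // Step 1: idempotency + budget checks in KV.
  const prior = await env.DFS_IDEMPOTENCY.get(`idem:${body.idempotency_key}`);
  if (prior) return Response.json(JSON.parse(prior), { status: 202 });

  const spent = Number(await env.DFS_BUDGETS.get(`budget:${body.project_id}`)) || 0;
  if (spent >= MONTHLY_BUDGET_CAP) {
    return Response.json({ error: "budget_exceeded" }, { status: 429 });
  }

  // Steps 2-3 (HTML fetch, R2 persistence, LLM enrichment) omitted here.

  // Step 4: persist run state and acknowledge.
  const runId = crypto.randomUUID();
  const reply = { run_id: runId, status: "awaiting_category_confirmation" };
  await env.DFS_RUNS.put(`run:${runId}`, JSON.stringify({ status: reply.status, enrichment: {} }));
  await env.DFS_IDEMPOTENCY.put(`idem:${body.idempotency_key}`, JSON.stringify(reply), { expirationTtl: 86400 });
  return Response.json(reply, { status: 202 });
}
```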

Phase 2 – User Customization (POST /run/{id}/confirm-categories)

  1. Fetch stored enrichment payload and merge user form input.
  2. Apply type-specific validation (see table below).
  3. Write user_customization back to KV.
  4. Enqueue harvest_keywords with run context; respond 202 { status: "harvesting_keywords" }.
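
A sketch of the confirmation handler, assuming a hypothetical HARVEST_QUEUE producer binding and message shape; validateCustomization is sketched under Validation Rules below:

```js
// Illustrative confirm-categories handoff; binding and field names are assumptions.
async function confirmCategories(runId, formInput, env) {
  const raw = await env.DFS_RUNS.get(`run:${runId}`);
  if (!raw) return Response.json({ error: "run_not_found" }, { status: 404 });

  const run = JSON.parse(raw);
  // Apply the type-specific rules before accepting the customization.
  const customization = validateCustomization(run.enrichment, formInput);
  run.user_customization = customization;
  run.status = "harvesting_keywords";
  await env.DFS_RUNS.put(`run:${runId}`, JSON.stringify(run));

  // Hand the run context to the harvest_keywords queue consumer.
  await env.HARVEST_QUEUE.send({ runId, customization });
  return Response.json({ run_id: runId, status: "harvesting_keywords" }, { status: 202 });
}
```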

Phase 3 – Harvest & Persistence (queue consumer)

  1. Pull enrichment + customization from KV.
  2. Call DataForSEO Labs keywords_for_site for domain seed keywords.
  3. Generate AI keywords per confirmed category (src/lib/harvest.js), annotate intents, sources, and brand flags.
  4. Merge domain and AI keywords by normalized form; attach category paths and metrics.
  5. Upsert keywords and relationships in Base44 (src/lib/base44-client.js).
  6. Batch ClickHouse inserts (keyword_snapshots, monthly_keyword_searches) via src/lib/clickhouse.js.
  7. Update run status in KV (harvest payload, errors) for /status consumers.
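
A sketch of the merge in step 4, keyed on a normalized form of the keyword text; field names beyond those listed under Base44 below (e.g. search_volume) are assumptions:

```js
// Illustrative merge of domain-seeded and AI-generated keywords by normalized form.
function mergeKeywords(domainKeywords, aiKeywords) {
  const normalize = (text) => text.trim().toLowerCase().replace(/\s+/g, " ");
  const merged = new Map();

  for (const kw of [...domainKeywords, ...aiKeywords]) {
    const key = normalize(kw.keyword);
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, { ...kw, keyword: key, sources: [...(kw.sources ?? [])] });
      continue;
    }
    // Same keyword from both paths: union sources and category paths, keep metrics.
    existing.sources = [...new Set([...existing.sources, ...(kw.sources ?? [])])];
    existing.dataforseo_category_paths = [
      ...new Set([...(existing.dataforseo_category_paths ?? []), ...(kw.dataforseo_category_paths ?? [])]),
    ];
    existing.search_volume ??= kw.search_volume;
  }
  return [...merged.values()];
}
```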

Status Polling (GET /run/{id}/status)

  • Reads run metadata from KV.
  • Surfaces status, enrichment, user_customization, harvest, and any errors.
  • Terminal states: complete, failed. Intermediate: queued, awaiting_category_confirmation, harvesting_keywords.
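
A representative (not authoritative) status payload; see docs/reference/endpoints.md for the actual contract:

```js
// Illustrative /run/{id}/status response; values are placeholders.
const exampleStatus = {
  run_id: "9f3c2e6a-0000-0000-0000-000000000000",
  status: "harvesting_keywords", // queued | awaiting_category_confirmation | harvesting_keywords | complete | failed
  enrichment: { business_type: "saas", business_focus: "project management", high_level_categories: [/* recommended */] },
  user_customization: { confirmed_categories: [/* confirmed */], locations: [], app_names: [] },
  harvest: null, // populated once the queue consumer finishes
  errors: [],
};
```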

Validation Rules

Business type | Requirement when confirming | Error
--- | --- | ---
local | At least one location (≤ 5 recommended) | 400 "Local business type requires at least one location"
saas, game, marketplace | App names recommended (warn if empty) | Warn only
Others (ecommerce, service, content) | No extra fields | None

The worker also enforces category presence and trusts Base44 slugs for consistency.
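
A sketch of these rules; the 400 message for the local case comes from the table above, while helper names and the warning mechanism are illustrative:

```js
// Illustrative type-specific validation for confirm-categories.
class HttpError extends Error {
  constructor(status, message) { super(message); this.status = status; }
}

function validateCustomization(enrichment, input) {
  const warnings = [];

  // Category presence is enforced for every business type.
  if (!input.confirmed_categories?.length) {
    throw new HttpError(400, "At least one category must be confirmed");
  }

  switch (enrichment.business_type) {
    case "local":
      if (!input.locations?.length) {
        throw new HttpError(400, "Local business type requires at least one location");
      }
      break;
    case "saas":
    case "game":
    case "marketplace":
      if (!input.app_names?.length) {
        warnings.push("No app names provided; app-specific keywords will be limited");
      }
      break;
    default:
      break; // ecommerce, service, content: no extra fields required
  }

  return { ...input, warnings };
}
```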

External Integrations

Base44

  • Entities touched: Keyword, KeywordCategory, ProjectKeyword, BusinessType.
  • Keyword payloads are augmented with: sources, original_keyword_text, primary_intent, secondary_intents, brand_flag, dataforseo_category_paths, latest_trend.
  • Relationships store confidence, assigned_by, and merge existing user assignments.
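
An illustrative keyword payload showing the augmented fields; the exact entity shape is owned by Base44 and src/lib/base44-client.js, and the example sources values are assumptions:

```js
// Hypothetical Keyword upsert payload; field names follow the list above.
const keywordPayload = {
  keyword: "project management software",
  original_keyword_text: "Project Management Software",
  sources: ["domain", "ai_category"], // illustrative source labels
  primary_intent: "commercial",
  secondary_intents: ["informational"],
  brand_flag: false,
  dataforseo_category_paths: [/* taxonomy path IDs */],
  latest_trend: null,
};
```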

DataForSEO

  • Instant Pages pulls JS-rendered HTML when the regex fetch is insufficient.
  • Labs keywords_for_site seeds domain-specific keywords prior to AI generation.
  • App Store API provides app metadata as fallback when free sources fail.
  • Credentials live in Wrangler secrets (DATAFORSEO_LOGIN, DATAFORSEO_PASSWORD); rate caps controlled via DATAFORSEO_LABS_LIMIT, DATAFORSEO_LABS_MAX_REQUESTS.
  • Category taxonomy cached in KV (DATAFORSEO_CATEGORIES) with JSON fallback in data/dataforseo-categories.json.
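
A minimal sketch of the keywords_for_site call, following DataForSEO's documented v3 Labs endpoint; request fields should be checked against src/lib/harvest.js:

```js
// Illustrative Labs keywords_for_site request using Wrangler secrets for auth.
async function keywordsForSite(domain, env) {
  const auth = btoa(`${env.DATAFORSEO_LOGIN}:${env.DATAFORSEO_PASSWORD}`);
  const response = await fetch(
    "https://api.dataforseo.com/v3/dataforseo_labs/google/keywords_for_site/live",
    {
      method: "POST",
      headers: { Authorization: `Basic ${auth}`, "Content-Type": "application/json" },
      body: JSON.stringify([
        {
          target: domain,
          location_code: 2840, // United States
          language_code: "en",
          limit: Number(env.DATAFORSEO_LABS_LIMIT) || 1000,
        },
      ]),
    }
  );
  const data = await response.json();
  return data.tasks?.[0]?.result?.[0]?.items ?? [];
}
```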

iTunes API

  • Free API at https://itunes.apple.com/lookup?id={app_id}&country=us
  • Authoritative source for: rating, rating_count, release_date, version, size_bytes, description
  • Rate limited to 5 req/sec with 200ms minimum delay between requests
  • Falls back to ZenRows proxy if direct calls are blocked (Apple sometimes blocks CF Worker IPs)
  • Batch endpoint supports up to 200 app IDs per request
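
A sketch of the lookup with the 200ms pacing described above; field handling and batching are simplified relative to src/lib/itunes-api.js:

```js
// Illustrative iTunes lookup with a minimum gap between requests (~5 req/sec).
const MIN_DELAY_MS = 200;
let lastCall = 0;

async function lookupApps(appIds) {
  const wait = lastCall + MIN_DELAY_MS - Date.now();
  if (wait > 0) await new Promise((r) => setTimeout(r, wait));
  lastCall = Date.now();

  const url = `https://itunes.apple.com/lookup?id=${appIds.join(",")}&country=us`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`iTunes lookup failed: ${response.status}`);
  const { results } = await response.json();
  return results; // carries rating, rating count, release date, version, size, description
}
```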

Cloudflare Images

  • Stores app icons to avoid hotlinking Apple/Google CDN URLs
  • Upload endpoint: https://api.cloudflare.com/client/v4/accounts/{account_id}/images/v1
  • Delivery URL format: https://imagedelivery.net/{account_hash}/{image_id}/public
  • Icons stored with ID {platform}_{app_id} for deduplication (existing icons are reused)
  • Secrets: CF_IMAGES_ACCOUNT_ID, CF_IMAGES_API_TOKEN, CF_IMAGES_ACCOUNT_HASH
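
A sketch of the icon upload using the documented upload-by-URL form fields and the {platform}_{app_id} ID convention; duplicate-ID handling is simplified here:

```js
// Illustrative icon upload to Cloudflare Images with a deterministic image ID.
async function storeAppIcon(platform, appId, iconUrl, env) {
  const imageId = `${platform}_${appId}`;
  const form = new FormData();
  form.append("url", iconUrl); // upload by URL, the worker never proxies the bytes
  form.append("id", imageId);

  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.CF_IMAGES_ACCOUNT_ID}/images/v1`,
    { method: "POST", headers: { Authorization: `Bearer ${env.CF_IMAGES_API_TOKEN}` }, body: form }
  );
  const result = await response.json();

  // A duplicate ID means the icon already exists; reuse the delivery URL either way.
  if (!result.success && !JSON.stringify(result.errors).includes("already exists")) {
    throw new Error(`Cloudflare Images upload failed: ${JSON.stringify(result.errors)}`);
  }
  return `https://imagedelivery.net/${env.CF_IMAGES_ACCOUNT_HASH}/${imageId}/public`;
}
```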

ClickHouse

  • Writes append-only rows to keyword_snapshots and monthly_keyword_searches.
  • Reference tables (customers, categories, customer_categories, category_keywords, keywords) mirror Base44 structure for analytics.
  • src/lib/clickhouse.js uses HTTP JSONEachRow batches; /test/clickhouse validates connectivity and schema presence.
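
A minimal JSONEachRow batch insert over the ClickHouse HTTP interface; the connection variable names here are assumptions, and the real client is src/lib/clickhouse.js:

```js
// Illustrative append-only batch insert via ClickHouse's HTTP interface.
async function insertRows(table, rows, env) {
  if (!rows.length) return;
  const body = rows.map((row) => JSON.stringify(row)).join("\n");
  const query = encodeURIComponent(`INSERT INTO ${table} FORMAT JSONEachRow`);

  const response = await fetch(`${env.CLICKHOUSE_URL}/?query=${query}`, {
    method: "POST",
    headers: {
      "Content-Type": "text/plain",
      Authorization: `Basic ${btoa(`${env.CLICKHOUSE_USER}:${env.CLICKHOUSE_PASSWORD}`)}`,
    },
    body,
  });
  if (!response.ok) throw new Error(`ClickHouse insert failed: ${await response.text()}`);
}

// e.g. await insertRows("keyword_snapshots", snapshotBatch, env);
```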

Storage Responsibilities

Store | Worker writes | Worker reads | Notes
--- | --- | --- | ---
Base44 | Keywords, keyword-category links, business type assignments | Validation, existing entities | React treats Base44 as the canonical source of entities.
ClickHouse | Snapshots & monthly aggregates | Diagnostics, reporting checks | Append-only fact tables for analytics.
KV (DFS_*) | Budget, idempotency, run state, confirmation metadata | Status responses, queue context | Keeps lightweight metadata near the worker.
R2 (DFS_RAW_PAYLOADS) | Raw HTML payloads | Replay/debug only | Stored by run ID for later inspection.

Diagnostics & Tooling

  • GET /admin/business-types – surfaces Base44 types for UI.
  • GET /diagnostics/dataforseo-category – resolves taxonomy IDs (?id=<number>).
  • GET /diagnostics/dataforseo-category/fallback – lists bundled taxonomy IDs.
  • GET /diagnostics/keyword – fetches a canonical keyword from Base44.
  • GET /diagnostics/run/{runId} – full run state (enrichment, customization, harvest).
  • GET /test/clickhouse – schema + connectivity health check.

Implementation References

  • Worker entrypoint: src/index.js
  • Enrichment helpers: src/lib/enrich.js
  • Harvest pipeline: src/lib/harvest.js
  • Persistence helpers: src/lib/base44-client.js, src/lib/clickhouse.js
  • Queue glue: src/lib/sync.js, src/lib/persist.js
  • App details consumer: src/queue/app-details-consumer.js
  • iTunes API client: src/lib/itunes-api.js
  • Cloudflare Images: src/lib/cloudflare-images.js
  • Apple page parser: src/lib/parse-app-page.js

Apple App Store Scraping

src/queue/app-details-consumer.js implements a tiered strategy for fetching Apple app data:

Strategy Order

  1. HTML Scrape (FREE) - Direct fetch to apps.apple.com
  2. iTunes API (FREE) - Always runs for authoritative metadata
  3. DataForSEO (PAID) - Last resort if both free sources fail
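
A sketch of the tiered fallback; helper names are placeholders, and the ordering mirrors the list above:

```js
// Illustrative tiered fetch: free HTML scrape, then iTunes, then paid DataForSEO.
async function fetchAppDetails(appId, env) {
  let details = null;

  try {
    details = await scrapeAppStorePage(appId, env); // 1. free HTML scrape of apps.apple.com
  } catch (err) {
    console.warn(`HTML scrape failed for ${appId}: ${err.message}`);
  }

  // 2. iTunes always runs so authoritative fields win over scraped ones.
  const itunes = await lookupApps([appId]);
  if (itunes.length) details = { ...details, ...pickAuthoritativeFields(itunes[0]) };

  // 3. Paid fallback only when both free sources produced nothing.
  if (!details) details = await fetchFromDataForSeo(appId, env);

  return details;
}
```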

HTML Scrape Best Practices

  • Desktop Safari UAs only - Mobile UAs cause itms-appss:// redirect loops
  • Randomized headers - Header order shuffled to avoid fingerprinting
  • Jittered delays - 300-500ms base delay with 0-200ms random jitter
  • Rate limit backoff - 10s sleep on 403/429 responses
  • Sequential processing - CF Workers limited to ~6 concurrent connections
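
An illustrative pacing helper combining the practices above; the UA string and exact delays are examples, not the production values:

```js
// Illustrative polite fetch: desktop Safari UA, jittered delay, backoff on 403/429.
const DESKTOP_SAFARI_UAS = [
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
  // ...more desktop Safari variants
];

async function politeFetch(url) {
  // 300-500ms base delay plus 0-200ms random jitter between sequential requests.
  const delay = 300 + Math.random() * 200 + Math.random() * 200;
  await new Promise((r) => setTimeout(r, delay));

  const response = await fetch(url, {
    headers: {
      "User-Agent": DESKTOP_SAFARI_UAS[Math.floor(Math.random() * DESKTOP_SAFARI_UAS.length)],
      "Accept-Language": "en-US,en;q=0.9",
    },
  });

  if (response.status === 403 || response.status === 429) {
    await new Promise((r) => setTimeout(r, 10_000)); // back off for 10s on rate limiting
  }
  return response;
}
```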

Data Source Mapping

Field | HTML Scrape | iTunes API | RSS Feed
--- | --- | --- | ---
similar_apps | Yes | No | No
more_apps_by_developer | Yes | No | No
primary_category | Yes (authoritative) | Partial | Yes
rating | Yes | Yes (authoritative) | No
rating_count | Yes | Yes (authoritative) | No
release_date | No | Yes (authoritative) | Yes
version | Yes | Yes | No
size_bytes | No | Yes | No
support_url | Yes | No | No
privacy_url | Yes | No | No

Queue Consumer Files

  • src/queue/app-details-consumer.js - Main app enrichment (HTML + iTunes + DataForSEO)
  • src/queue/app-crawl-consumer.js - Category crawling via RSS feeds
  • src/queue/shelf-crawl-consumer.js - Featured shelf crawling

See also:

  • docs/reference/flows.md for sequence diagrams.
  • docs/reference/endpoints.md for endpoint contracts.
  • docs/reference/schema.md for ClickHouse table details.