Backend Implementation Guide
This file replaces the earlier backend handoffs (BACKEND_IMPLEMENTATION.md, REQUIREMENTS.md, SCHEMA_*). It captures how the worker orchestrates enrichment, harvest, and persistence across Base44, DataForSEO, and ClickHouse.
Summary
- Cloudflare Worker entrypoints: POST /run, POST /run/{id}/confirm-categories, GET /run/{id}/status, diagnostics, and queue consumers.
- State lives in KV (DFS_* namespaces) and R2 (dfs-raw-payloads).
- Harvest processing runs in the harvest_keywords queue consumer and ultimately writes to Base44 + ClickHouse.
- Reference diagrams live in docs/reference/flows.md; schemas in docs/reference/schema.md; endpoint details in docs/reference/endpoints.md.
Request Lifecycle
Phase 1 – Enrichment (POST /run)
- Check budget + idempotency in KV (DFS_BUDGETS, DFS_IDEMPOTENCY).
- Fetch HTML (regex scraper first, fallback to DataForSEO Instant Pages) and persist raw payload to R2.
- Run LLM enrichment (src/lib/enrich.js) to classify business_type, business_focus, recommend high_level_categories, produce seed keywords and embedding.
- Persist run state in KV (DFS_RUNS) and respond with 202 { status: "awaiting_category_confirmation" }.
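The tail of Phase 1 (idempotency check, run-state write, 202 response) could look roughly like the sketch below. It is not the code in src/index.js: the in-memory KV stand-in, the startRun name, the run ID format, and the run_id field in the response body are all assumptions made for illustration, and the budget check, HTML fetch, and LLM enrichment are left out.

```javascript
// Minimal in-memory stand-in for a Workers KV namespace (get/put only).
function createMemoryKV() {
  const store = new Map();
  return {
    async get(key) {
      return store.has(key) ? store.get(key) : null;
    },
    async put(key, value) {
      store.set(key, value);
    },
  };
}

// Hypothetical Phase 1 tail: dedupe on an idempotency key, persist run state,
// and build the 202 response. run_id in the body is an assumption.
async function startRun(env, { idempotencyKey, url, enrichment }) {
  const existingRunId = await env.DFS_IDEMPOTENCY.get(idempotencyKey);
  if (existingRunId) {
    // Replay of the same request: return the stored run, do not enrich again.
    const stored = JSON.parse(await env.DFS_RUNS.get(existingRunId));
    return { httpStatus: 202, body: { run_id: existingRunId, status: stored.status } };
  }

  // Assumed run ID format; any unique string would do.
  const runId = `run_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`;
  const state = { status: "awaiting_category_confirmation", url, enrichment };
  await env.DFS_RUNS.put(runId, JSON.stringify(state));
  await env.DFS_IDEMPOTENCY.put(idempotencyKey, runId);
  return { httpStatus: 202, body: { run_id: runId, status: state.status } };
}
```

Writing the run state before the idempotency record means a crash between the two writes leaves an orphaned run rather than an idempotency key that points at nothing.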
Phase 2 – User Customization (POST /run/{id}/confirm-categories)
- Fetch stored enrichment payload and merge user form input.
- Apply type-specific validation (see table below).
- Write user_customization back to KV.
- Enqueue harvest_keywords with run context; respond 202 { status: "harvesting_keywords" }.
Phase 3 – Harvest & Persistence (queue consumer)
- Pull enrichment + customization from KV.
- Call DataForSEO Labs keywords_for_site for domain seed keywords.
- Generate AI keywords per confirmed category (src/lib/harvest.js), annotate intents, sources, and brand flags.
- Merge domain and AI keywords by normalized form; attach category paths and metrics.
- Upsert keywords and relationships in Base44 (src/lib/base44-client.js).
- Batch ClickHouse inserts (keyword_snapshots, monthly_keyword_searches) via src/lib/clickhouse.js.
- Update run status in KV (harvest payload, errors) for /status consumers.
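The "merge by normalized form" step can be illustrated with a small sketch. The normalization rule (lowercase, trim, collapse whitespace), the input shape, and the "domain"/"ai" source labels are assumptions; the actual logic in src/lib/harvest.js may differ.

```javascript
// Assumed normalization: lowercase, trim, collapse internal whitespace.
function normalizeKeyword(text) {
  return text.toLowerCase().trim().replace(/\s+/g, " ");
}

// Merge two keyword lists, keyed by normalized text. The first occurrence
// supplies original_keyword_text; later duplicates only fill missing fields.
function mergeKeywords(domainKeywords, aiKeywords) {
  const merged = new Map();

  const add = (kw, source) => {
    const key = normalizeKeyword(kw.keyword);
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, {
        keyword: key,
        original_keyword_text: kw.keyword,
        sources: [source],
        metrics: kw.metrics ?? null,
        primary_intent: kw.primary_intent ?? null,
      });
      return;
    }
    if (!existing.sources.includes(source)) existing.sources.push(source);
    existing.metrics = existing.metrics ?? kw.metrics ?? null;
    existing.primary_intent = existing.primary_intent ?? kw.primary_intent ?? null;
  };

  domainKeywords.forEach((kw) => add(kw, "domain"));
  aiKeywords.forEach((kw) => add(kw, "ai"));
  return [...merged.values()];
}
```

Processing domain keywords first means their metrics win when both lists contain the same keyword, which suits data coming from DataForSEO Labs; the AI entry only contributes fields the domain entry lacks.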
Status Polling (GET /run/{id}/status)
- Reads run metadata from KV.
- Surfaces status, enrichment, user_customization, harvest, and any errors.
- Terminal states: complete, failed. Intermediate: queued, awaiting_category_confirmation, harvesting_keywords.
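A client polling this endpoint needs to tell terminal states from intermediate ones. The helper below is a hypothetical client-side sketch: the function name and the returned action labels are assumptions, and only the status values come from the list above.

```javascript
const TERMINAL_STATUSES = new Set(["complete", "failed"]);
const INTERMEDIATE_STATUSES = new Set([
  "queued",
  "awaiting_category_confirmation",
  "harvesting_keywords",
]);

// Decide what a polling client should do with a /status response.
function nextPollAction(run) {
  if (TERMINAL_STATUSES.has(run.status)) return "stop";
  // The run will not advance until POST /run/{id}/confirm-categories is called.
  if (run.status === "awaiting_category_confirmation") return "prompt_user";
  if (INTERMEDIATE_STATUSES.has(run.status)) return "poll_again";
  throw new Error(`Unknown run status: ${run.status}`);
}
```

Treating awaiting_category_confirmation separately avoids polling indefinitely on a run that is blocked on user input.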
Validation Rules
| Business type | Requirement when confirming | Error |
|---|---|---|
| local | At least one location (≤ 5 recommended) | 400 "Local business type requires at least one location" |
| saas, game, marketplace | App names recommended (warn if empty) | Warn only |
| Others (ecommerce, service, content) | No extra fields | None |
The worker also enforces category presence and trusts Base44 slugs for consistency.
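As a rough illustration of these rules, a validator might look like the sketch below. The input field names (categories, locations, app_names), the return shape, and the wording of the category error and the warnings are assumptions; only the local-business error string is taken from the table.

```javascript
const APP_NAME_TYPES = new Set(["saas", "game", "marketplace"]);

// Hypothetical validator for POST /run/{id}/confirm-categories input.
function validateConfirmation(businessType, input) {
  const warnings = [];

  // Category presence is enforced for every business type.
  if ((input.categories ?? []).length === 0) {
    return { ok: false, status: 400, error: "At least one category is required", warnings };
  }

  if (businessType === "local") {
    const locations = input.locations ?? [];
    if (locations.length === 0) {
      return {
        ok: false,
        status: 400,
        error: "Local business type requires at least one location",
        warnings,
      };
    }
    if (locations.length > 5) {
      warnings.push("More than 5 locations supplied; 5 or fewer is recommended");
    }
  }

  // Missing app names only produce a warning, never a rejection.
  if (APP_NAME_TYPES.has(businessType) && (input.app_names ?? []).length === 0) {
    warnings.push("App names are recommended for this business type");
  }

  return { ok: true, status: 202, warnings };
}
```

For example, validateConfirmation("saas", { categories: ["crm"] }) passes with one warning, while validateConfirmation("local", { categories: ["plumbing"] }) is rejected with a 400.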
External Integrations
Base44
- Entities touched: Keyword, KeywordCategory, ProjectKeyword, BusinessType.
- Keyword payload augments fields: sources, original_keyword_text, primary_intent, secondary_intents, brand_flag, dataforseo_category_paths, latest_trend.
- Relationships store confidence, assigned_by, and merge existing user assignments.
DataForSEO
- Instant Pages pulls JS-rendered HTML when the regex fetch is insufficient.
- Labs keywords_for_site seeds domain-specific keywords prior to AI generation.
- App Store API provides app metadata as fallback when free sources fail.
- Credentials live in Wrangler secrets (DATAFORSEO_LOGIN, DATAFORSEO_PASSWORD); rate caps controlled via DATAFORSEO_LABS_LIMIT, DATAFORSEO_LABS_MAX_REQUESTS.
- Category taxonomy cached in KV (DATAFORSEO_CATEGORIES) with JSON fallback in data/dataforseo-categories.json.
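A minimal sketch of how the secrets and the request cap might be used follows. DataForSEO authenticates with HTTP Basic auth, so the header is built from the two secrets. The per-run counter is an assumption: this guide does not define the exact semantics of DATAFORSEO_LABS_MAX_REQUESTS, and both helper names are hypothetical.

```javascript
// Build the HTTP Basic Authorization header from the Wrangler secrets.
function dataForSeoAuthHeader(env) {
  const credentials = `${env.DATAFORSEO_LOGIN}:${env.DATAFORSEO_PASSWORD}`;
  return `Basic ${btoa(credentials)}`;
}

// Assumed semantics: DATAFORSEO_LABS_MAX_REQUESTS caps Labs calls per run.
function createRequestBudget(env) {
  const max = Number(env.DATAFORSEO_LABS_MAX_REQUESTS ?? 1);
  let used = 0;
  return {
    // Returns true and counts the call if the cap has not been reached.
    tryConsume() {
      if (used >= max) return false;
      used += 1;
      return true;
    },
    remaining() {
      return max - used;
    },
  };
}
```

Checking tryConsume() before each paid call keeps a misbehaving run from exceeding its cap even if the harvest loop retries.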
iTunes API
- Free API at https://itunes.apple.com/lookup?id={app_id}&country=us
- Authoritative source for: rating, rating_count, release_date, version, size_bytes, description
- Rate limited to 5 req/sec with 200ms minimum delay between requests
- Falls back to ZenRows proxy if direct calls are blocked (Apple sometimes blocks CF Worker IPs)
- Batch endpoint supports up to 200 app IDs per request
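The batching and pacing rules can be sketched as follows. This is not the code in src/lib/itunes-api.js: the function names are hypothetical, passing multiple IDs as a comma-separated id parameter is an assumption about how the batch lookup is called, and the ZenRows fallback is omitted.

```javascript
const ITUNES_LOOKUP_URL = "https://itunes.apple.com/lookup";
const MAX_IDS_PER_REQUEST = 200; // batch limit stated above
const MIN_DELAY_MS = 200; // 200ms spacing keeps calls at or under 5 req/sec

// Split app IDs into chunks of at most 200 and build one lookup URL per chunk.
function buildLookupUrls(appIds, country = "us") {
  const urls = [];
  for (let i = 0; i < appIds.length; i += MAX_IDS_PER_REQUEST) {
    const chunk = appIds.slice(i, i + MAX_IDS_PER_REQUEST);
    urls.push(`${ITUNES_LOOKUP_URL}?id=${chunk.join(",")}&country=${encodeURIComponent(country)}`);
  }
  return urls;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch every chunk sequentially, waiting MIN_DELAY_MS between requests.
async function lookupApps(appIds, fetchImpl = fetch) {
  const results = [];
  const urls = buildLookupUrls(appIds);
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(MIN_DELAY_MS);
    const response = await fetchImpl(urls[i]);
    if (!response.ok) throw new Error(`iTunes lookup failed: ${response.status}`);
    const payload = await response.json();
    results.push(...payload.results);
  }
  return results;
}
```

Passing fetchImpl as a parameter makes it straightforward to swap in a proxy-backed fetch when direct calls are blocked.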
Cloudflare Images
- Stores app icons to avoid hotlinking Apple/Google CDN URLs
- Upload endpoint: https://api.cloudflare.com/client/v4/accounts/{account_id}/images/v1
- Delivery URL format: https://imagedelivery.net/{account_hash}/{image_id}/public
- Icons stored with ID {platform}_{app_id} for deduplication (existing icons are reused)
- Secrets: CF_IMAGES_ACCOUNT_ID, CF_IMAGES_API_TOKEN, CF_IMAGES_ACCOUNT_HASH
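The ID and URL conventions above lend themselves to a small planning helper, sketched here under assumptions: the function name, the existingIds set, and the returned shape are hypothetical, and the id/url form fields for upload-by-URL should be checked against the Cloudflare Images API documentation before use.

```javascript
// Decide whether an icon can be reused or must be uploaded, and describe
// the upload request without sending it.
function planIconUpload({ platform, appId, sourceUrl, existingIds, env }) {
  // Deterministic ID so the same app icon is never stored twice.
  const imageId = `${platform}_${appId}`;
  const deliveryUrl = `https://imagedelivery.net/${env.CF_IMAGES_ACCOUNT_HASH}/${imageId}/public`;

  if (existingIds.has(imageId)) {
    return { action: "reuse", imageId, deliveryUrl };
  }

  return {
    action: "upload",
    imageId,
    deliveryUrl,
    request: {
      method: "POST",
      url: `https://api.cloudflare.com/client/v4/accounts/${env.CF_IMAGES_ACCOUNT_ID}/images/v1`,
      headers: { Authorization: `Bearer ${env.CF_IMAGES_API_TOKEN}` },
      // Assumed multipart form fields for upload-by-URL with a custom ID.
      formFields: { id: imageId, url: sourceUrl },
    },
  };
}
```

Because the delivery URL is derived purely from the account hash and image ID, it can be written to the app record before the upload finishes.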
ClickHouse
- Writes append-only rows to keyword_snapshots and monthly_keyword_searches.
- Reference tables (customers, categories, customer_categories, category_keywords, keywords) mirror Base44 structure for analytics.
- src/lib/clickhouse.js uses HTTP JSONEachRow batches; /test/clickhouse validates connectivity and schema presence.
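A JSONEachRow batch over the ClickHouse HTTP interface is a POST whose query names the table and format and whose body carries one JSON object per line. The sketch below shows that shape. It is not the code in src/lib/clickhouse.js: the helper names, base URL, batch size, and row columns are assumptions.

```javascript
// Split rows into batches of at most batchSize.
function chunkRows(rows, batchSize) {
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize));
  }
  return batches;
}

// Describe one INSERT request: the query goes in the URL, rows in the body.
function buildInsertRequest(baseUrl, table, rows) {
  // Guard against injecting SQL through the table name.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(table)) {
    throw new Error(`Invalid table name: ${table}`);
  }
  const query = `INSERT INTO ${table} FORMAT JSONEachRow`;
  return {
    method: "POST",
    url: `${baseUrl}/?query=${encodeURIComponent(query)}`,
    // JSONEachRow: newline-delimited JSON objects, one per row.
    body: rows.map((row) => JSON.stringify(row)).join("\n"),
  };
}
```

A caller would typically run chunkRows(rows, 1000).map((batch) => buildInsertRequest(url, "keyword_snapshots", batch)) and send the requests one at a time, so a failed batch can be retried without resending the others.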
Storage Responsibilities
| Store | Worker writes | Worker reads | Notes |
|---|---|---|---|
| Base44 | Keywords, keyword-category links, business type assignments | Validation, existing entities | React treats Base44 as canonical source of entities. |
| ClickHouse | Snapshots & monthly aggregates | Diagnostics, reporting checks | Append-only fact tables for analytics. |
| KV (DFS_*) | Budget, idempotency, run state, confirmation metadata | Status responses, queue context | Keeps lightweight metadata near the worker. |
| R2 (DFS_RAW_PAYLOADS) | Raw HTML payloads | Replay/debug only | Stored by run ID for later inspection. |
Diagnostics & Tooling
- GET /admin/business-types – surfaces Base44 types for UI.
- GET /diagnostics/dataforseo-category – resolves taxonomy IDs (?id=<number>).
- GET /diagnostics/dataforseo-category/fallback – lists bundled taxonomy IDs.
- GET /diagnostics/keyword – fetches a canonical keyword from Base44.
- GET /diagnostics/run/{runId} – full run state (enrichment, customization, harvest).
- GET /test/clickhouse – schema + connectivity health check.
Implementation References
- Worker entrypoint: src/index.js
- Enrichment helpers: src/lib/enrich.js
- Harvest pipeline: src/lib/harvest.js
- Persistence helpers: src/lib/base44-client.js, src/lib/clickhouse.js
- Queue glue: src/lib/sync.js, src/lib/persist.js
- App details consumer: src/queue/app-details-consumer.js
- iTunes API client: src/lib/itunes-api.js
- Cloudflare Images: src/lib/cloudflare-images.js
- Apple page parser: src/lib/parse-app-page.js
Apple App Store Scraping
The app-details-consumer.js implements a tiered strategy for fetching Apple app data:
Strategy Order
- HTML Scrape (FREE) - Direct fetch to apps.apple.com
- iTunes API (FREE) - Always runs for authoritative metadata
- DataForSEO (PAID) - Last resort if both free sources fail
HTML Scrape Best Practices
- Desktop Safari UAs only - Mobile UAs cause itms-appss:// redirect loops
- Randomized headers - Header order shuffled to avoid fingerprinting
- Jittered delays - 300-500ms base delay with 0-200ms random jitter
- Rate limit backoff - 10s sleep on 403/429 responses
- Sequential processing - CF Workers limited to ~6 concurrent connections
Data Source Mapping
| Field | HTML Scrape | iTunes API | RSS Feed |
|---|---|---|---|
| similar_apps | Yes | No | No |
| more_apps_by_developer | Yes | No | No |
| primary_category | Yes (authoritative) | Partial | Yes |
| rating | Yes | Yes (authoritative) | No |
| rating_count | Yes | Yes (authoritative) | No |
| release_date | No | Yes (authoritative) | Yes |
| version | Yes | Yes | No |
| size_bytes | No | Yes | No |
| support_url | Yes | No | No |
| privacy_url | Yes | No | No |
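One way to apply this mapping is a per-field precedence list, sketched below. The source keys (html, itunes, rss), the function name, and the fallback order after the authoritative source are assumptions. For version the table names no authoritative source, so the sketch follows the iTunes API section, which lists version among the fields iTunes is authoritative for.

```javascript
// Per-field source order: the first source with a value wins.
const FIELD_PRECEDENCE = {
  similar_apps: ["html"],
  more_apps_by_developer: ["html"],
  primary_category: ["html", "rss", "itunes"], // order after html is assumed
  rating: ["itunes", "html"],
  rating_count: ["itunes", "html"],
  release_date: ["itunes", "rss"],
  version: ["itunes", "html"],
  size_bytes: ["itunes"],
  support_url: ["html"],
  privacy_url: ["html"],
};

// Merge per-source app details into one record using FIELD_PRECEDENCE.
function mergeAppDetails(sources) {
  const merged = {};
  for (const [field, order] of Object.entries(FIELD_PRECEDENCE)) {
    for (const name of order) {
      const value = sources[name]?.[field];
      if (value !== undefined && value !== null) {
        merged[field] = value;
        break;
      }
    }
  }
  return merged;
}
```

Fields that no source provides are left off the merged record rather than set to null, so a later DataForSEO fallback can detect what is still missing.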
Queue Consumer Files
- src/queue/app-details-consumer.js - Main app enrichment (HTML + iTunes + DataForSEO)
- src/queue/app-crawl-consumer.js - Category crawling via RSS feeds
- src/queue/shelf-crawl-consumer.js - Featured shelf crawling
See also:
- docs/reference/flows.md for sequence diagrams.
- docs/reference/endpoints.md for endpoint contracts.
- docs/reference/schema.md for ClickHouse table details.