Backend Implementation Guide

This file replaces the earlier backend handoffs (BACKEND_IMPLEMENTATION.md, REQUIREMENTS.md, SCHEMA_*). It captures how the worker orchestrates enrichment, harvest, and persistence across Base44, DataForSEO, and ClickHouse.

Summary

  • Cloudflare Worker entrypoints: POST /run, POST /run/{id}/confirm-categories, GET /run/{id}/status, diagnostics, and queue consumers.
  • State lives in KV (DFS_* namespaces) and R2 (dfs-raw-payloads).
  • Harvest processing runs in the harvest_keywords queue consumer and ultimately writes to Base44 + ClickHouse.
  • Reference diagrams live in docs/reference/flows.md; schemas in docs/reference/schema.md; endpoint details in docs/reference/endpoints.md.

Request Lifecycle

Phase 1 – Enrichment (POST /run)

  1. Check budget + idempotency in KV (DFS_BUDGETS, DFS_IDEMPOTENCY).
  2. Fetch HTML (regex scraper first, fallback to DataForSEO Instant Pages) and persist raw payload to R2.
  3. Run LLM enrichment (src/lib/enrich.js) to classify business_type and business_focus, recommend high_level_categories, and produce seed keywords plus an embedding.
  4. Persist run state in KV (DFS_RUNS) and respond with 202 { status: "awaiting_category_confirmation" }.
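
A minimal sketch of steps 1 and 4, assuming illustrative KV key names, a hypothetical budget cap, and a simplified response shape (the real handler lives in src/index.js):

```js
// Illustrative POST /run gate and acknowledgement; key names, the budget cap,
// and the stored run shape are assumptions, not the production code.
const MONTHLY_BUDGET_CAP = 100; // hypothetical per-project run cap

async function handleRun(request, env) {
  const body = await request.json();

  // Step 1: idempotency + budget checks in KV.
  const prior = await env.DFS_IDEMPOTENCY.get(`idem:${body.idempotency_key}`);
  if (prior) return Response.json(JSON.parse(prior), { status: 202 });

  const spent = Number(await env.DFS_BUDGETS.get(`budget:${body.project_id}`)) || 0;
  if (spent >= MONTHLY_BUDGET_CAP) {
    return Response.json({ error: "budget_exceeded" }, { status: 429 });
  }

  // Steps 2-3 (HTML fetch, R2 persistence, LLM enrichment) omitted here.

  // Step 4: persist run state and acknowledge.
  const runId = crypto.randomUUID();
  const reply = { run_id: runId, status: "awaiting_category_confirmation" };
  await env.DFS_RUNS.put(`run:${runId}`, JSON.stringify({ status: reply.status, enrichment: {} }));
  await env.DFS_IDEMPOTENCY.put(`idem:${body.idempotency_key}`, JSON.stringify(reply), { expirationTtl: 86400 });
  return Response.json(reply, { status: 202 });
}
```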

Phase 2 – User Customization (POST /run/{id}/confirm-categories)

  1. Fetch stored enrichment payload and merge user form input.
  2. Apply type-specific validation (see table below).
  3. Write user_customization back to KV.
  4. Enqueue harvest_keywords with run context; respond 202 { status: "harvesting_keywords" }.
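
A sketch of the confirmation handler, assuming a hypothetical HARVEST_QUEUE producer binding and message shape; validateCustomization is sketched under Validation Rules below:

```js
// Illustrative confirm-categories handoff; binding and field names are assumptions.
async function confirmCategories(runId, formInput, env) {
  const raw = await env.DFS_RUNS.get(`run:${runId}`);
  if (!raw) return Response.json({ error: "run_not_found" }, { status: 404 });

  const run = JSON.parse(raw);
  // Apply the type-specific rules before accepting the customization.
  const customization = validateCustomization(run.enrichment, formInput);
  run.user_customization = customization;
  run.status = "harvesting_keywords";
  await env.DFS_RUNS.put(`run:${runId}`, JSON.stringify(run));

  // Hand the run context to the harvest_keywords queue consumer.
  await env.HARVEST_QUEUE.send({ runId, customization });
  return Response.json({ run_id: runId, status: "harvesting_keywords" }, { status: 202 });
}
```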

Phase 3 – Harvest & Persistence (queue consumer)

  1. Pull enrichment + customization from KV.
  2. Call DataForSEO Labs keywords_for_site for domain seed keywords.
  3. Generate AI keywords per confirmed category (src/lib/harvest.js), annotate intents, sources, and brand flags.
  4. Merge domain and AI keywords by normalized form; attach category paths and metrics.
  5. Upsert keywords and relationships in Base44 (src/lib/base44-client.js).
  6. Batch ClickHouse inserts (keyword_snapshots, monthly_keyword_searches) via src/lib/clickhouse.js.
  7. Update run status in KV (harvest payload, errors) for /status consumers.
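
A sketch of the merge in step 4, keyed on a normalized form of the keyword text; field names beyond those listed under Base44 below (e.g. search_volume) are assumptions:

```js
// Illustrative merge of domain-seeded and AI-generated keywords by normalized form.
function mergeKeywords(domainKeywords, aiKeywords) {
  const normalize = (text) => text.trim().toLowerCase().replace(/\s+/g, " ");
  const merged = new Map();

  for (const kw of [...domainKeywords, ...aiKeywords]) {
    const key = normalize(kw.keyword);
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, { ...kw, keyword: key, sources: [...(kw.sources ?? [])] });
      continue;
    }
    // Same keyword from both paths: union sources and category paths, keep metrics.
    existing.sources = [...new Set([...existing.sources, ...(kw.sources ?? [])])];
    existing.dataforseo_category_paths = [
      ...new Set([...(existing.dataforseo_category_paths ?? []), ...(kw.dataforseo_category_paths ?? [])]),
    ];
    existing.search_volume ??= kw.search_volume;
  }
  return [...merged.values()];
}
```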

Status Polling (GET /run/{id}/status)

  • Reads run metadata from KV.
  • Surfaces status, enrichment, user_customization, harvest, and any errors.
  • Terminal states: complete, failed. Intermediate: queued, awaiting_category_confirmation, harvesting_keywords.
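
A representative (not authoritative) status payload; see docs/reference/endpoints.md for the actual contract:

```js
// Illustrative /run/{id}/status response; values are placeholders.
const exampleStatus = {
  run_id: "9f3c2e6a-0000-0000-0000-000000000000",
  status: "harvesting_keywords", // queued | awaiting_category_confirmation | harvesting_keywords | complete | failed
  enrichment: { business_type: "saas", business_focus: "project management", high_level_categories: [/* recommended */] },
  user_customization: { confirmed_categories: [/* confirmed */], locations: [], app_names: [] },
  harvest: null, // populated once the queue consumer finishes
  errors: [],
};
```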

Validation Rules

Business type | Requirement when confirming | Error
--- | --- | ---
local | At least one location (≤ 5 recommended) | 400 "Local business type requires at least one location"
saas, game, marketplace | App names recommended (warn if empty) | Warn only
Others (ecommerce, service, content) | No extra fields | None

The worker also enforces category presence and trusts Base44 slugs for consistency.
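
A sketch of these rules; the 400 message for the local case comes from the table above, while helper names and the warning mechanism are illustrative:

```js
// Illustrative type-specific validation for confirm-categories.
class HttpError extends Error {
  constructor(status, message) { super(message); this.status = status; }
}

function validateCustomization(enrichment, input) {
  const warnings = [];

  // Category presence is enforced for every business type.
  if (!input.confirmed_categories?.length) {
    throw new HttpError(400, "At least one category must be confirmed");
  }

  switch (enrichment.business_type) {
    case "local":
      if (!input.locations?.length) {
        throw new HttpError(400, "Local business type requires at least one location");
      }
      break;
    case "saas":
    case "game":
    case "marketplace":
      if (!input.app_names?.length) {
        warnings.push("No app names provided; app-specific keywords will be limited");
      }
      break;
    default:
      break; // ecommerce, service, content: no extra fields required
  }

  return { ...input, warnings };
}
```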

External Integrations

Base44

  • Entities touched: Keyword, KeywordCategory, ProjectKeyword, BusinessType.
  • Keyword payloads are augmented with: sources, original_keyword_text, primary_intent, secondary_intents, brand_flag, dataforseo_category_paths, latest_trend.
  • Relationships store confidence, assigned_by, and merge existing user assignments.
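
An illustrative keyword payload showing the augmented fields; the exact entity shape is owned by Base44 and src/lib/base44-client.js, and the example sources values are assumptions:

```js
// Hypothetical Keyword upsert payload; field names follow the list above.
const keywordPayload = {
  keyword: "project management software",
  original_keyword_text: "Project Management Software",
  sources: ["domain", "ai_category"], // illustrative source labels
  primary_intent: "commercial",
  secondary_intents: ["informational"],
  brand_flag: false,
  dataforseo_category_paths: [/* taxonomy path IDs */],
  latest_trend: null,
};
```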

DataForSEO

  • Instant Pages pulls JS-rendered HTML when the regex fetch is insufficient.
  • Labs keywords_for_site seeds domain-specific keywords prior to AI generation.
  • App Store API provides app metadata as fallback when free sources fail.
  • Credentials live in Wrangler secrets (DATAFORSEO_LOGIN, DATAFORSEO_PASSWORD); rate caps controlled via DATAFORSEO_LABS_LIMIT, DATAFORSEO_LABS_MAX_REQUESTS.
  • Category taxonomy cached in KV (DATAFORSEO_CATEGORIES) with JSON fallback in data/dataforseo-categories.json.
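
A minimal sketch of the keywords_for_site call, following DataForSEO's documented v3 Labs endpoint; request fields should be checked against src/lib/harvest.js:

```js
// Illustrative Labs keywords_for_site request using Wrangler secrets for auth.
async function keywordsForSite(domain, env) {
  const auth = btoa(`${env.DATAFORSEO_LOGIN}:${env.DATAFORSEO_PASSWORD}`);
  const response = await fetch(
    "https://api.dataforseo.com/v3/dataforseo_labs/google/keywords_for_site/live",
    {
      method: "POST",
      headers: { Authorization: `Basic ${auth}`, "Content-Type": "application/json" },
      body: JSON.stringify([
        {
          target: domain,
          location_code: 2840, // United States
          language_code: "en",
          limit: Number(env.DATAFORSEO_LABS_LIMIT) || 1000,
        },
      ]),
    }
  );
  const data = await response.json();
  return data.tasks?.[0]?.result?.[0]?.items ?? [];
}
```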

iTunes API

  • Free API at https://itunes.apple.com/lookup?id={app_id}&country=us
  • Authoritative source for: rating, rating_count, release_date, version, size_bytes, description
  • Rate limited to 5 req/sec with 200ms minimum delay between requests
  • Falls back to ZenRows proxy if direct calls are blocked (Apple sometimes blocks CF Worker IPs)
  • Batch endpoint supports up to 200 app IDs per request
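
A sketch of the lookup with the 200ms pacing described above; field handling and batching are simplified relative to src/lib/itunes-api.js:

```js
// Illustrative iTunes lookup with a minimum gap between requests (~5 req/sec).
const MIN_DELAY_MS = 200;
let lastCall = 0;

async function lookupApps(appIds) {
  const wait = lastCall + MIN_DELAY_MS - Date.now();
  if (wait > 0) await new Promise((r) => setTimeout(r, wait));
  lastCall = Date.now();

  const url = `https://itunes.apple.com/lookup?id=${appIds.join(",")}&country=us`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`iTunes lookup failed: ${response.status}`);
  const { results } = await response.json();
  return results; // carries rating, rating count, release date, version, size, description
}
```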

Cloudflare Images

  • Stores app icons to avoid hotlinking Apple/Google CDN URLs
  • Upload endpoint: https://api.cloudflare.com/client/v4/accounts/{account_id}/images/v1
  • Delivery URL format: https://imagedelivery.net/{account_hash}/{image_id}/public
  • Icons stored with ID {platform}_{app_id} for deduplication (existing icons are reused)
  • Secrets: CF_IMAGES_ACCOUNT_ID, CF_IMAGES_API_TOKEN, CF_IMAGES_ACCOUNT_HASH
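
A sketch of the icon upload using the documented upload-by-URL form fields and the {platform}_{app_id} ID convention; duplicate-ID handling is simplified here:

```js
// Illustrative icon upload to Cloudflare Images with a deterministic image ID.
async function storeAppIcon(platform, appId, iconUrl, env) {
  const imageId = `${platform}_${appId}`;
  const form = new FormData();
  form.append("url", iconUrl); // upload by URL, the worker never proxies the bytes
  form.append("id", imageId);

  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.CF_IMAGES_ACCOUNT_ID}/images/v1`,
    { method: "POST", headers: { Authorization: `Bearer ${env.CF_IMAGES_API_TOKEN}` }, body: form }
  );
  const result = await response.json();

  // A duplicate ID means the icon already exists; reuse the delivery URL either way.
  if (!result.success && !JSON.stringify(result.errors).includes("already exists")) {
    throw new Error(`Cloudflare Images upload failed: ${JSON.stringify(result.errors)}`);
  }
  return `https://imagedelivery.net/${env.CF_IMAGES_ACCOUNT_HASH}/${imageId}/public`;
}
```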

ClickHouse

  • Writes append-only rows to keyword_snapshots and monthly_keyword_searches.
  • Reference tables (customers, categories, customer_categories, category_keywords, keywords) mirror Base44 structure for analytics.
  • src/lib/clickhouse.js uses HTTP JSONEachRow batches; /test/clickhouse validates connectivity and schema presence.
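
A minimal JSONEachRow batch insert over the ClickHouse HTTP interface; the connection variable names here are assumptions, and the real client is src/lib/clickhouse.js:

```js
// Illustrative append-only batch insert via ClickHouse's HTTP interface.
async function insertRows(table, rows, env) {
  if (!rows.length) return;
  const body = rows.map((row) => JSON.stringify(row)).join("\n");
  const query = encodeURIComponent(`INSERT INTO ${table} FORMAT JSONEachRow`);

  const response = await fetch(`${env.CLICKHOUSE_URL}/?query=${query}`, {
    method: "POST",
    headers: {
      "Content-Type": "text/plain",
      Authorization: `Basic ${btoa(`${env.CLICKHOUSE_USER}:${env.CLICKHOUSE_PASSWORD}`)}`,
    },
    body,
  });
  if (!response.ok) throw new Error(`ClickHouse insert failed: ${await response.text()}`);
}

// e.g. await insertRows("keyword_snapshots", snapshotBatch, env);
```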

Storage Responsibilities

Store | Worker writes | Worker reads | Notes
--- | --- | --- | ---
Base44 | Keywords, keyword-category links, business type assignments | Validation, existing entities | React treats Base44 as the canonical source of entities.
ClickHouse | Snapshots & monthly aggregates | Diagnostics, reporting checks | Append-only fact tables for analytics.
KV (DFS_*) | Budget, idempotency, run state, confirmation metadata | Status responses, queue context | Keeps lightweight metadata near the worker.
R2 (DFS_RAW_PAYLOADS) | Raw HTML payloads | Replay/debug only | Stored by run ID for later inspection.

Diagnostics & Tooling

  • GET /admin/business-types – surfaces Base44 types for UI.
  • GET /diagnostics/dataforseo-category – resolves taxonomy IDs (?id=<number>).
  • GET /diagnostics/dataforseo-category/fallback – lists bundled taxonomy IDs.
  • GET /diagnostics/keyword – fetches a canonical keyword from Base44.
  • GET /diagnostics/run/{runId} – full run state (enrichment, customization, harvest).
  • GET /test/clickhouse – schema + connectivity health check.

Implementation References

  • Worker entrypoint: src/index.js
  • Enrichment helpers: src/lib/enrich.js
  • Harvest pipeline: src/lib/harvest.js
  • Persistence helpers: src/lib/base44-client.js, src/lib/clickhouse.js
  • Queue glue: src/lib/sync.js, src/lib/persist.js
  • App details consumer: src/queue/app-details-consumer.js
  • iTunes API client: src/lib/itunes-api.js
  • Cloudflare Images: src/lib/cloudflare-images.js
  • Apple page parser: src/lib/parse-app-page.js

Apple App Store Scraping

src/queue/app-details-consumer.js implements a tiered strategy for fetching Apple app data:

Strategy Order

  1. HTML Scrape (FREE) - Direct fetch to apps.apple.com
  2. iTunes API (FREE) - Always runs for authoritative metadata
  3. DataForSEO (PAID) - Last resort if both free sources fail
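
A sketch of the tiered fallback; helper names are placeholders, and the ordering mirrors the list above:

```js
// Illustrative tiered fetch: free HTML scrape, then iTunes, then paid DataForSEO.
async function fetchAppDetails(appId, env) {
  let details = null;

  try {
    details = await scrapeAppStorePage(appId, env); // 1. free HTML scrape of apps.apple.com
  } catch (err) {
    console.warn(`HTML scrape failed for ${appId}: ${err.message}`);
  }

  // 2. iTunes always runs so authoritative fields win over scraped ones.
  const itunes = await lookupApps([appId]);
  if (itunes.length) details = { ...details, ...pickAuthoritativeFields(itunes[0]) };

  // 3. Paid fallback only when both free sources produced nothing.
  if (!details) details = await fetchFromDataForSeo(appId, env);

  return details;
}
```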

HTML Scrape Best Practices

  • Desktop Safari UAs only - Mobile UAs cause itms-appss:// redirect loops
  • Randomized headers - Header order shuffled to avoid fingerprinting
  • Jittered delays - 300-500ms base delay with 0-200ms random jitter
  • Rate limit backoff - 10s sleep on 403/429 responses
  • Sequential processing - CF Workers limited to ~6 concurrent connections
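
An illustrative pacing helper combining the practices above; the UA string and exact delays are examples, not the production values:

```js
// Illustrative polite fetch: desktop Safari UA, jittered delay, backoff on 403/429.
const DESKTOP_SAFARI_UAS = [
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
  // ...more desktop Safari variants
];

async function politeFetch(url) {
  // 300-500ms base delay plus 0-200ms random jitter between sequential requests.
  const delay = 300 + Math.random() * 200 + Math.random() * 200;
  await new Promise((r) => setTimeout(r, delay));

  const response = await fetch(url, {
    headers: {
      "User-Agent": DESKTOP_SAFARI_UAS[Math.floor(Math.random() * DESKTOP_SAFARI_UAS.length)],
      "Accept-Language": "en-US,en;q=0.9",
    },
  });

  if (response.status === 403 || response.status === 429) {
    await new Promise((r) => setTimeout(r, 10_000)); // back off for 10s on rate limiting
  }
  return response;
}
```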

Data Source Mapping

Field | HTML Scrape | iTunes API | RSS Feed
--- | --- | --- | ---
similar_apps | Yes | No | No
more_apps_by_developer | Yes | No | No
primary_category | Yes (authoritative) | Partial | Yes
rating | Yes | Yes (authoritative) | No
rating_count | Yes | Yes (authoritative) | No
release_date | No | Yes (authoritative) | Yes
version | Yes | Yes | No
size_bytes | No | Yes | No
support_url | Yes | No | No
privacy_url | Yes | No | No

Queue Consumer Files

  • src/queue/app-details-consumer.js - Main app enrichment (HTML + iTunes + DataForSEO)
  • src/queue/app-crawl-consumer.js - Category crawling via RSS feeds
  • src/queue/shelf-crawl-consumer.js - Featured shelf crawling

See also:

  • docs/reference/flows.md for sequence diagrams.
  • docs/reference/endpoints.md for endpoint contracts.
  • docs/reference/schema.md for ClickHouse table details.