Data Architecture & Redundancy
Why Base44 and ClickHouse can coexist without duplicating sources of truth.
Layer Responsibilities
| Layer | Owns | Writes | Reads |
|---|---|---|---|
| React + Base44 | Entities (projects, categories, keywords, business types, relationships) | React + worker upserts entities through Base44 API. | React UI, worker validation. |
| Cloudflare Worker | No authoritative data | Orchestrates enrichment → persistence. | Reads & writes KV/R2, Base44, ClickHouse, DataForSEO. |
| Cloudflare D1 | App store data (apps, rankings, categories, brands) | Worker writes via queue consumers. | API endpoints, React UI. |
| Cloudflare Images | App icons (uploaded from Apple/Google CDNs) | Worker uploads during app enrichment. | Frontend via delivery URLs. |
| ClickHouse | Time-series facts & analytical caches | Worker appends snapshots + monthly aggregates. | Reporting, dashboards, diagnostics. |
The worker never mutates data directly inside React/Base44; it uses authenticated API calls so Base44 remains canonical. ClickHouse never pushes back—everything flows Base44 → ClickHouse.
Base44 (Source of Truth)
- Stores entity definitions with metadata (names, slugs, colors, owners).
- Relationships (
category_keywords,project_keywords) live here, includingconfidence,assigned_by, tags, etc. - Business types, confirmed categories, and user edits are persisted via worker API calls.
- Any edit made in React is immediately authoritative; ClickHouse copies are refreshed but not the master.
ClickHouse (Analytics Store)
- Houses append-only fact tables:
keyword_snapshots– daily/hourly metrics per keyword/category/customer.monthly_keyword_searches– aggregated trend history.
- Mirrors Base44 reference tables (
customers,categories,keywords,category_keywords,customer_categories) strictly for query speed. They are populated via nightly syncs or worker upserts, not user edits. - Denormalization in fact tables is intentional (keyword text, search volume) to keep queries fast; replacement comes from Base44 + DataForSEO.
Worker Glue
- Enrichment – fetch HTML, run LLM, store metadata in KV (not authoritative).
- Harvest – combine DataForSEO + AI generated keywords.
- Persistence – upsert entities in Base44, append facts to ClickHouse, leave R2/KV breadcrumbs for observability.
- App Enrichment – fetch app details via HTML scrape + iTunes API, upload icons to CF Images, save to D1.
- Diagnostics – endpoints expose run state but never override Base44/ClickHouse authority.
Cloudflare Images (Icon Storage)
App icons are stored in Cloudflare Images to avoid hotlinking external CDNs:
| Aspect | Details |
|---|---|
| Why | Apple's mzstatic.com and Google's CDN URLs should not be hotlinked; they may block or rate limit. |
| Upload | src/lib/cloudflare-images.js uploads icons with ID {platform}_{app_id} for deduplication. |
| Delivery | URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public. |
| Storage | icon_url column in apps table stores CF delivery URL (not original CDN URL). |
| Denormalization | app_icon_url in app_category_rankings also stores CF delivery URL for fast queries. |
| Fallback | If CF Images upload fails, original URL is preserved as fallback. |
Cloudflare D1 (App Store Data)
D1 stores all app store related data:
| Table | Purpose |
|---|---|
apps | Canonical app records with full metadata, CF Images icon URLs. |
app_category_rankings | Denormalized ranking snapshots with app metadata for fast queries. |
app_store_categories | Category hierarchy for Apple App Store and Google Play. |
brands | Developer/publisher brands with domain associations. |
app_recommendations | Similar apps and relationships discovered during crawling. |
domains / urls | Normalized URL and domain storage for deduplication. |
local_entities | Google Business Profile listings (CIDs) for local SEO tracking. |
keywords | Global keyword repository with classification data. |
domain_keyword_rankings | Weekly snapshots of what keywords a domain ranks for. |
Brand Architecture
Brands are the central entity that connects apps, domains, websites, local listings, and keyword rankings. The domain is the hub - it's the connective tissue between all data.
BRAND
│
│ primary_domain_id
▼
┌──────────────────────────────────────────────────────────┐
│ DOMAINS │
│ (the hub/pivot) │
│ │
│ spotify.com ←─ This is what connects everything │
└──────────────────────────────────────────────────────────┘
▲ ▲ ▲ ▲
│ │ │ │
┌────┴────┐ ┌─────┴─────┐ ┌─────┴─────┐ ┌────┴────┐
│ URLs │ │ APPS │ │ RANKINGS │ │ LOCAL │
│ │ │ │ │ │ │ENTITIES │
│domain_id│ │website_ │ │domain_id │ │website_ │
│ │ │domain_id │ │ │ │url→ │
└─────────┘ └───────────┘ └───────────┘ └─────────┘
Brand Table Schema
| Column | Type | Purpose |
|---|---|---|
id | INTEGER | Auto-increment primary key |
name | TEXT UNIQUE | Developer/company name (exact match) |
normalized_name | TEXT | Lowercase, suffix-stripped for fuzzy matching |
primary_domain | TEXT | Denormalized domain string for display |
primary_domain_id | INTEGER | FK to domains.id |
logo_url | TEXT | CF Images URL or original icon |
flagship_app_id | TEXT | Best app for this brand (highest ratings) |
app_count | INTEGER | Denormalized count for quick display |
source | TEXT | How brand was created: auto_crawl, dataforseo_webhook, manual, admin_backfill |
locked | INTEGER | If 1, automation won't overwrite |
Brand Creation Flow
Brands are created automatically during app crawling via src/lib/brand-management.js:
- App Details Consumer (
src/queue/app-details-consumer.js) - Apple/Google app crawls - DataForSEO App Info Webhook (
src/endpoints/dataforseo-app-info-webhook.js) - Postback processing - DataForSEO App List Webhook (
src/endpoints/dataforseo-app-list-webhook.js) - Chart ranking imports - Admin Backfill (
src/endpoints/admin-populate-brands.js) - Bulk populate from existing apps
All flows use ensureBrand() which:
- Checks if brand exists by exact name match
- Creates domain record via
ensureDomain()if website URL provided - Respects
lockedflag - won't update locked brands - Returns
{ id, domain_id }for linking to apps/rankings
Joining to Brands
Brands don't need to be denormalized everywhere. Use domain as the join path:
-- From ranking to brand via domain:
SELECT r.*, b.name as brand_name, b.logo_url
FROM app_category_rankings r
JOIN domains d ON r.website_domain_id = d.id
JOIN brands b ON b.primary_domain_id = d.id
-- From local entity to brand:
SELECT le.*, b.name as brand_name
FROM local_entities le
JOIN domains d ON d.domain = extract_domain(le.website_url)
JOIN brands b ON b.primary_domain_id = d.id
Denormalization Strategy
We denormalize brand_id only on high-volume query tables:
apps.brand_id- Direct FK for fast app→brand lookupsapp_category_rankings.brand_id- Avoid joins on ranking queries
Everything else can join through the domain table.
Single Source of Truth Principle
- Keywords are global in Base44/ClickHouse; removing
customer_idfrom the master keyword table prevents duplicates when multiple tenants track the same term. - Ownership is expressed through join tables (
customer_categories,category_keywords, etc.). - Fact data (positions, searches, CPC) lives only in ClickHouse.
- Entity mutations flow
React → Base44 → Worker (for read)and optionallyWorker → ClickHousefor metrics; there is no circular dependency.
Data Flow Examples
User Creates a Category in React
- React POSTs to Base44 → new
category_id. - Worker later syncs reference tables (or receives the category during harvest) → inserts into ClickHouse
categories+customer_categories. - Historical metrics remain untouched.
Harvest Completes
- Worker upserts keywords + relationships in Base44.
- Worker appends rows to
keyword_snapshots/monthly_keyword_searches. - React reads enriched keyword metadata from Base44 and charts trends with ClickHouse queries.
Result: One canonical definition per entity, append-only analytics, and zero redundant masters.
Domain & URL Classification Pipeline
How domains and URLs enter the system, get classified, and connect to brands.
Entry Points
Domains and URLs enter from multiple sources. Each source provides different data:
| Entry Point | Provides Domain | Provides URL | Provides Brand Name | Notes |
|---|---|---|---|---|
| User adds Asset (UI) | Yes (resolved) | No | No | UI calls /api/util/resolve-domain first |
| User adds Competitor (UI) | Yes (resolved) | No | No | Same resolution flow |
| App Store Crawl | Yes (developer_url) | Yes | Yes (developer_name) | Primary brand source |
| Ranked Keywords API | Yes (domain, main_domain) | Yes (url) | Yes (website_name) | Rich data including backlinks_info |
| Backlinks API | Yes (extracted) | Yes (url_from, url_to) | No | Source + target URLs |
Brand Name Resolution
When we need a brand name for a domain:
Domain Resolution Flow (UI Layer)
When a user manually enters a domain, the UI resolves www vs non-www before saving:
Complete Data Flow
Classification Queue Flow (Optimized)
Instead of queueing each URL individually, we batch by domain:
Classification Stages & Costs
The domain classifier uses a 7-stage cost-optimized pipeline - FREE stages run first:
| Stage | Domain | URL (Source) | URL (Target) | Keyword | Cost |
|---|---|---|---|---|---|
| 0. Cache Check | ✓ | ✓ | - | - | FREE |
| 1. Rules Engine | ✓ | ✓ | ✓ | ✓ | FREE |
| 1.5 Google Ads Categories | ✓ | - | - | - | FREE |
| 2. Vectorize | ✓ | ✓ | - | ✓ | FREE |
| 3. Low-Noise Crawl | ✓ | - | - | - | FREE |
| 4. Instant Pages | ✓ | ✓ | - | - | $0.000125 |
| 4.5 Domain Patterns | ✓ | - | - | - | FREE |
| 5. LLM Fallback | ✓ | ✓ | - | - | ~$0.0001 |
Low-Noise Crawl (Stage 3) - The key cost optimization:
- HEAD request + partial GET (Range: 0-8KB)
- Extracts
<head>metadata: title, description, og:type, generator - CMS detection: WordPress, Shopify, Ghost, Wix, etc.
- Handles ~70% of domains without needing Instant Pages
Target URLs are always FREE - rules-only classification, no external calls.
Key Optimizations
- Domain Resolution in UI - User sees resolved domain before saving, backend uses as-is
- Target URLs Inline - No queue hop for FREE classification
- Batch by Domain - One queue message per domain, not per URL
- Domain Cache - Classify domain once, reuse for all its URLs
- website_name → Brand - Auto-populate brands from DataForSEO data
- Domain as Fallback Brand - When no brand name available, use domain itself