Data Architecture & Redundancy
Why Base44 and ClickHouse can coexist without duplicating sources of truth.
Layer Responsibilities
| Layer | Owns | Writes | Reads |
|---|---|---|---|
| React + Base44 | Entities (projects, categories, keywords, business types, relationships) | React and the worker upsert entities through the Base44 API. | React UI, worker validation. |
| Cloudflare Worker | No authoritative data | Orchestrates enrichment → persistence. | Reads & writes KV/R2, Base44, ClickHouse, DataForSEO. |
| Cloudflare D1 | App store data (apps, rankings, categories, brands) | Worker writes via queue consumers. | API endpoints, React UI. |
| Cloudflare Images | App icons (uploaded from Apple/Google CDNs) | Worker uploads during app enrichment. | Frontend via delivery URLs. |
| ClickHouse | Time-series facts & analytical caches | Worker appends snapshots + monthly aggregates. | Reporting, dashboards, diagnostics. |
The worker never mutates data directly inside React/Base44; it uses authenticated API calls so Base44 remains canonical. ClickHouse never pushes back—everything flows Base44 → ClickHouse.
Base44 (Source of Truth)
- Stores entity definitions with metadata (names, slugs, colors, owners).
- Relationships (`category_keywords`, `project_keywords`) live here, including `confidence`, `assigned_by`, tags, etc.
- Business types, confirmed categories, and user edits are persisted via worker API calls.
- Any edit made in React is immediately authoritative; ClickHouse copies are refreshed but not the master.
ClickHouse (Analytics Store)
- Houses append-only fact tables: `keyword_snapshots` (daily/hourly metrics per keyword/category/customer) and `monthly_keyword_searches` (aggregated trend history).
- Mirrors Base44 reference tables (`customers`, `categories`, `keywords`, `category_keywords`, `customer_categories`) strictly for query speed. They are populated via nightly syncs or worker upserts, not user edits.
- Denormalization in fact tables is intentional (keyword text, search volume) to keep queries fast; refreshed values come from Base44 + DataForSEO.
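The append-only write path can be sketched against ClickHouse's HTTP interface using its JSONEachRow input format. The column names below follow the tables above; the helper name and endpoint wiring are illustrative assumptions, not the worker's actual code.

```javascript
// Build a JSONEachRow request body: one JSON object per line.
// This is the input format ClickHouse accepts for streaming inserts.
function snapshotInsertBody(rows) {
  return rows.map((r) => JSON.stringify(r)).join("\n");
}

const body = snapshotInsertBody([
  { keyword: "crm software", category_id: 12, customer_id: 3, position: 4, captured_at: "2024-05-01" },
]);
// The worker would then POST `body` to something like:
//   https://<clickhouse-host>/?query=INSERT INTO keyword_snapshots FORMAT JSONEachRow
// Facts are only ever appended, never updated in place.
```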
Worker Glue
- Enrichment – fetch HTML, run LLM, store metadata in KV (not authoritative).
- Harvest – combine DataForSEO + AI generated keywords.
- Persistence – upsert entities in Base44, append facts to ClickHouse, leave R2/KV breadcrumbs for observability.
- App Enrichment – fetch app details via HTML scrape + iTunes API, upload icons to CF Images, save to D1.
- Diagnostics – endpoints expose run state but never override Base44/ClickHouse authority.
Cloudflare Images (Icon Storage)
App icons are stored in Cloudflare Images to avoid hotlinking external CDNs:
| Aspect | Details |
|---|---|
| Why | Apple's mzstatic.com and Google's CDN URLs should not be hotlinked; they may block or rate limit. |
| Upload | `src/lib/cloudflare-images.js` uploads icons with ID `{platform}_{app_id}` for deduplication. |
| Delivery | URLs in the format `https://imagedelivery.net/{account_hash}/{image_id}/public`. |
| Storage | The `icon_url` column in the `apps` table stores the CF delivery URL (not the original CDN URL). |
| Denormalization | `app_icon_url` in `app_category_rankings` also stores the CF delivery URL for fast queries. |
| Fallback | If the CF Images upload fails, the original URL is preserved as a fallback. |
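The conventions in this table can be captured as a few pure helpers. `iconImageId`, `deliveryUrl`, and `pickIconUrl` are hypothetical names for illustration, not the actual exports of `src/lib/cloudflare-images.js`:

```javascript
// Deterministic image ID: one icon per {platform}_{app_id} pair,
// so re-crawling the same app never creates duplicate images.
function iconImageId(platform, appId) {
  return `${platform}_${appId}`;
}

// Cloudflare Images delivery URL format from the table above.
function deliveryUrl(accountHash, imageId, variant = "public") {
  return `https://imagedelivery.net/${accountHash}/${imageId}/${variant}`;
}

// Fallback rule: keep the original CDN URL only if the upload failed.
function pickIconUrl(uploadSucceeded, accountHash, imageId, originalUrl) {
  return uploadSucceeded ? deliveryUrl(accountHash, imageId) : originalUrl;
}
```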
Cloudflare D1 (App Store Data)
D1 stores all app store related data:
| Table | Purpose |
|---|---|
| `apps` | Canonical app records with full metadata, CF Images icon URLs. |
| `app_category_rankings` | Denormalized ranking snapshots with app metadata for fast queries. |
| `app_store_categories` | Category hierarchy for Apple App Store and Google Play. |
| `brands` | Developer/publisher brands with domain associations. |
| `app_recommendations` | Similar apps and relationships discovered during crawling. |
| `domains` / `urls` | Normalized URL and domain storage for deduplication. |
| `local_entities` | Google Business Profile listings (CIDs) for local SEO tracking. |
| `keywords` | Global keyword repository with classification data. |
| `domain_keyword_rankings` | Weekly snapshots of what keywords a domain ranks for. |
Brand Architecture
Brands are the central entity connecting apps, domains, websites, local listings, and keyword rankings. The domain is the hub: the connective tissue that ties all of this data together.
```
BRAND
  │
  │ primary_domain_id
  ▼
┌──────────────────────────────────────────────────────────┐
│                        DOMAINS                           │
│                     (the hub/pivot)                      │
│                                                          │
│   spotify.com  ←─ This is what connects everything       │
└──────────────────────────────────────────────────────────┘
     ▲             ▲              ▲             ▲
     │             │              │             │
┌────┴────┐  ┌─────┴─────┐  ┌─────┴─────┐  ┌────┴────┐
│  URLs   │  │   APPS    │  │ RANKINGS  │  │  LOCAL  │
│         │  │           │  │           │  │ENTITIES │
│domain_id│  │ website_  │  │ domain_id │  │website_ │
│         │  │ domain_id │  │           │  │  url→   │
└─────────┘  └───────────┘  └───────────┘  └─────────┘
```
Brand Table Schema
| Column | Type | Purpose |
|---|---|---|
| `id` | INTEGER | Auto-increment primary key |
| `name` | TEXT UNIQUE | Developer/company name (exact match) |
| `normalized_name` | TEXT | Lowercase, suffix-stripped for fuzzy matching |
| `primary_domain` | TEXT | Denormalized domain string for display |
| `primary_domain_id` | INTEGER | FK to `domains.id` |
| `logo_url` | TEXT | CF Images URL or original icon |
| `flagship_app_id` | TEXT | Best app for this brand (highest ratings) |
| `app_count` | INTEGER | Denormalized count for quick display |
| `source` | TEXT | How the brand was created: `auto_crawl`, `dataforseo_webhook`, `manual`, `admin_backfill` |
| `locked` | INTEGER | If 1, automation won't overwrite |
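A minimal sketch of how `normalized_name` might be derived, assuming a lowercase/punctuation-strip pass followed by removal of common corporate suffixes. The suffix list and function name are assumptions, not the real implementation:

```javascript
// Corporate suffixes to strip for fuzzy matching ("Spotify AB" ≈ "Spotify").
const CORP_SUFFIXES = [
  "inc", "llc", "ltd", "gmbh", "co", "corp", "corporation",
  "company", "limited", "ab", "sa", "bv",
];

function normalizeBrandName(name) {
  // Lowercase and drop punctuation that varies between listings.
  let n = name.toLowerCase().replace(/[.,'"]/g, "").trim();
  // Repeatedly strip a trailing corporate suffix.
  let changed = true;
  while (changed) {
    changed = false;
    for (const suffix of CORP_SUFFIXES) {
      if (n.endsWith(" " + suffix)) {
        n = n.slice(0, -(suffix.length + 1)).trim();
        changed = true;
      }
    }
  }
  return n;
}
```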
Brand Creation Flow
Brands are created automatically during app crawling via `src/lib/brand-management.js`:
- App Details Consumer (`src/queue/app-details-consumer.js`) - Apple/Google app crawls
- DataForSEO App Info Webhook (`src/endpoints/dataforseo-app-info-webhook.js`) - Postback processing
- DataForSEO App List Webhook (`src/endpoints/dataforseo-app-list-webhook.js`) - Chart ranking imports
- Admin Backfill (`src/endpoints/admin-populate-brands.js`) - Bulk populate from existing apps
All flows use `ensureBrand()`, which:
- Checks if the brand exists by exact name match
- Creates a domain record via `ensureDomain()` if a website URL is provided
- Respects the `locked` flag - won't update locked brands
- Returns `{ id, domain_id }` for linking to apps/rankings
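The `ensureBrand()` contract above can be sketched with an in-memory store standing in for D1. This is an illustrative model of the behavior, not the real `src/lib/brand-management.js`:

```javascript
// In-memory stand-in for the D1 brands/domains tables.
function makeStore() {
  return { brands: new Map(), domains: new Map(), nextId: 1 };
}

function ensureDomain(store, domain) {
  if (!store.domains.has(domain)) store.domains.set(domain, store.nextId++);
  return store.domains.get(domain);
}

function ensureBrand(store, { name, websiteUrl = null, logoUrl = null }) {
  const domainId = websiteUrl
    ? ensureDomain(store, new URL(websiteUrl).hostname)
    : null;
  const existing = store.brands.get(name); // exact name match
  if (existing) {
    // Respect the locked flag: automation never overwrites locked brands.
    if (!existing.locked && logoUrl) existing.logo_url = logoUrl;
    return { id: existing.id, domain_id: existing.primary_domain_id };
  }
  const brand = {
    id: store.nextId++,
    name,
    primary_domain_id: domainId,
    logo_url: logoUrl,
    locked: false,
  };
  store.brands.set(name, brand);
  return { id: brand.id, domain_id: domainId };
}
```

Calling it twice with the same name is idempotent: the second call returns the existing `{ id, domain_id }` instead of creating a duplicate.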
Joining to Brands
Brands don't need to be denormalized everywhere. Use domain as the join path:
```sql
-- From ranking to brand via domain:
SELECT r.*, b.name AS brand_name, b.logo_url
FROM app_category_rankings r
JOIN domains d ON r.website_domain_id = d.id
JOIN brands b ON b.primary_domain_id = d.id;

-- From local entity to brand:
SELECT le.*, b.name AS brand_name
FROM local_entities le
JOIN domains d ON d.domain = extract_domain(le.website_url)
JOIN brands b ON b.primary_domain_id = d.id;
```
Denormalization Strategy
We denormalize `brand_id` only on high-volume query tables:
- `apps.brand_id` - Direct FK for fast app→brand lookups
- `app_category_rankings.brand_id` - Avoids joins on ranking queries
Everything else can join through the domain table.
Single Source of Truth Principle
- Keywords are global in Base44/ClickHouse; removing `customer_id` from the master keyword table prevents duplicates when multiple tenants track the same term.
- Ownership is expressed through join tables (`customer_categories`, `category_keywords`, etc.).
- Fact data (positions, searches, CPC) lives only in ClickHouse.
- Entity mutations flow React → Base44 → Worker (for reads) and optionally Worker → ClickHouse for metrics; there is no circular dependency.
Data Flow Examples
User Creates a Category in React
- React POSTs to Base44 → new `category_id`.
- Worker later syncs reference tables (or receives the category during harvest) → inserts into ClickHouse `categories` + `customer_categories`.
- Historical metrics remain untouched.
Harvest Completes
- Worker upserts keywords + relationships in Base44.
- Worker appends rows to `keyword_snapshots` / `monthly_keyword_searches`.
- React reads enriched keyword metadata from Base44 and charts trends with ClickHouse queries.
Result: One canonical definition per entity, append-only analytics, and zero redundant masters.
Domain & URL Classification Pipeline
How domains and URLs enter the system, get classified, and connect to brands.
Entry Points
Domains and URLs enter from multiple sources. Each source provides different data:
| Entry Point | Provides Domain | Provides URL | Provides Brand Name | Notes |
|---|---|---|---|---|
| User adds Asset (UI) | Yes (resolved) | No | No | UI calls `/api/util/resolve-domain` first |
| User adds Competitor (UI) | Yes (resolved) | No | No | Same resolution flow |
| App Store Crawl | Yes (developer_url) | Yes | Yes (developer_name) | Primary brand source |
| Ranked Keywords API | Yes (domain, main_domain) | Yes (url) | Yes (website_name) | Rich data including backlinks_info |
| Backlinks API | Yes (extracted) | Yes (url_from, url_to) | No | Source + target URLs |
Brand Name Resolution
When we need a brand name for a domain:
```mermaid
flowchart TD
    A[Need brand name for domain] --> B{Have website_name<br/>from DataForSEO?}
    B -->|Yes| C[Use website_name]
    B -->|No| D{Have developer_name<br/>from App Store?}
    D -->|Yes| E[Use developer_name]
    D -->|No| F{Brand already exists<br/>for this domain?}
    F -->|Yes| G[Use existing brand.name]
    F -->|No| H[Use domain as brand name<br/>e.g. 'spotify.com']
    C --> I[ensureBrand with name]
    E --> I
    G --> J[Link to existing brand]
    H --> I
    I --> K[brands table]
    J --> K
```
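The fallback chain in the flowchart above reduces to a short pure function. The parameter names are illustrative, and any of the first three inputs may be missing:

```javascript
// Pick a brand name using the documented priority order; the domain
// itself is the last-resort fallback (e.g. "spotify.com").
function resolveBrandName({ websiteName, developerName, existingBrandName, domain }) {
  if (websiteName) return websiteName;             // DataForSEO website_name
  if (developerName) return developerName;         // App Store developer_name
  if (existingBrandName) return existingBrandName; // brand already on this domain
  return domain;
}
```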
Domain Resolution Flow (UI Layer)
When a user manually enters a domain, the UI resolves www vs non-www before saving:
```mermaid
sequenceDiagram
    participant UI as React UI
    participant API as Worker API
    participant DB as D1 Database
    UI->>API: POST /api/util/resolve-domain<br/>{"domain": "mermaidchart.com"}
    API->>API: HTTP HEAD to domain<br/>Check for 301/302/307/308
    API-->>UI: {"resolved": "www.mermaidchart.com", "redirected": true}
    UI->>UI: Show user: "Using www.mermaidchart.com"
    UI->>API: POST /api/assets<br/>{"domain": "www.mermaidchart.com"}
    API->>DB: ensureDomain("www.mermaidchart.com")
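The redirect-interpretation step of this sequence can be isolated as a pure function: given the status and `Location` header returned by the HTTP HEAD request, decide which hostname to persist. `interpretHeadResponse` is a hypothetical helper, not the endpoint's actual code:

```javascript
// Decide the canonical hostname from a HEAD response.
// Only permanent/temporary redirect statuses with a Location count.
function interpretHeadResponse(requestedDomain, status, location) {
  const redirected = [301, 302, 307, 308].includes(status) && Boolean(location);
  if (!redirected) return { resolved: requestedDomain, redirected: false };
  return { resolved: new URL(location).hostname, redirected: true };
}
```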
Complete Data Flow
```mermaid
flowchart TB
    subgraph EntryPoints["Entry Points"]
        UI_Asset["User adds Asset<br/>(UI resolves domain first)"]
        UI_Competitor["User adds Competitor<br/>(UI resolves domain first)"]
        AppCrawl["App Store Crawl<br/>(developer_url, developer_name)"]
        RankedKW["Ranked Keywords API<br/>(domain, url, website_name, backlinks_info)"]
        BacklinksAPI["Backlinks API<br/>(url_from, url_to)"]
    end
    subgraph Resolution["Resolution Layer"]
        ResolveDomain["POST /api/util/resolve-domain<br/>HTTP HEAD → www vs non-www"]
    end
    subgraph Processing["Processing Layer"]
        EnsureDomain["ensureDomain()<br/>Creates/updates domain record"]
        EnsureUrl["ensureUrl()<br/>Creates URL, triggers classification"]
        EnsureBrand["ensureBrand()<br/>Creates brand with fallback name"]
        BrandLogic{"Brand name<br/>available?"}
        BrandFallback["Use domain as brand name"]
    end
    subgraph Classification["Classification Layer"]
        DomainClassify["Domain Classification<br/>(rules → vectorize → LLM)"]
        UrlClassify["URL Classification<br/>(inline for targets, queued for sources)"]
        KeywordClassify["Keyword Classification<br/>(rules → vectorize)"]
    end
    subgraph Storage["Storage Layer"]
        Domains[("domains<br/>domain_type, channel_bucket")]
        Urls[("urls<br/>page_type, classification_source")]
        Brands[("brands<br/>name, primary_domain_id")]
        Keywords[("keywords<br/>funnel_stage, intent_type")]
    end
    %% Entry flows
    UI_Asset --> ResolveDomain
    UI_Competitor --> ResolveDomain
    ResolveDomain --> EnsureDomain
    AppCrawl --> EnsureDomain
    AppCrawl --> EnsureBrand
    RankedKW --> EnsureDomain
    RankedKW --> EnsureUrl
    RankedKW --> BrandLogic
    BrandLogic -->|website_name exists| EnsureBrand
    BrandLogic -->|no website_name| BrandFallback
    BrandFallback --> EnsureBrand
    BacklinksAPI --> EnsureUrl
    %% Processing to classification
    EnsureDomain --> DomainClassify
    EnsureUrl --> UrlClassify
    %% Classification to storage
    DomainClassify --> Domains
    UrlClassify --> Urls
    EnsureBrand --> Brands
    KeywordClassify --> Keywords
    %% Domain is the hub
    Domains -.->|domain_id| Urls
    Domains -.->|primary_domain_id| Brands
```
Classification Queue Flow (Optimized)
Instead of queueing each URL individually, we batch by domain:
```mermaid
flowchart TB
    subgraph Ingestion["Data Ingestion"]
        Backlinks["Fetch 100 backlinks<br/>from DataForSEO"]
    end
    subgraph BatchPrep["Batch Preparation (Inline)"]
        Store["storeBacklinks()"]
        GroupByDomain["Group URLs by domain"]
        CheckCache{"Domain<br/>cached?"}
        Store --> GroupByDomain
        GroupByDomain --> CheckCache
    end
    subgraph Queuing["Smart Queuing"]
        InlineClassify["Classify inline<br/>(target URLs - FREE)"]
        QueueBatch["Queue ONE message per domain<br/>{domain, urls: [...]}"]
        CheckCache -->|"Yes + simple URL"| InlineClassify
        CheckCache -->|"No or complex URL"| QueueBatch
    end
    subgraph Consumer["Queue Consumer"]
        Receive["Receive batched message"]
        ClassifyDomain["Classify domain ONCE"]
        ClassifyUrls["Classify all URLs<br/>using domain cache"]
        Receive --> ClassifyDomain
        ClassifyDomain --> ClassifyUrls
    end
    subgraph Results["Results"]
        DomainTable[("domains")]
        UrlTable[("urls")]
        ClassifyDomain --> DomainTable
        ClassifyUrls --> UrlTable
    end
    Backlinks --> Store
    InlineClassify --> UrlTable
    QueueBatch --> Receive
```
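The "group URLs by domain" step above can be sketched as a pure function that turns a flat URL list into one queue message per domain. `groupUrlsByDomain` is an illustrative name; the real message shape may differ:

```javascript
// Collapse a flat list of URLs into one {domain, urls} message per domain,
// so the consumer classifies each domain once and reuses it for its URLs.
function groupUrlsByDomain(urls) {
  const byDomain = new Map();
  for (const url of urls) {
    const domain = new URL(url).hostname;
    if (!byDomain.has(domain)) byDomain.set(domain, []);
    byDomain.get(domain).push(url);
  }
  return [...byDomain.entries()].map(([domain, urls]) => ({ domain, urls }));
}
```

For a batch of 100 backlinks spread over, say, 12 referring domains, this sends 12 queue messages instead of 100.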
Classification Stages & Costs
The domain classifier uses a 7-stage cost-optimized pipeline - FREE stages run first:
```mermaid
flowchart LR
    subgraph FREE["FREE Stages"]
        S0[0. Cache] --> S1[1. Rules]
        S1 --> S1_5[1.5 Google Ads<br/>Categories]
        S1_5 --> S2[2. Vectorize]
        S2 --> S3[3. Low-Noise<br/>Crawl]
    end
    subgraph PAID["PAID (if needed)"]
        S3 --> S4[4. Instant Pages]
        S4 --> S4_5[4.5 Domain Patterns]
        S4_5 --> S5[5. LLM]
    end
    S3 -->|"≥70%"| DONE[Done]
    S4 -->|"≥70%"| DONE
    S5 --> DONE
    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2
```
| Stage | Domain | URL (Source) | URL (Target) | Keyword | Cost |
|---|---|---|---|---|---|
| 0. Cache Check | ✓ | ✓ | - | - | FREE |
| 1. Rules Engine | ✓ | ✓ | ✓ | ✓ | FREE |
| 1.5 Google Ads Categories | ✓ | - | - | - | FREE |
| 2. Vectorize | ✓ | ✓ | - | ✓ | FREE |
| 3. Low-Noise Crawl | ✓ | - | - | - | FREE |
| 4. Instant Pages | ✓ | ✓ | - | - | $0.000125 |
| 4.5 Domain Patterns | ✓ | - | - | - | FREE |
| 5. LLM Fallback | ✓ | ✓ | - | - | ~$0.0001 |
Low-Noise Crawl (Stage 3) - The key cost optimization:
- HEAD request + partial GET (Range: 0-8KB)
- Extracts `<head>` metadata: title, description, og:type, generator
- CMS detection: WordPress, Shopify, Ghost, Wix, etc.
- Handles ~70% of domains without needing Instant Pages
Target URLs are always FREE - rules-only classification, no external calls.
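The Stage 3 extraction step can be sketched with a few regexes over the first ~8 KB of HTML returned by the Range GET. This is a simplified model (attribute order is assumed to be `name` before `content`); the real parser and CMS heuristics may differ:

```javascript
// Pull title/description/generator out of a partial HTML chunk.
function extractHeadMetadata(htmlChunk) {
  const pick = (re) => {
    const m = htmlChunk.match(re);
    return m ? m[1].trim() : null;
  };
  const generator = pick(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']+)["']/i);
  return {
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    description: pick(/<meta[^>]+name=["']description["'][^>]+content=["']([^"']+)["']/i),
    generator,
    // Crude CMS hint: "WordPress 6.4" → "WordPress".
    cms: generator ? generator.split(/[\s/]/)[0] : null,
  };
}
```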
Key Optimizations
- Domain Resolution in UI - User sees resolved domain before saving, backend uses as-is
- Target URLs Inline - No queue hop for FREE classification
- Batch by Domain - One queue message per domain, not per URL
- Domain Cache - Classify domain once, reuse for all its URLs
- website_name → Brand - Auto-populate brands from DataForSEO data
- Domain as Fallback Brand - When no brand name available, use domain itself