
Data Architecture & Redundancy

Why Base44 and ClickHouse can coexist without duplicating sources of truth.

Layer Responsibilities

| Layer | Owns | Writes | Reads |
| --- | --- | --- | --- |
| React + Base44 | Entities (projects, categories, keywords, business types, relationships) | React + worker upserts entities through Base44 API. | React UI, worker validation. |
| Cloudflare Worker | No authoritative data | Orchestrates enrichment → persistence. | Reads & writes KV/R2, Base44, ClickHouse, DataForSEO. |
| Cloudflare D1 | App store data (apps, rankings, categories, brands) | Worker writes via queue consumers. | API endpoints, React UI. |
| Cloudflare Images | App icons (uploaded from Apple/Google CDNs) | Worker uploads during app enrichment. | Frontend via delivery URLs. |
| ClickHouse | Time-series facts & analytical caches | Worker appends snapshots + monthly aggregates. | Reporting, dashboards, diagnostics. |

The worker never mutates data directly inside React/Base44; it uses authenticated API calls so Base44 remains canonical. ClickHouse never pushes back—everything flows Base44 → ClickHouse.

Base44 (Source of Truth)

  • Stores entity definitions with metadata (names, slugs, colors, owners).
  • Relationships (category_keywords, project_keywords) live here, including confidence, assigned_by, tags, etc.
  • Business types, confirmed categories, and user edits are persisted via worker API calls.
  • Any edit made in React is immediately authoritative; ClickHouse copies are refreshed afterwards but never become the master.
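To make the write path concrete, here is a minimal sketch of the worker upserting an entity through the Base44 API. The endpoint path, payload shape, and env bindings (`BASE44_URL`, `BASE44_API_KEY`) are illustrative assumptions, not the actual Base44 API surface:

```javascript
// Build the payload the worker is allowed to send. User edits made in
// React stay authoritative; the worker only upserts enrichment fields.
function buildCategoryUpsert(category) {
  return {
    slug: category.slug,
    name: category.name,
    color: category.color ?? null,
  };
}

// Hypothetical upsert call — Base44 remains the canonical store.
async function upsertCategory(env, category) {
  const res = await fetch(`${env.BASE44_URL}/api/entities/categories`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${env.BASE44_API_KEY}`,
    },
    body: JSON.stringify(buildCategoryUpsert(category)),
  });
  if (!res.ok) throw new Error(`Base44 upsert failed: ${res.status}`);
  return res.json();
}
```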

ClickHouse (Analytics Store)

  • Houses append-only fact tables:
    • keyword_snapshots – daily/hourly metrics per keyword/category/customer.
    • monthly_keyword_searches – aggregated trend history.
  • Mirrors Base44 reference tables (customers, categories, keywords, category_keywords, customer_categories) strictly for query speed. They are populated via nightly syncs or worker upserts, not user edits.
  • Denormalization in fact tables is intentional (keyword text, search volume) to keep queries fast; replacement comes from Base44 + DataForSEO.
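The append-only write path can be sketched with ClickHouse's HTTP interface and its JSONEachRow format (one JSON object per line). Table and column names mirror this doc; the env bindings (`CLICKHOUSE_URL`, `CLICKHOUSE_AUTH`) are assumptions:

```javascript
// ClickHouse's JSONEachRow input format is one JSON object per line.
function toJsonEachRow(rows) {
  return rows.map((r) => JSON.stringify(r)).join("\n");
}

// Append snapshot facts — inserts only, never updates.
async function appendSnapshots(env, rows) {
  const query = "INSERT INTO keyword_snapshots FORMAT JSONEachRow";
  const res = await fetch(
    `${env.CLICKHOUSE_URL}/?query=${encodeURIComponent(query)}`,
    {
      method: "POST",
      headers: { authorization: `Basic ${env.CLICKHOUSE_AUTH}` },
      body: toJsonEachRow(rows),
    },
  );
  if (!res.ok) throw new Error(`ClickHouse insert failed: ${res.status}`);
}
```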

Worker Glue

  1. Enrichment – fetch HTML, run LLM, store metadata in KV (not authoritative).
  2. Harvest – combine DataForSEO + AI-generated keywords.
  3. Persistence – upsert entities in Base44, append facts to ClickHouse, leave R2/KV breadcrumbs for observability.
  4. App Enrichment – fetch app details via HTML scrape + iTunes API, upload icons to CF Images, save to D1.
  5. Diagnostics – endpoints expose run state but never override Base44/ClickHouse authority.

Cloudflare Images (Icon Storage)

App icons are stored in Cloudflare Images to avoid hotlinking external CDNs:

| Aspect | Details |
| --- | --- |
| Why | Apple's mzstatic.com and Google's CDN URLs should not be hotlinked; they may block or rate limit. |
| Upload | src/lib/cloudflare-images.js uploads icons with ID {platform}_{app_id} for deduplication. |
| Delivery | URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public. |
| Storage | icon_url column in apps table stores the CF delivery URL (not the original CDN URL). |
| Denormalization | app_icon_url in app_category_rankings also stores the CF delivery URL for fast queries. |
| Fallback | If the CF Images upload fails, the original URL is preserved as a fallback. |
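A minimal sketch of this upload flow, assuming the Cloudflare Images upload-via-URL form fields (`url`, `id`) and illustrative env binding names (`CF_ACCOUNT_ID`, `CF_IMAGES_TOKEN`, `CF_ACCOUNT_HASH`):

```javascript
// Deterministic ID so re-crawling the same app dedupes the upload.
function iconImageId(platform, appId) {
  return `${platform}_${appId}`;
}

async function uploadIcon(env, platform, appId, sourceUrl) {
  const form = new FormData();
  form.set("url", sourceUrl); // Cloudflare fetches the icon server-side
  form.set("id", iconImageId(platform, appId));
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.CF_ACCOUNT_ID}/images/v1`,
    {
      method: "POST",
      headers: { authorization: `Bearer ${env.CF_IMAGES_TOKEN}` },
      body: form,
    },
  );
  if (!res.ok) return sourceUrl; // fallback: keep the original CDN URL
  const { result } = await res.json();
  return `https://imagedelivery.net/${env.CF_ACCOUNT_HASH}/${result.id}/public`;
}
```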

Cloudflare D1 (App Store Data)

D1 stores all app-store-related data:

| Table | Purpose |
| --- | --- |
| apps | Canonical app records with full metadata, CF Images icon URLs. |
| app_category_rankings | Denormalized ranking snapshots with app metadata for fast queries. |
| app_store_categories | Category hierarchy for Apple App Store and Google Play. |
| brands | Developer/publisher brands with domain associations. |
| app_recommendations | Similar apps and relationships discovered during crawling. |
| domains / urls | Normalized URL and domain storage for deduplication. |
| local_entities | Google Business Profile listings (CIDs) for local SEO tracking. |
| keywords | Global keyword repository with classification data. |
| domain_keyword_rankings | Weekly snapshots of what keywords a domain ranks for. |

Brand Architecture

Brands are the central entity connecting apps, domains, websites, local listings, and keyword rankings. The domain is the hub: the connective tissue between all of this data.

```
                           BRAND
                             │ primary_domain_id
                             ▼
┌──────────────────────────────────────────────────────────┐
│                         DOMAINS                          │
│                      (the hub/pivot)                     │
│                                                          │
│   spotify.com  ←─ This is what connects everything       │
└──────────────────────────────────────────────────────────┘
      ▲              ▲               ▲              ▲
      │              │               │              │
 ┌────┴────┐   ┌─────┴─────┐   ┌─────┴─────┐   ┌────┴────┐
 │  URLs   │   │   APPS    │   │ RANKINGS  │   │  LOCAL  │
 │         │   │           │   │           │   │ENTITIES │
 │domain_id│   │ website_  │   │ domain_id │   │website_ │
 │         │   │ domain_id │   │           │   │ url→    │
 └─────────┘   └───────────┘   └───────────┘   └─────────┘
```

Brand Table Schema

| Column | Type | Purpose |
| --- | --- | --- |
| id | INTEGER | Auto-increment primary key |
| name | TEXT UNIQUE | Developer/company name (exact match) |
| normalized_name | TEXT | Lowercase, suffix-stripped for fuzzy matching |
| primary_domain | TEXT | Denormalized domain string for display |
| primary_domain_id | INTEGER | FK to domains.id |
| logo_url | TEXT | CF Images URL or original icon |
| flagship_app_id | TEXT | Best app for this brand (highest ratings) |
| app_count | INTEGER | Denormalized count for quick display |
| source | TEXT | How the brand was created: auto_crawl, dataforseo_webhook, manual, admin_backfill |
| locked | INTEGER | If 1, automation won't overwrite |

Brand Creation Flow

Brands are created automatically during app crawling via src/lib/brand-management.js:

  1. App Details Consumer (src/queue/app-details-consumer.js) - Apple/Google app crawls
  2. DataForSEO App Info Webhook (src/endpoints/dataforseo-app-info-webhook.js) - Postback processing
  3. DataForSEO App List Webhook (src/endpoints/dataforseo-app-list-webhook.js) - Chart ranking imports
  4. Admin Backfill (src/endpoints/admin-populate-brands.js) - Bulk populate from existing apps

All flows use ensureBrand() which:

  • Checks if brand exists by exact name match
  • Creates domain record via ensureDomain() if website URL provided
  • Respects locked flag - won't update locked brands
  • Returns { id, domain_id } for linking to apps/rankings
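The decision rules above can be sketched as a small pure planner. This is not the actual `ensureBrand()` implementation — field and function names here are assumptions based on this doc, with the D1 reads/writes left out so only the rules show:

```javascript
// Decide what ensureBrand() should do for a given existing row
// (or null) and an optional resolved domain id.
function planBrandWrite(existing, domainId) {
  if (existing && existing.locked === 1) {
    // Locked brands are never touched by automation.
    return { action: "noop", id: existing.id, domain_id: existing.primary_domain_id };
  }
  if (existing) {
    // Only backfill a missing primary domain; never clobber one.
    const shouldUpdate = Boolean(domainId) && !existing.primary_domain_id;
    return {
      action: shouldUpdate ? "update" : "noop",
      id: existing.id,
      domain_id: domainId ?? existing.primary_domain_id,
    };
  }
  // No exact name match: insert a new brand row.
  return { action: "insert", id: null, domain_id: domainId ?? null };
}
```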

Joining to Brands

Brands don't need to be denormalized everywhere. Use domain as the join path:

```sql
-- From ranking to brand via domain:
SELECT r.*, b.name AS brand_name, b.logo_url
FROM app_category_rankings r
JOIN domains d ON r.website_domain_id = d.id
JOIN brands b ON b.primary_domain_id = d.id;

-- From local entity to brand:
SELECT le.*, b.name AS brand_name
FROM local_entities le
JOIN domains d ON d.domain = extract_domain(le.website_url)
JOIN brands b ON b.primary_domain_id = d.id;
```

Denormalization Strategy

We denormalize brand_id only on high-volume query tables:

  • apps.brand_id - Direct FK for fast app→brand lookups
  • app_category_rankings.brand_id - Avoid joins on ranking queries

Everything else can join through the domain table.

Single Source of Truth Principle

  • Keywords are global in Base44/ClickHouse; removing customer_id from the master keyword table prevents duplicates when multiple tenants track the same term.
  • Ownership is expressed through join tables (customer_categories, category_keywords, etc.).
  • Fact data (positions, searches, CPC) lives only in ClickHouse.
  • Entity mutations flow React → Base44 → Worker (for read) and optionally Worker → ClickHouse for metrics; there is no circular dependency.

Data Flow Examples

User Creates a Category in React

  1. React POSTs to Base44 → new category_id.
  2. Worker later syncs reference tables (or receives the category during harvest) → inserts into ClickHouse categories + customer_categories.
  3. Historical metrics remain untouched.

Harvest Completes

  1. Worker upserts keywords + relationships in Base44.
  2. Worker appends rows to keyword_snapshots/monthly_keyword_searches.
  3. React reads enriched keyword metadata from Base44 and charts trends with ClickHouse queries.

Result: One canonical definition per entity, append-only analytics, and zero redundant masters.


Domain & URL Classification Pipeline

How domains and URLs enter the system, get classified, and connect to brands.

Entry Points

Domains and URLs enter from multiple sources. Each source provides different data:

| Entry Point | Provides Domain | Provides URL | Provides Brand Name | Notes |
| --- | --- | --- | --- | --- |
| User adds Asset (UI) | Yes (resolved) | No | No | UI calls /api/util/resolve-domain first |
| User adds Competitor (UI) | Yes (resolved) | No | No | Same resolution flow |
| App Store Crawl | Yes (developer_url) | Yes | Yes (developer_name) | Primary brand source |
| Ranked Keywords API | Yes (domain, main_domain) | Yes (url) | Yes (website_name) | Rich data including backlinks_info |
| Backlinks API | Yes (extracted) | Yes (url_from, url_to) | No | Source + target URLs |

Brand Name Resolution

When we need a brand name for a domain:

```mermaid
flowchart TD
    A[Need brand name for domain] --> B{Have website_name<br/>from DataForSEO?}
    B -->|Yes| C[Use website_name]
    B -->|No| D{Have developer_name<br/>from App Store?}
    D -->|Yes| E[Use developer_name]
    D -->|No| F{Brand already exists<br/>for this domain?}
    F -->|Yes| G[Use existing brand.name]
    F -->|No| H[Use domain as brand name<br/>e.g. 'spotify.com']

    C --> I[ensureBrand with name]
    E --> I
    G --> J[Link to existing brand]
    H --> I

    I --> K[brands table]
    J --> K
```
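The fallback chain above reduces to a one-line pure function. Field names (`website_name`, `developer_name`) come from this doc; the exact signature is an assumption:

```javascript
// First non-empty wins: DataForSEO name, then App Store developer
// name, then an existing brand's name, then the domain itself.
function resolveBrandName({ websiteName, developerName, existingBrandName, domain }) {
  return websiteName || developerName || existingBrandName || domain;
}
```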

Domain Resolution Flow (UI Layer)

When a user manually enters a domain, the UI resolves www vs non-www before saving:

```mermaid
sequenceDiagram
    participant UI as React UI
    participant API as Worker API
    participant DB as D1 Database

    UI->>API: POST /api/util/resolve-domain<br/>{"domain": "mermaidchart.com"}
    API->>API: HTTP HEAD to domain<br/>Check for 301/302/307/308
    API-->>UI: {"resolved": "www.mermaidchart.com", "redirected": true}
    UI->>UI: Show user: "Using www.mermaidchart.com"
    UI->>API: POST /api/assets<br/>{"domain": "www.mermaidchart.com"}
    API->>DB: ensureDomain("www.mermaidchart.com")
```
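A sketch of the resolve-domain check: a HEAD request with manual redirect handling, returning the Location host when the origin redirects to its www (or non-www) form. Helper names are assumptions:

```javascript
// Pure decision step, split out so it is easy to test without network.
function resolveFromResponse(domain, status, location) {
  const redirected = [301, 302, 307, 308].includes(status) && Boolean(location);
  if (!redirected) return { resolved: domain, redirected: false };
  return { resolved: new URL(location).hostname, redirected: true };
}

// Illustrative endpoint body for POST /api/util/resolve-domain.
async function resolveDomain(domain) {
  const res = await fetch(`https://${domain}/`, {
    method: "HEAD",
    redirect: "manual", // inspect the redirect instead of following it
  });
  return resolveFromResponse(domain, res.status, res.headers.get("location"));
}
```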

Complete Data Flow

```mermaid
flowchart TB
    subgraph EntryPoints["Entry Points"]
        UI_Asset["User adds Asset<br/>(UI resolves domain first)"]
        UI_Competitor["User adds Competitor<br/>(UI resolves domain first)"]
        AppCrawl["App Store Crawl<br/>(developer_url, developer_name)"]
        RankedKW["Ranked Keywords API<br/>(domain, url, website_name, backlinks_info)"]
        BacklinksAPI["Backlinks API<br/>(url_from, url_to)"]
    end

    subgraph Resolution["Resolution Layer"]
        ResolveDomain["POST /api/util/resolve-domain<br/>HTTP HEAD → www vs non-www"]
    end

    subgraph Processing["Processing Layer"]
        EnsureDomain["ensureDomain()<br/>Creates/updates domain record"]
        EnsureUrl["ensureUrl()<br/>Creates URL, triggers classification"]
        EnsureBrand["ensureBrand()<br/>Creates brand with fallback name"]

        BrandLogic{"Brand name<br/>available?"}
        BrandFallback["Use domain as brand name"]
    end

    subgraph Classification["Classification Layer"]
        DomainClassify["Domain Classification<br/>(rules → vectorize → LLM)"]
        UrlClassify["URL Classification<br/>(inline for targets, queued for sources)"]
        KeywordClassify["Keyword Classification<br/>(rules → vectorize)"]
    end

    subgraph Storage["Storage Layer"]
        Domains[("domains<br/>domain_type, channel_bucket")]
        Urls[("urls<br/>page_type, classification_source")]
        Brands[("brands<br/>name, primary_domain_id")]
        Keywords[("keywords<br/>funnel_stage, intent_type")]
    end

    %% Entry flows
    UI_Asset --> ResolveDomain
    UI_Competitor --> ResolveDomain
    ResolveDomain --> EnsureDomain

    AppCrawl --> EnsureDomain
    AppCrawl --> EnsureBrand

    RankedKW --> EnsureDomain
    RankedKW --> EnsureUrl
    RankedKW --> BrandLogic
    BrandLogic -->|website_name exists| EnsureBrand
    BrandLogic -->|no website_name| BrandFallback
    BrandFallback --> EnsureBrand

    BacklinksAPI --> EnsureUrl

    %% Processing to classification
    EnsureDomain --> DomainClassify
    EnsureUrl --> UrlClassify

    %% Classification to storage
    DomainClassify --> Domains
    UrlClassify --> Urls
    EnsureBrand --> Brands
    KeywordClassify --> Keywords

    %% Domain is the hub
    Domains -.->|domain_id| Urls
    Domains -.->|primary_domain_id| Brands
```

Classification Queue Flow (Optimized)

Instead of queueing each URL individually, we batch by domain:

```mermaid
flowchart TB
    subgraph Ingestion["Data Ingestion"]
        Backlinks["Fetch 100 backlinks<br/>from DataForSEO"]
    end

    subgraph BatchPrep["Batch Preparation (Inline)"]
        Store["storeBacklinks()"]
        GroupByDomain["Group URLs by domain"]
        CheckCache{"Domain<br/>cached?"}

        Store --> GroupByDomain
        GroupByDomain --> CheckCache
    end

    subgraph Queuing["Smart Queuing"]
        InlineClassify["Classify inline<br/>(target URLs - FREE)"]
        QueueBatch["Queue ONE message per domain<br/>{domain, urls: [...]}"]

        CheckCache -->|"Yes + simple URL"| InlineClassify
        CheckCache -->|"No or complex URL"| QueueBatch
    end

    subgraph Consumer["Queue Consumer"]
        Receive["Receive batched message"]
        ClassifyDomain["Classify domain ONCE"]
        ClassifyUrls["Classify all URLs<br/>using domain cache"]

        Receive --> ClassifyDomain
        ClassifyDomain --> ClassifyUrls
    end

    subgraph Results["Results"]
        DomainTable[("domains")]
        UrlTable[("urls")]

        ClassifyDomain --> DomainTable
        ClassifyUrls --> UrlTable
    end

    Backlinks --> Store
    InlineClassify --> UrlTable
    QueueBatch --> Receive
```
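The batching step is the core of this optimization. A minimal sketch, using the hostname as the grouping key (the real grouping logic may normalize further):

```javascript
// Group backlink URLs by domain and emit one queue message per domain
// ({domain, urls: [...]}) instead of one message per URL.
function batchByDomain(urls) {
  const groups = new Map();
  for (const url of urls) {
    const domain = new URL(url).hostname;
    if (!groups.has(domain)) groups.set(domain, []);
    groups.get(domain).push(url);
  }
  return [...groups.entries()].map(([domain, list]) => ({ domain, urls: list }));
}
```

With 100 backlinks spread across, say, 20 domains, this sends 20 queue messages rather than 100, and each consumer invocation classifies its domain once.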

Classification Stages & Costs

The domain classifier uses a 7-stage cost-optimized pipeline - FREE stages run first:

```mermaid
flowchart LR
    subgraph FREE["FREE Stages"]
        S0[0. Cache] --> S1[1. Rules]
        S1 --> S1_5[1.5 Google Ads<br/>Categories]
        S1_5 --> S2[2. Vectorize]
        S2 --> S3[3. Low-Noise<br/>Crawl]
    end

    subgraph PAID["PAID (if needed)"]
        S3 --> S4[4. Instant Pages]
        S4 --> S4_5[4.5 Domain Patterns]
        S4_5 --> S5[5. LLM]
    end

    S3 -->|"≥70%"| DONE[Done]
    S4 -->|"≥70%"| DONE
    S5 --> DONE

    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2
```

| Stage | Domain | URL (Source) | URL (Target) | Keyword | Cost |
| --- | --- | --- | --- | --- | --- |
| 0. Cache Check | ✓ | ✓ | - | - | FREE |
| 1. Rules Engine | ✓ | ✓ | ✓ | ✓ | FREE |
| 1.5 Google Ads Categories | ✓ | - | - | - | FREE |
| 2. Vectorize | ✓ | ✓ | - | ✓ | FREE |
| 3. Low-Noise Crawl | ✓ | - | - | - | FREE |
| 4. Instant Pages | ✓ | ✓ | - | - | $0.000125 |
| 4.5 Domain Patterns | ✓ | - | - | - | FREE |
| 5. LLM Fallback | ✓ | ✓ | - | - | ~$0.0001 |

Low-Noise Crawl (Stage 3) - The key cost optimization:

  • HEAD request + partial GET (Range: 0-8KB)
  • Extracts <head> metadata: title, description, og:type, generator
  • CMS detection: WordPress, Shopify, Ghost, Wix, etc.
  • Handles ~70% of domains without needing Instant Pages

Target URLs are always FREE - rules-only classification, no external calls.
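A sketch of the low-noise crawl: fetch only the first 8 KB and pull classification signals out of `<head>` with cheap regexes. The extracted signals mirror the bullets above; the parsing here is illustrative, not the production extractor:

```javascript
// Pull title/description/og:type/generator from a partial HTML head.
function extractHeadSignals(html) {
  const pick = (re) => (html.match(re) || [])[1]?.trim() ?? null;
  return {
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    description: pick(/<meta[^>]+name=["']description["'][^>]+content=["']([^"']*)/i),
    ogType: pick(/<meta[^>]+property=["']og:type["'][^>]+content=["']([^"']*)/i),
    generator: pick(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']*)/i),
  };
}

async function lowNoiseCrawl(domain) {
  const res = await fetch(`https://${domain}/`, {
    headers: { range: "bytes=0-8191" }, // partial GET: first 8 KB only
  });
  return extractHeadSignals(await res.text());
}
```

The generator tag is what makes CMS detection cheap: WordPress, Shopify, Ghost, and Wix all announce themselves in the first few kilobytes.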

Key Optimizations

  1. Domain Resolution in UI - User sees the resolved domain before saving; the backend uses it as-is
  2. Target URLs Inline - No queue hop for FREE classification
  3. Batch by Domain - One queue message per domain, not per URL
  4. Domain Cache - Classify domain once, reuse for all its URLs
  5. website_name → Brand - Auto-populate brands from DataForSEO data
  6. Domain as Fallback Brand - When no brand name available, use domain itself