
Data Architecture & Redundancy

Why Base44 and ClickHouse can coexist without duplicating sources of truth.

Layer Responsibilities

| Layer | Owns | Writes | Reads |
| --- | --- | --- | --- |
| React + Base44 | Entities (projects, categories, keywords, business types, relationships) | React + worker upserts entities through Base44 API. | React UI, worker validation. |
| Cloudflare Worker | No authoritative data | Orchestrates enrichment → persistence. | Reads & writes KV/R2, Base44, ClickHouse, DataForSEO. |
| Cloudflare D1 | App store data (apps, rankings, categories, brands) | Worker writes via queue consumers. | API endpoints, React UI. |
| Cloudflare Images | App icons (uploaded from Apple/Google CDNs) | Worker uploads during app enrichment. | Frontend via delivery URLs. |
| ClickHouse | Time-series facts & analytical caches | Worker appends snapshots + monthly aggregates. | Reporting, dashboards, diagnostics. |

The worker never mutates data directly inside React/Base44; it uses authenticated API calls so Base44 remains canonical. ClickHouse never pushes back—everything flows Base44 → ClickHouse.

Base44 (Source of Truth)

  • Stores entity definitions with metadata (names, slugs, colors, owners).
  • Relationships (category_keywords, project_keywords) live here, including confidence, assigned_by, tags, etc.
  • Business types, confirmed categories, and user edits are persisted via worker API calls.
  • Any edit made in React is immediately authoritative; ClickHouse copies are refreshed afterwards but never become the master.
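To make the write path concrete, here is a minimal sketch of the worker upserting an entity through the Base44 API. The endpoint path, payload shape, and env bindings (`BASE44_URL`, `BASE44_API_KEY`) are illustrative assumptions, not the actual Base44 API surface:

```javascript
// Build the payload the worker is allowed to send. User edits made in
// React stay authoritative; the worker only upserts enrichment fields.
function buildCategoryUpsert(category) {
  return {
    slug: category.slug,
    name: category.name,
    color: category.color ?? null,
  };
}

// Hypothetical upsert call — Base44 remains the canonical store.
async function upsertCategory(env, category) {
  const res = await fetch(`${env.BASE44_URL}/api/entities/categories`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${env.BASE44_API_KEY}`,
    },
    body: JSON.stringify(buildCategoryUpsert(category)),
  });
  if (!res.ok) throw new Error(`Base44 upsert failed: ${res.status}`);
  return res.json();
}
```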

ClickHouse (Analytics Store)

  • Houses append-only fact tables:
    • keyword_snapshots – daily/hourly metrics per keyword/category/customer.
    • monthly_keyword_searches – aggregated trend history.
  • Mirrors Base44 reference tables (customers, categories, keywords, category_keywords, customer_categories) strictly for query speed. They are populated via nightly syncs or worker upserts, not user edits.
  • Denormalization in fact tables is intentional (keyword text, search volume) to keep queries fast; replacement comes from Base44 + DataForSEO.
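The append-only write path can be sketched with ClickHouse's HTTP interface and its JSONEachRow format (one JSON object per line). Table and column names mirror this doc; the env bindings (`CLICKHOUSE_URL`, `CLICKHOUSE_AUTH`) are assumptions:

```javascript
// ClickHouse's JSONEachRow input format is one JSON object per line.
function toJsonEachRow(rows) {
  return rows.map((r) => JSON.stringify(r)).join("\n");
}

// Append snapshot facts — inserts only, never updates.
async function appendSnapshots(env, rows) {
  const query = "INSERT INTO keyword_snapshots FORMAT JSONEachRow";
  const res = await fetch(
    `${env.CLICKHOUSE_URL}/?query=${encodeURIComponent(query)}`,
    {
      method: "POST",
      headers: { authorization: `Basic ${env.CLICKHOUSE_AUTH}` },
      body: toJsonEachRow(rows),
    },
  );
  if (!res.ok) throw new Error(`ClickHouse insert failed: ${res.status}`);
}
```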

Worker Glue

  1. Enrichment – fetch HTML, run LLM, store metadata in KV (not authoritative).
  2. Harvest – combine DataForSEO + AI-generated keywords.
  3. Persistence – upsert entities in Base44, append facts to ClickHouse, leave R2/KV breadcrumbs for observability.
  4. App Enrichment – fetch app details via HTML scrape + iTunes API, upload icons to CF Images, save to D1.
  5. Diagnostics – endpoints expose run state but never override Base44/ClickHouse authority.

Cloudflare Images (Icon Storage)

App icons are stored in Cloudflare Images to avoid hotlinking external CDNs:

| Aspect | Details |
| --- | --- |
| Why | Apple's mzstatic.com and Google's CDN URLs should not be hotlinked; they may block or rate limit. |
| Upload | src/lib/cloudflare-images.js uploads icons with ID {platform}_{app_id} for deduplication. |
| Delivery | URLs in format https://imagedelivery.net/{account_hash}/{image_id}/public. |
| Storage | icon_url column in apps table stores the CF delivery URL (not the original CDN URL). |
| Denormalization | app_icon_url in app_category_rankings also stores the CF delivery URL for fast queries. |
| Fallback | If the CF Images upload fails, the original URL is preserved as a fallback. |
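A minimal sketch of this upload flow, assuming the Cloudflare Images upload-via-URL form fields (`url`, `id`) and illustrative env binding names (`CF_ACCOUNT_ID`, `CF_IMAGES_TOKEN`, `CF_ACCOUNT_HASH`):

```javascript
// Deterministic ID so re-crawling the same app dedupes the upload.
function iconImageId(platform, appId) {
  return `${platform}_${appId}`;
}

async function uploadIcon(env, platform, appId, sourceUrl) {
  const form = new FormData();
  form.set("url", sourceUrl); // Cloudflare fetches the icon server-side
  form.set("id", iconImageId(platform, appId));
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.CF_ACCOUNT_ID}/images/v1`,
    {
      method: "POST",
      headers: { authorization: `Bearer ${env.CF_IMAGES_TOKEN}` },
      body: form,
    },
  );
  if (!res.ok) return sourceUrl; // fallback: keep the original CDN URL
  const { result } = await res.json();
  return `https://imagedelivery.net/${env.CF_ACCOUNT_HASH}/${result.id}/public`;
}
```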

Cloudflare D1 (App Store Data)

D1 stores all app-store-related data:

| Table | Purpose |
| --- | --- |
| apps | Canonical app records with full metadata, CF Images icon URLs. |
| app_category_rankings | Denormalized ranking snapshots with app metadata for fast queries. |
| app_store_categories | Category hierarchy for Apple App Store and Google Play. |
| brands | Developer/publisher brands with domain associations. |
| app_recommendations | Similar apps and relationships discovered during crawling. |
| domains / urls | Normalized URL and domain storage for deduplication. |
| local_entities | Google Business Profile listings (CIDs) for local SEO tracking. |
| keywords | Global keyword repository with classification data. |
| domain_keyword_rankings | Weekly snapshots of what keywords a domain ranks for. |

Brand Architecture

Brands are the central entity connecting apps, domains, websites, local listings, and keyword rankings. The domain is the hub: the connective tissue between all of this data.

```
                           BRAND
                             │ primary_domain_id
                             ▼
┌──────────────────────────────────────────────────────────┐
│                         DOMAINS                          │
│                      (the hub/pivot)                     │
│                                                          │
│   spotify.com  ←─ This is what connects everything       │
└──────────────────────────────────────────────────────────┘
      ▲              ▲               ▲              ▲
      │              │               │              │
 ┌────┴────┐   ┌─────┴─────┐   ┌─────┴─────┐   ┌────┴────┐
 │  URLs   │   │   APPS    │   │ RANKINGS  │   │  LOCAL  │
 │         │   │           │   │           │   │ENTITIES │
 │domain_id│   │ website_  │   │ domain_id │   │website_ │
 │         │   │ domain_id │   │           │   │ url→    │
 └─────────┘   └───────────┘   └───────────┘   └─────────┘
```

Brand Table Schema

| Column | Type | Purpose |
| --- | --- | --- |
| id | INTEGER | Auto-increment primary key |
| name | TEXT UNIQUE | Developer/company name (exact match) |
| normalized_name | TEXT | Lowercase, suffix-stripped for fuzzy matching |
| primary_domain | TEXT | Denormalized domain string for display |
| primary_domain_id | INTEGER | FK to domains.id |
| logo_url | TEXT | CF Images URL or original icon |
| flagship_app_id | TEXT | Best app for this brand (highest ratings) |
| app_count | INTEGER | Denormalized count for quick display |
| source | TEXT | How the brand was created: auto_crawl, dataforseo_webhook, manual, admin_backfill |
| locked | INTEGER | If 1, automation won't overwrite |

Brand Creation Flow

Brands are created automatically during app crawling via src/lib/brand-management.js:

  1. App Details Consumer (src/queue/app-details-consumer.js) - Apple/Google app crawls
  2. DataForSEO App Info Webhook (src/endpoints/dataforseo-app-info-webhook.js) - Postback processing
  3. DataForSEO App List Webhook (src/endpoints/dataforseo-app-list-webhook.js) - Chart ranking imports
  4. Admin Backfill (src/endpoints/admin-populate-brands.js) - Bulk populate from existing apps

All flows use ensureBrand() which:

  • Checks if brand exists by exact name match
  • Creates domain record via ensureDomain() if website URL provided
  • Respects locked flag - won't update locked brands
  • Returns { id, domain_id } for linking to apps/rankings
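The decision rules above can be sketched as a small pure planner. This is not the actual `ensureBrand()` implementation — field and function names here are assumptions based on this doc, with the D1 reads/writes left out so only the rules show:

```javascript
// Decide what ensureBrand() should do for a given existing row
// (or null) and an optional resolved domain id.
function planBrandWrite(existing, domainId) {
  if (existing && existing.locked === 1) {
    // Locked brands are never touched by automation.
    return { action: "noop", id: existing.id, domain_id: existing.primary_domain_id };
  }
  if (existing) {
    // Only backfill a missing primary domain; never clobber one.
    const shouldUpdate = Boolean(domainId) && !existing.primary_domain_id;
    return {
      action: shouldUpdate ? "update" : "noop",
      id: existing.id,
      domain_id: domainId ?? existing.primary_domain_id,
    };
  }
  // No exact name match: insert a new brand row.
  return { action: "insert", id: null, domain_id: domainId ?? null };
}
```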

Joining to Brands

Brands don't need to be denormalized everywhere. Use domain as the join path:

```sql
-- From ranking to brand via domain:
SELECT r.*, b.name AS brand_name, b.logo_url
FROM app_category_rankings r
JOIN domains d ON r.website_domain_id = d.id
JOIN brands b ON b.primary_domain_id = d.id;

-- From local entity to brand:
SELECT le.*, b.name AS brand_name
FROM local_entities le
JOIN domains d ON d.domain = extract_domain(le.website_url)
JOIN brands b ON b.primary_domain_id = d.id;
```

Denormalization Strategy

We denormalize brand_id only on high-volume query tables:

  • apps.brand_id - Direct FK for fast app→brand lookups
  • app_category_rankings.brand_id - Avoid joins on ranking queries

Everything else can join through the domain table.

Single Source of Truth Principle

  • Keywords are global in Base44/ClickHouse; removing customer_id from the master keyword table prevents duplicates when multiple tenants track the same term.
  • Ownership is expressed through join tables (customer_categories, category_keywords, etc.).
  • Fact data (positions, searches, CPC) lives only in ClickHouse.
  • Entity mutations flow React → Base44 → Worker (for read) and optionally Worker → ClickHouse for metrics; there is no circular dependency.

Data Flow Examples

User Creates a Category in React

  1. React POSTs to Base44 → new category_id.
  2. Worker later syncs reference tables (or receives the category during harvest) → inserts into ClickHouse categories + customer_categories.
  3. Historical metrics remain untouched.

Harvest Completes

  1. Worker upserts keywords + relationships in Base44.
  2. Worker appends rows to keyword_snapshots/monthly_keyword_searches.
  3. React reads enriched keyword metadata from Base44 and charts trends with ClickHouse queries.

Result: One canonical definition per entity, append-only analytics, and zero redundant masters.


Domain & URL Classification Pipeline

How domains and URLs enter the system, get classified, and connect to brands.

Entry Points

Domains and URLs enter from multiple sources. Each source provides different data:

| Entry Point | Provides Domain | Provides URL | Provides Brand Name | Notes |
| --- | --- | --- | --- | --- |
| User adds Asset (UI) | Yes (resolved) | No | No | UI calls /api/util/resolve-domain first |
| User adds Competitor (UI) | Yes (resolved) | No | No | Same resolution flow |
| App Store Crawl | Yes (developer_url) | Yes | Yes (developer_name) | Primary brand source |
| Ranked Keywords API | Yes (domain, main_domain) | Yes (url) | Yes (website_name) | Rich data including backlinks_info |
| Backlinks API | Yes (extracted) | Yes (url_from, url_to) | No | Source + target URLs |

Brand Name Resolution

When we need a brand name for a domain:

```mermaid
flowchart TD
    A[Need brand name for domain] --> B{Have website_name<br/>from DataForSEO?}
    B -->|Yes| C[Use website_name]
    B -->|No| D{Have developer_name<br/>from App Store?}
    D -->|Yes| E[Use developer_name]
    D -->|No| F{Brand already exists<br/>for this domain?}
    F -->|Yes| G[Use existing brand.name]
    F -->|No| H[Use domain as brand name<br/>e.g. 'spotify.com']

    C --> I[ensureBrand with name]
    E --> I
    G --> J[Link to existing brand]
    H --> I

    I --> K[brands table]
    J --> K
```
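The fallback chain above reduces to a one-line pure function. Field names (`website_name`, `developer_name`) come from this doc; the exact signature is an assumption:

```javascript
// First non-empty wins: DataForSEO name, then App Store developer
// name, then an existing brand's name, then the domain itself.
function resolveBrandName({ websiteName, developerName, existingBrandName, domain }) {
  return websiteName || developerName || existingBrandName || domain;
}
```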

Domain Resolution Flow (UI Layer)

When a user manually enters a domain, the UI resolves www vs non-www before saving:

```mermaid
sequenceDiagram
    participant UI as React UI
    participant API as Worker API
    participant DB as D1 Database

    UI->>API: POST /api/util/resolve-domain<br/>{"domain": "mermaidchart.com"}
    API->>API: HTTP HEAD to domain<br/>Check for 301/302/307/308
    API-->>UI: {"resolved": "www.mermaidchart.com", "redirected": true}
    UI->>UI: Show user: "Using www.mermaidchart.com"
    UI->>API: POST /api/assets<br/>{"domain": "www.mermaidchart.com"}
    API->>DB: ensureDomain("www.mermaidchart.com")
```
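A sketch of the resolve-domain check: a HEAD request with manual redirect handling, returning the Location host when the origin redirects to its www (or non-www) form. Helper names are assumptions:

```javascript
// Pure decision step, split out so it is easy to test without network.
function resolveFromResponse(domain, status, location) {
  const redirected = [301, 302, 307, 308].includes(status) && Boolean(location);
  if (!redirected) return { resolved: domain, redirected: false };
  return { resolved: new URL(location).hostname, redirected: true };
}

// Illustrative endpoint body for POST /api/util/resolve-domain.
async function resolveDomain(domain) {
  const res = await fetch(`https://${domain}/`, {
    method: "HEAD",
    redirect: "manual", // inspect the redirect instead of following it
  });
  return resolveFromResponse(domain, res.status, res.headers.get("location"));
}
```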

Complete Data Flow

```mermaid
flowchart TB
    subgraph EntryPoints["Entry Points"]
        UI_Asset["User adds Asset<br/>(UI resolves domain first)"]
        UI_Competitor["User adds Competitor<br/>(UI resolves domain first)"]
        AppCrawl["App Store Crawl<br/>(developer_url, developer_name)"]
        RankedKW["Ranked Keywords API<br/>(domain, url, website_name, backlinks_info)"]
        BacklinksAPI["Backlinks API<br/>(url_from, url_to)"]
    end

    subgraph Resolution["Resolution Layer"]
        ResolveDomain["POST /api/util/resolve-domain<br/>HTTP HEAD → www vs non-www"]
    end

    subgraph Processing["Processing Layer"]
        EnsureDomain["ensureDomain()<br/>Creates/updates domain record"]
        EnsureUrl["ensureUrl()<br/>Creates URL, triggers classification"]
        EnsureBrand["ensureBrand()<br/>Creates brand with fallback name"]

        BrandLogic{"Brand name<br/>available?"}
        BrandFallback["Use domain as brand name"]
    end

    subgraph Classification["Classification Layer"]
        DomainClassify["Domain Classification<br/>(rules → vectorize → LLM)"]
        UrlClassify["URL Classification<br/>(inline for targets, queued for sources)"]
        KeywordClassify["Keyword Classification<br/>(rules → vectorize)"]
    end

    subgraph Storage["Storage Layer"]
        Domains[("domains<br/>domain_type, channel_bucket")]
        Urls[("urls<br/>page_type, classification_source")]
        Brands[("brands<br/>name, primary_domain_id")]
        Keywords[("keywords<br/>funnel_stage, intent_type")]
    end

    %% Entry flows
    UI_Asset --> ResolveDomain
    UI_Competitor --> ResolveDomain
    ResolveDomain --> EnsureDomain

    AppCrawl --> EnsureDomain
    AppCrawl --> EnsureBrand

    RankedKW --> EnsureDomain
    RankedKW --> EnsureUrl
    RankedKW --> BrandLogic
    BrandLogic -->|website_name exists| EnsureBrand
    BrandLogic -->|no website_name| BrandFallback
    BrandFallback --> EnsureBrand

    BacklinksAPI --> EnsureUrl

    %% Processing to classification
    EnsureDomain --> DomainClassify
    EnsureUrl --> UrlClassify

    %% Classification to storage
    DomainClassify --> Domains
    UrlClassify --> Urls
    EnsureBrand --> Brands
    KeywordClassify --> Keywords

    %% Domain is the hub
    Domains -.->|domain_id| Urls
    Domains -.->|primary_domain_id| Brands
```

Classification Queue Flow (Optimized)

Instead of queueing each URL individually, we batch by domain:

```mermaid
flowchart TB
    subgraph Ingestion["Data Ingestion"]
        Backlinks["Fetch 100 backlinks<br/>from DataForSEO"]
    end

    subgraph BatchPrep["Batch Preparation (Inline)"]
        Store["storeBacklinks()"]
        GroupByDomain["Group URLs by domain"]
        CheckCache{"Domain<br/>cached?"}

        Store --> GroupByDomain
        GroupByDomain --> CheckCache
    end

    subgraph Queuing["Smart Queuing"]
        InlineClassify["Classify inline<br/>(target URLs - FREE)"]
        QueueBatch["Queue ONE message per domain<br/>{domain, urls: [...]}"]

        CheckCache -->|"Yes + simple URL"| InlineClassify
        CheckCache -->|"No or complex URL"| QueueBatch
    end

    subgraph Consumer["Queue Consumer"]
        Receive["Receive batched message"]
        ClassifyDomain["Classify domain ONCE"]
        ClassifyUrls["Classify all URLs<br/>using domain cache"]

        Receive --> ClassifyDomain
        ClassifyDomain --> ClassifyUrls
    end

    subgraph Results["Results"]
        DomainTable[("domains")]
        UrlTable[("urls")]

        ClassifyDomain --> DomainTable
        ClassifyUrls --> UrlTable
    end

    Backlinks --> Store
    InlineClassify --> UrlTable
    QueueBatch --> Receive
```
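The batching step is the core of this optimization. A minimal sketch, using the hostname as the grouping key (the real grouping logic may normalize further):

```javascript
// Group backlink URLs by domain and emit one queue message per domain
// ({domain, urls: [...]}) instead of one message per URL.
function batchByDomain(urls) {
  const groups = new Map();
  for (const url of urls) {
    const domain = new URL(url).hostname;
    if (!groups.has(domain)) groups.set(domain, []);
    groups.get(domain).push(url);
  }
  return [...groups.entries()].map(([domain, list]) => ({ domain, urls: list }));
}
```

With 100 backlinks spread across, say, 20 domains, this sends 20 queue messages rather than 100, and each consumer invocation classifies its domain once.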

Classification Stages & Costs

The domain classifier uses a 7-stage cost-optimized pipeline - FREE stages run first:

```mermaid
flowchart LR
    subgraph FREE["FREE Stages"]
        S0[0. Cache] --> S1[1. Rules]
        S1 --> S1_5[1.5 Google Ads<br/>Categories]
        S1_5 --> S2[2. Vectorize]
        S2 --> S3[3. Low-Noise<br/>Crawl]
    end

    subgraph PAID["PAID (if needed)"]
        S3 --> S4[4. Instant Pages]
        S4 --> S4_5[4.5 Domain Patterns]
        S4_5 --> S5[5. LLM]
    end

    S3 -->|"≥70%"| DONE[Done]
    S4 -->|"≥70%"| DONE
    S5 --> DONE

    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S5 fill:#ffcdd2
```

| Stage | Domain | URL (Source) | URL (Target) | Keyword | Cost |
| --- | --- | --- | --- | --- | --- |
| 0. Cache Check | ✓ | ✓ | - | - | FREE |
| 1. Rules Engine | ✓ | ✓ | ✓ | ✓ | FREE |
| 1.5 Google Ads Categories | ✓ | - | - | - | FREE |
| 2. Vectorize | ✓ | ✓ | - | ✓ | FREE |
| 3. Low-Noise Crawl | ✓ | - | - | - | FREE |
| 4. Instant Pages | ✓ | ✓ | - | - | $0.000125 |
| 4.5 Domain Patterns | ✓ | - | - | - | FREE |
| 5. LLM Fallback | ✓ | ✓ | - | - | ~$0.0001 |

Low-Noise Crawl (Stage 3) - The key cost optimization:

  • HEAD request + partial GET (Range: 0-8KB)
  • Extracts <head> metadata: title, description, og:type, generator
  • CMS detection: WordPress, Shopify, Ghost, Wix, etc.
  • Handles ~70% of domains without needing Instant Pages

Target URLs are always FREE - rules-only classification, no external calls.
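A sketch of the low-noise crawl: fetch only the first 8 KB and pull classification signals out of `<head>` with cheap regexes. The extracted signals mirror the bullets above; the parsing here is illustrative, not the production extractor:

```javascript
// Pull title/description/og:type/generator from a partial HTML head.
function extractHeadSignals(html) {
  const pick = (re) => (html.match(re) || [])[1]?.trim() ?? null;
  return {
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    description: pick(/<meta[^>]+name=["']description["'][^>]+content=["']([^"']*)/i),
    ogType: pick(/<meta[^>]+property=["']og:type["'][^>]+content=["']([^"']*)/i),
    generator: pick(/<meta[^>]+name=["']generator["'][^>]+content=["']([^"']*)/i),
  };
}

async function lowNoiseCrawl(domain) {
  const res = await fetch(`https://${domain}/`, {
    headers: { range: "bytes=0-8191" }, // partial GET: first 8 KB only
  });
  return extractHeadSignals(await res.text());
}
```

The generator tag is what makes CMS detection cheap: WordPress, Shopify, Ghost, and Wix all announce themselves in the first few kilobytes.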

Key Optimizations

  1. Domain Resolution in UI - User sees the resolved domain before saving; the backend uses it as-is
  2. Target URLs Inline - No queue hop for FREE classification
  3. Batch by Domain - One queue message per domain, not per URL
  4. Domain Cache - Classify domain once, reuse for all its URLs
  5. website_name → Brand - Auto-populate brands from DataForSEO data
  6. Domain as Fallback Brand - When no brand name available, use domain itself