Skip to main content

Database Schema Reference

Complete database schema documentation for D1 and ClickHouse.


Table of Contents


D1 Schema

D1 stores operational data: app metadata, categories, rankings, crawl state, and jobs.

Apps

Primary app metadata table.

CREATE TABLE apps (
id TEXT PRIMARY KEY,
platform TEXT NOT NULL,
title TEXT,
subtitle TEXT,
developer_name TEXT,
developer_id TEXT,
icon_url TEXT,
rating REAL,
rating_count INTEGER,
price REAL,
is_free INTEGER,
primary_category TEXT,
description TEXT,
description_full TEXT,
website_url TEXT,
store_url TEXT,

-- Google Play Enhanced Fields
download_count TEXT,
download_count_exact INTEGER,
in_app_purchase_range TEXT,
iap_price_min REAL,
iap_price_max REAL,
iap_price_avg REAL,
feature_tags TEXT,

-- Apple-Specific Fields
bundle_id TEXT,
version TEXT,
size TEXT,
age_rating TEXT,
screenshots TEXT,

-- Derived Fields
best_rank INTEGER,
total_categories INTEGER,
total_shelves INTEGER,
total_positions INTEGER,
weighted_score REAL,

-- Metadata
discovered_from TEXT,
crawl_project_id TEXT,
created_at INTEGER,
updated_at INTEGER,
last_crawled_at INTEGER
);

CREATE INDEX idx_apps_platform ON apps(platform);
CREATE INDEX idx_apps_primary_category ON apps(primary_category);
CREATE INDEX idx_apps_best_rank ON apps(best_rank);
CREATE INDEX idx_apps_developer ON apps(developer_id);
CREATE INDEX idx_apps_crawl_project ON apps(crawl_project_id);

Field Descriptions:

Core Fields:

  • id: App identifier (bundle ID for Google Play, numeric ID for Apple)
  • platform: google_play or apple
  • title: App name
  • subtitle: Short tagline (Apple) or NULL (Google Play)
  • developer_name: Developer/publisher name
  • developer_id: Developer identifier
  • icon_url: App icon URL
  • rating: Average rating (0-5)
  • rating_count: Number of ratings
  • price: App price (0 for free)
  • is_free: Boolean flag (1 = free, 0 = paid)
  • primary_category: Primary category ID
  • description: Short description (first 2000 chars)
  • description_full: Complete description
  • website_url: Developer website
  • store_url: App store URL

Google Play Enhanced Fields:

  • download_count: Display string (e.g., "1B+", "500M+")
  • download_count_exact: Actual download count (e.g., 3984720514)
  • in_app_purchase_range: IAP range string (e.g., "$0.05 - $1,209.99 per item")
  • iap_price_min: Minimum IAP price
  • iap_price_max: Maximum IAP price
  • iap_price_avg: Average IAP price
  • feature_tags: JSON array of feature tags (e.g., ["#1 top grossing social"])

Apple-Specific Fields:

  • bundle_id: Apple bundle identifier
  • version: App version number
  • size: App file size
  • age_rating: Age rating (e.g., "4+", "12+", "17+")
  • screenshots: JSON array of screenshot URLs

Derived Fields:

  • best_rank: Best position across all categories (lowest number = best)
  • total_categories: Number of categories app appears in
  • total_shelves: Number of unique shelves (e.g., Top Free, Top Paid)
  • total_positions: Total shelf placements across all categories
  • weighted_score: Ranking algorithm score (see Weighted Score)

Metadata:

  • discovered_from: category, recommendations, search
  • crawl_project_id: Associated crawl project ID
  • created_at: Unix timestamp (milliseconds)
  • updated_at: Last update timestamp
  • last_crawled_at: Last crawl timestamp

App Store Categories

Category taxonomy for Apple App Store and Google Play.

CREATE TABLE app_store_categories (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
platform TEXT NOT NULL,
device TEXT,
category_type TEXT,
parent_ids TEXT,
primary_parent_id TEXT,
path TEXT,
level INTEGER,
created_at INTEGER,
updated_at INTEGER
);

CREATE INDEX idx_categories_platform ON app_store_categories(platform);
CREATE INDEX idx_categories_parent ON app_store_categories(primary_parent_id);
CREATE INDEX idx_categories_type ON app_store_categories(category_type);

Field Descriptions:

  • id: Category identifier (e.g., google_play_weather, apple_games_action)
  • name: Display name (e.g., "Weather", "Action Games")
  • platform: google_play or apple
  • device: phone, iphone, ipad, or NULL (all devices)
  • category_type: category, subcategory, grouping, room, story, chart
  • parent_ids: JSON array of parent category IDs (source of truth)
  • primary_parent_id: First parent (derived from parent_ids[0])
  • path: Display path (e.g., "Games > Action")
  • level: Depth in hierarchy (0 = root, 1 = child, etc.)

Category Hierarchy Model:

Categories are organized in a hierarchy with multiple inheritance support:

  • Source of Truth: parent_ids array contains all parent category IDs
  • Convenience Field: primary_parent_id is derived from parent_ids[0]
  • Relationships: Stored in category_relationships table for querying

Example:

{
"id": "apple_games_action_adventure",
"name": "Action Adventure",
"parent_ids": ["apple_games", "apple_games_action"],
"primary_parent_id": "apple_games"
}

Category Types:

  • category: Top-level category (e.g., "Games", "Social")
  • subcategory: Sub-category (e.g., "Action Games")
  • grouping: Apple grouping container
  • room: Apple room (curated collection)
  • story: Apple story (editorial feature)
  • chart: Chart type (e.g., "Top Free", "Top Paid")

Category Relationships

Explicit parent-child relationships for category hierarchy.

CREATE TABLE category_relationships (
id INTEGER PRIMARY KEY AUTOINCREMENT,
category_id TEXT NOT NULL,
parent_category_id TEXT NOT NULL,
platform TEXT NOT NULL,
device TEXT,
relationship_type TEXT,
created_at INTEGER,
UNIQUE(category_id, parent_category_id)
);

CREATE INDEX idx_relationships_category ON category_relationships(category_id);
CREATE INDEX idx_relationships_parent ON category_relationships(parent_category_id);

Field Descriptions:

  • category_id: Child category ID
  • parent_category_id: Parent category ID
  • relationship_type: primary, secondary, chart, editorial

Usage:

Query all parents of a category:

SELECT parent_category_id 
FROM category_relationships
WHERE category_id = 'apple_games_action';

Query all children of a category:

SELECT category_id 
FROM category_relationships
WHERE parent_category_id = 'apple_games';

App Category Rankings

App positions in category charts.

CREATE TABLE app_category_rankings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
app_id TEXT NOT NULL,
platform TEXT NOT NULL,
category_id TEXT NOT NULL,
app_rank INTEGER,
category_position INTEGER,
shelf_name TEXT,
category_name TEXT,
category_type TEXT,
crawl_date INTEGER,
created_at INTEGER,
UNIQUE(app_id, category_id, shelf_name, crawl_date)
);

CREATE INDEX idx_rankings_app ON app_category_rankings(app_id);
CREATE INDEX idx_rankings_category ON app_category_rankings(category_id);
CREATE INDEX idx_rankings_rank ON app_category_rankings(app_rank);

Field Descriptions:

  • app_id: App identifier
  • category_id: Category identifier
  • app_rank: Position in category (1 = #1)
  • category_position: Position within specific shelf
  • shelf_name: Shelf name (e.g., "Top Free", "Top Paid", "Top Grossing")
  • category_name: Display name of category
  • category_type: Type of category
  • crawl_date: Date of crawl (Unix timestamp, milliseconds)

Example:

{
"app_id": "com.instagram.android",
"category_id": "google_play_social",
"app_rank": 2,
"category_position": 2,
"shelf_name": "Top Free",
"category_name": "Social"
}

App Recommendations

Apps recommended on other app pages.

CREATE TABLE app_recommendations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
app_id TEXT NOT NULL,
recommended_app_id TEXT NOT NULL,
position INTEGER,
title TEXT,
icon_url TEXT,
rating REAL,
platform TEXT,
created_at INTEGER,
UNIQUE(app_id, recommended_app_id, position)
);

CREATE INDEX idx_recommendations_app ON app_recommendations(app_id);
CREATE INDEX idx_recommendations_recommended ON app_recommendations(recommended_app_id);

Field Descriptions:

  • app_id: App that shows the recommendation
  • recommended_app_id: App being recommended
  • position: Position in recommendations list (1-based)

Usage:

Query recommendations shown on an app page:

SELECT * FROM app_recommendations WHERE app_id = 'com.example.app';

Query how many times an app is recommended:

SELECT COUNT(*) FROM app_recommendations WHERE recommended_app_id = 'com.example.app';

Crawl Management Tables

Crawl Projects

CREATE TABLE crawl_projects (
id TEXT PRIMARY KEY,
platform TEXT NOT NULL,
device TEXT,
status TEXT,
categories_queued INTEGER,
related_app_depth INTEGER,
created_at INTEGER,
completed_at INTEGER
);

Crawl Runs

CREATE TABLE crawl_runs (
id TEXT PRIMARY KEY,
project_id TEXT,
category_id TEXT,
status TEXT,
apps_discovered INTEGER,
error TEXT,
started_at INTEGER,
completed_at INTEGER
);

Jobs

Generic job queue tracking.

CREATE TABLE jobs (
id TEXT PRIMARY KEY,
type TEXT NOT NULL,
status TEXT NOT NULL,
payload TEXT,
result TEXT,
error TEXT,
retries INTEGER,
created_at INTEGER,
started_at INTEGER,
completed_at INTEGER
);

CREATE INDEX idx_jobs_type ON jobs(type);
CREATE INDEX idx_jobs_status ON jobs(status);

Field Descriptions:

  • id: Job UUID
  • type: Job type (e.g., harvest_keywords, serp_tracking, app_crawl)
  • status: pending, running, completed, failed
  • payload: JSON job parameters
  • result: JSON job result
  • error: Error message if failed
  • retries: Number of retry attempts

Derived Fields

Weighted Score

The weighted_score is a ranking quality metric calculated as:

weighted_score = 1.0 - (best_rank / 500)
  • Apps ranked #1 → score = 0.998
  • Apps ranked #50 → score = 0.90
  • Apps ranked #500 → score = 0.00

Used for sorting apps by overall ranking strength.


ClickHouse Schema

ClickHouse stores analytics and time-series data.

Keyword Snapshots

Historical keyword metrics.

CREATE TABLE keyword_snapshots (
keyword String,
project_id String,
search_volume UInt32,
difficulty UInt8,
cpc Float32,
competition Float32,
trend Array(UInt32),
captured_at DateTime,
location_code UInt32,
language_code String
) ENGINE = MergeTree()
ORDER BY (project_id, keyword, captured_at);

SERP Runs

SERP tracking results.

CREATE TABLE serp_runs (
run_id String,
keyword String,
project_id String,
url String,
position UInt16,
title String,
snippet String,
tracked_at DateTime,
location_code UInt32,
device String
) ENGINE = MergeTree()
ORDER BY (project_id, keyword, tracked_at);

App Analytics

App ranking history and metrics.

CREATE TABLE app_analytics (
app_id String,
platform String,
category_id String,
rank UInt16,
rating Float32,
rating_count UInt32,
download_count_exact UInt64,
captured_at DateTime
) ENGINE = MergeTree()
ORDER BY (app_id, captured_at);

Data Ownership

Different storage layers own different types of data:

StorePurposeTables
Base44Canonical entitiesKeywords, Categories, Projects, relationships
ClickHouseAnalytics, time-serieskeyword_snapshots, serp_runs, app_analytics
D1Operational stateapps, categories, rankings, jobs, crawl state
KVRun state, cachingDFS_RUNS, DFS_BUDGETS, category mappings
R2Raw payloadsHTML snapshots from scraping

See Data Architecture for detailed responsibilities.


Migration Notes

  • ClickHouse Schema: Initialize with GET /init/clickhouse
  • D1 Migrations: Located in migrations/ directory
  • Migration Status: Track migration completeness via ClickHouse tables

Schema Updates

When adding new fields:

  1. Create migration file in migrations/
  2. Update this documentation
  3. Update TypeScript types if applicable
  4. Deploy with wrangler deploy
  5. Run migration with appropriate endpoint

See Also