ClickHouse Integration
Complete guide to ClickHouse setup, schema, and usage patterns for analytics storage.
Overview
ClickHouse is a columnar database optimized for analytical queries and time-series data. RankFabric uses ClickHouse for:
- Keyword metrics snapshots - Historical search volume, CPC, competition
- SERP tracking results - Daily ranking positions and changes
- App analytics - App ranking history and metrics over time
- Trend analysis - Monthly search patterns and seasonal data
Why ClickHouse?
- Extremely fast analytical queries
- Efficient compression for time-series data
- Append-only writes (no updates needed)
- Handles billions of rows easily
Configuration
Environment Variables
Set in wrangler.toml:
[vars]
ENABLE_CLICKHOUSE = "true"
CLICKHOUSE_AUTO_INIT = "true"
Secrets
Store connection details securely:
echo "clickhouse.example.com" | wrangler secret put CLICKHOUSE_HOST
echo "default" | wrangler secret put CLICKHOUSE_USER
echo "your-password" | wrangler secret put CLICKHOUSE_PASSWORD
echo "rankfabric" | wrangler secret put CLICKHOUSE_DATABASE
Recommended Provider: ClickHouse Cloud
Schema
Keyword Snapshots
Historical keyword metrics from DataForSEO.
CREATE TABLE keyword_snapshots (
keyword String,
project_id String,
organization_id String,
search_volume UInt32,
difficulty UInt8,
cpc Float32,
competition Float32,
intent String,
trend Array(UInt32),
monthly_searches Array(Tuple(month UInt8, year UInt16, volume UInt32)),
captured_at DateTime,
location_code UInt32,
language_code String,
source String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(captured_at)
ORDER BY (organization_id, project_id, keyword, captured_at)
SETTINGS index_granularity = 8192;
Field Descriptions:
- keyword: Keyword text (lowercase, normalized)
- project_id: Associated project (from Base44)
- organization_id: Customer/tenant ID
- search_volume: Monthly search volume
- difficulty: SEO difficulty (0-100)
- cpc: Cost-per-click in USD
- competition: Paid search competition (0-1)
- intent: Search intent (informational, commercial, transactional)
- trend: Last 12 months of search volume
- monthly_searches: Detailed 72-month history
- captured_at: When the metric was captured
- location_code: DataForSEO location code
- language_code: Language (e.g., en)
- source: Data source (dataforseo, manual)
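To make the mapping concrete, here is a hedged sketch of one row as a worker might shape it before enqueueing it for ingestion. The values are illustrative, and the CLICKHOUSE_QUEUE binding name is an assumption (the ingestion path is described under Ingestion Queue below):
// Illustrative only: one keyword_snapshots row shaped for ingestion.
const row = {
  keyword: 'seo tools',               // lowercase, normalized
  project_id: 'proj_123',
  organization_id: 'org_123',
  search_volume: 27100,
  difficulty: 62,                     // 0-100
  cpc: 4.1,                           // USD
  competition: 0.78,                  // 0-1
  intent: 'commercial',
  trend: [22000, 24000, 27100],       // last 12 months (truncated here)
  monthly_searches: [],               // 72-month history when available
  captured_at: '2024-06-01 00:00:00', // ClickHouse DateTime format
  location_code: 2840,                // DataForSEO location code (2840 = United States)
  language_code: 'en',
  source: 'dataforseo',
};
// Assumed producer binding name; see Ingestion Queue below.
await env.CLICKHOUSE_QUEUE.send(row);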
Example Query:
Get keyword trend over time:
SELECT
keyword,
toDate(captured_at) as date,
search_volume,
cpc
FROM keyword_snapshots
WHERE project_id = 'proj_123'
AND keyword = 'seo tools'
ORDER BY date DESC
LIMIT 30;
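The same query can be issued from a worker through the client library described under Client Library below (ch is the client instance; the parameter-binding syntax is shown there):
const rows = await ch.query(`
  SELECT keyword, toDate(captured_at) as date, search_volume, cpc
  FROM keyword_snapshots
  WHERE project_id = {project_id:String}
    AND keyword = {keyword:String}
  ORDER BY date DESC
  LIMIT 30
`, { project_id: 'proj_123', keyword: 'seo tools' });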
SERP Runs
SERP tracking results and position history.
CREATE TABLE serp_runs (
run_id String,
keyword String,
project_id String,
organization_id String,
url String,
domain String,
position UInt16,
previous_position UInt16,
position_change Int16,
title String,
snippet String,
tracked_at DateTime,
location_code UInt32,
language_code String,
device String,
serp_features Array(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(tracked_at)
ORDER BY (organization_id, project_id, keyword, tracked_at)
SETTINGS index_granularity = 8192;
Field Descriptions:
- run_id: Unique run identifier
- keyword: Tracked keyword
- url: Ranking URL
- domain: Extracted domain
- position: Current position (1-100)
- previous_position: Position from the last run
- position_change: Change since last run (+/-)
- title: Page title in the SERP
- snippet: Meta description snippet
- tracked_at: When the ranking was checked
- device: desktop, mobile, or tablet
- serp_features: Features shown (e.g., ["featured_snippet", "people_also_ask"])
Example Query:
Get ranking history for a domain:
SELECT
keyword,
toDate(tracked_at) as date,
position,
position_change
FROM serp_runs
WHERE project_id = 'proj_123'
AND domain = 'example.com'
AND tracked_at > now() - INTERVAL 30 DAY
ORDER BY date DESC, position ASC;
App Analytics
App ranking history and metrics.
CREATE TABLE app_analytics (
app_id String,
platform String,
category_id String,
rank UInt16,
shelf_name String,
rating Float32,
rating_count UInt32,
download_count_exact UInt64,
price Float32,
captured_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(captured_at)
ORDER BY (platform, app_id, captured_at)
SETTINGS index_granularity = 8192;
Field Descriptions:
- app_id: App identifier
- platform: google_play or apple
- category_id: Category where ranked
- rank: Position in the category
- shelf_name: Chart type (e.g., "Top Free")
- rating: Average rating (0-5)
- rating_count: Number of ratings
- download_count_exact: Exact download count (Google Play only)
- price: App price
- captured_at: When the data was captured
Example Query:
Track app ranking over time:
SELECT
toDate(captured_at) as date,
category_id,
rank,
rating,
rating_count
FROM app_analytics
WHERE app_id = 'com.instagram.android'
AND platform = 'google_play'
AND captured_at > now() - INTERVAL 90 DAY
ORDER BY date DESC, rank ASC;
Ingestion Queue
ClickHouse writes are batched through a Cloudflare Queue for efficiency.
Queue Architecture
Worker → clickhouse-ingestion Queue → Batch Consumer → ClickHouse
Queue: clickhouse-ingestion
Batch Size: 100 records
Timeout: 30 seconds
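The corresponding wrangler.toml wiring might look like this. The CLICKHOUSE_QUEUE producer binding name is an assumption; the consumer settings mirror the batch size and timeout above:
# Assumed producer binding name
[[queues.producers]]
binding = "CLICKHOUSE_QUEUE"
queue = "clickhouse-ingestion"

[[queues.consumers]]
queue = "clickhouse-ingestion"
max_batch_size = 100
max_batch_timeout = 30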
Batch Processing
The consumer batches multiple records into a single INSERT:
async function insertBatch(records, env) {
  const ch = createClickHouseClient(env);
  // formatRecord renders one record as a parenthesized VALUES tuple
  const values = records.map(formatRecord).join(',');
  // One INSERT per batch keeps MergeTree part counts low
  await ch.insert(`
    INSERT INTO keyword_snapshots
    (keyword, project_id, search_volume, ...)
    VALUES ${values}
  `);
}
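For reference, a minimal sketch of the queue consumer entry point in src/queue/clickhouse-ingestion.js, assuming each message body is one pre-formatted record (the Workers queue() handler shape is standard; the wiring here is illustrative):
// Illustrative consumer: Workers invokes queue() with up to
// max_batch_size messages per batch.
export default {
  async queue(batch, env) {
    const records = batch.messages.map((msg) => msg.body);
    await insertBatch(records, env);
    // Acknowledge only after the INSERT succeeds;
    // unacknowledged messages are retried automatically.
    batch.ackAll();
  },
};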
Benefits:
- Reduces ClickHouse connections
- Improves write throughput
- Lower CPU usage
Client Library
Create Client
import { createClickHouseClient } from './lib/clickhouse.js';
const ch = createClickHouseClient(env);
Execute Query
const results = await ch.query(`
SELECT keyword, search_volume
FROM keyword_snapshots
WHERE project_id = {project_id:String}
LIMIT 10
`, {
project_id: 'proj_123'
});
Insert Data
await ch.insert(`
INSERT INTO keyword_snapshots (keyword, search_volume, captured_at)
VALUES ('seo tools', 27100, now())
`);
Batch Insert
const records = [
{ keyword: 'seo tools', volume: 27100 },
{ keyword: 'keyword research', volume: 8100 }
];
const values = records.map(r =>
  // Escape single quotes so user-supplied keywords cannot break the SQL
  `('${r.keyword.replace(/'/g, "\\'")}', ${r.volume}, now())`
).join(',');
await ch.insert(`
INSERT INTO keyword_snapshots (keyword, search_volume, captured_at)
VALUES ${values}
`);
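Under the hood, the client in src/lib/clickhouse.js talks to ClickHouse over HTTP. The actual implementation is not shown here; the following is a minimal sketch of what a fetch-based client could look like, assuming ClickHouse Cloud's HTTPS port 8443, Basic auth, and param_-prefixed query parameters backing the {name:Type} placeholders:
// Minimal illustrative client over ClickHouse's HTTP interface.
export function createClickHouseClient(env) {
  const base = `https://${env.CLICKHOUSE_HOST}:8443/`;
  const auth = 'Basic ' + btoa(`${env.CLICKHOUSE_USER}:${env.CLICKHOUSE_PASSWORD}`);

  async function exec(sql, params = {}) {
    const url = new URL(base);
    url.searchParams.set('database', env.CLICKHOUSE_DATABASE);
    // {name:Type} placeholders bind via param_<name> query parameters
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(`param_${key}`, String(value));
    }
    const res = await fetch(url, {
      method: 'POST',
      headers: { Authorization: auth },
      body: sql,
    });
    if (!res.ok) throw new Error(`ClickHouse HTTP ${res.status}: ${await res.text()}`);
    // A real client would append FORMAT JSON and parse the response
    return res.text();
  }

  return { query: exec, insert: exec };
}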
Schema Initialization
Auto-initialization
If CLICKHOUSE_AUTO_INIT=true, the schema is initialized on the first request:
// In src/index.js
if (env.ENABLE_CLICKHOUSE === 'true' && env.CLICKHOUSE_AUTO_INIT === 'true') {
const initialized = await env.DFS_RUNS.get('clickhouse-initialized');
if (!initialized) {
const ch = createClickHouseClient(env);
await ch.validateSchema();
await env.DFS_RUNS.put('clickhouse-initialized', 'true');
}
}
Manual Initialization
curl https://your-worker.workers.dev/init/clickhouse
Creates:
- keyword_snapshots table
- serp_runs table
- app_analytics table
Query Patterns
Time-Series Analysis
Get daily keyword volumes:
SELECT
toDate(captured_at) as date,
avg(search_volume) as avg_volume,
max(search_volume) as peak_volume
FROM keyword_snapshots
WHERE keyword = 'seo tools'
AND captured_at > now() - INTERVAL 90 DAY
GROUP BY date
ORDER BY date DESC;
Aggregations
Top keywords by search volume:
SELECT
keyword,
max(search_volume) as volume,
avg(cpc) as avg_cpc
FROM keyword_snapshots
WHERE project_id = 'proj_123'
GROUP BY keyword
ORDER BY volume DESC
LIMIT 50;
Window Functions
Calculate position changes:
SELECT
keyword,
toDate(tracked_at) as date,
position,
lagInFrame(position) OVER (
PARTITION BY keyword
ORDER BY tracked_at
) as previous_position
FROM serp_runs
WHERE project_id = 'proj_123'
ORDER BY keyword, date DESC;
Array Operations
Filter by intent, and use array functions on the trend column to keep only keywords whose 12-month trend is rising (negative indices count from the end of the array):
SELECT
  keyword,
  search_volume,
  trend[-1] as latest_volume
FROM keyword_snapshots
WHERE intent = 'commercial'
AND project_id = 'proj_123'
AND trend[-1] > trend[1]
ORDER BY search_volume DESC;
Performance Optimization
Partitioning
Tables are partitioned by month:
PARTITION BY toYYYYMM(captured_at)
Benefits:
- Faster queries on recent data
- Efficient data retention (drop old partitions)
- Parallel processing
Example: Drop data older than 2 years:
ALTER TABLE keyword_snapshots
DROP PARTITION '202311';
Indexing
Primary key order matters:
ORDER BY (organization_id, project_id, keyword, captured_at)
Query optimization:
-- Fast (uses primary key)
SELECT * FROM keyword_snapshots
WHERE organization_id = 'org_123'
AND project_id = 'proj_456';
-- Slower (scans all data)
SELECT * FROM keyword_snapshots
WHERE keyword = 'seo tools';
Materialized Views
Create pre-aggregated views for common queries. Note that SummingMergeTree sums every numeric non-key column during background merges, so store sums and counts (not averages) and derive the average at query time:
CREATE MATERIALIZED VIEW keyword_daily_stats
ENGINE = SummingMergeTree()
ORDER BY (project_id, keyword, date)
AS SELECT
  project_id,
  keyword,
  toDate(captured_at) as date,
  sum(search_volume) as total_volume,
  count() as samples
FROM keyword_snapshots
GROUP BY project_id, keyword, date;
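Because parts may not have merged yet, re-aggregate at read time and compute the average from the stored sums:
SELECT
  date,
  sum(total_volume) / sum(samples) as avg_volume
FROM keyword_daily_stats
WHERE project_id = 'proj_123'
AND keyword = 'seo tools'
GROUP BY date
ORDER BY date DESC;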
Migration from D1
Migrate existing D1 data to ClickHouse:
curl -X POST https://your-worker.workers.dev/api/admin/migrate-to-clickhouse
Process:
- Read data from D1 tables
- Transform to ClickHouse format
- Batch insert to ClickHouse
- Verify row counts
- (Optional) Drop D1 tables
Endpoint: src/endpoints/migrate-to-clickhouse.js
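The endpoint's core loop might look roughly like this. This is a sketch under assumptions: the D1 binding is named DB, the source table is keyword_metrics, and records are re-queued through the ingestion queue rather than inserted directly; none of these names are confirmed by the source:
// Illustrative sketch of the D1 → ClickHouse migration loop.
// Assumed names: env.DB (D1 binding), keyword_metrics (D1 table),
// env.CLICKHOUSE_QUEUE (ingestion queue producer binding).
async function migrateKeywords(env) {
  let offset = 0;
  const pageSize = 100; // sendBatch accepts at most 100 messages per call
  let migrated = 0;
  while (true) {
    // Page through D1 to keep memory bounded
    const { results } = await env.DB
      .prepare('SELECT * FROM keyword_metrics LIMIT ? OFFSET ?')
      .bind(pageSize, offset)
      .all();
    if (results.length === 0) break;
    // Reuse the batched ingestion path instead of inserting directly
    await env.CLICKHOUSE_QUEUE.sendBatch(
      results.map((row) => ({ body: row }))
    );
    migrated += results.length;
    offset += pageSize;
  }
  return migrated;
}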
Monitoring & Health
Test Connection
curl https://your-worker.workers.dev/test/clickhouse
Response:
{
"status": "ok",
"clickhouse_connected": true,
"server_version": "23.8.2.7",
"database": "rankfabric"
}
Query Logs
ClickHouse query log:
SELECT
query_start_time,
query_duration_ms,
query,
exception
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_duration_ms > 1000
ORDER BY query_start_time DESC
LIMIT 10;
Table Sizes
Check table sizes:
SELECT
table,
formatReadableSize(sum(bytes)) as size,
sum(rows) as rows
FROM system.parts
WHERE database = 'rankfabric'
GROUP BY table;
Cost Optimization
ClickHouse Cloud Pricing
Approximate costs:
- Development: $0.25/hour (~$180/month)
- Production: $1/hour (~$720/month)
- Storage: $0.10/GB/month
Reduce Costs
- Use partitioning to drop old data:
  ALTER TABLE keyword_snapshots DROP PARTITION '202301';
- Optimize compression:
  OPTIMIZE TABLE keyword_snapshots FINAL;
- Use sampling for exploratory queries (note: SAMPLE requires a SAMPLE BY clause in the table definition, which the schemas above do not declare):
  SELECT * FROM keyword_snapshots SAMPLE 0.1
  WHERE project_id = 'proj_123';
- Right-size your cluster:
  - Start with the smallest instance
  - Scale up based on query performance
Backup & Retention
Backup Strategy
ClickHouse Cloud automatically backs up data daily.
Manual backup:
clickhouse-client --query "SELECT * FROM keyword_snapshots FORMAT CSV" > backup.csv
Retention Policy
Recommended retention:
- Keyword snapshots: 2 years
- SERP runs: 1 year
- App analytics: Indefinite (minimal size)
Implement with TTL:
ALTER TABLE keyword_snapshots
MODIFY TTL captured_at + INTERVAL 2 YEAR;
Troubleshooting
Connection Errors
Error: Connection refused
Fix:
- Verify ClickHouse host is accessible
- Check firewall rules
- Verify credentials
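To isolate network-level problems, hit ClickHouse's HTTP ping endpoint directly (the host and HTTPS port 8443 here are placeholders matching ClickHouse Cloud defaults):
curl https://clickhouse.example.com:8443/ping
# Expected output: Ok.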
Slow Queries
Symptom: Queries take >10 seconds
Fix:
- Add WHERE clauses for organization_id/project_id
- Use DATE filters to reduce scanned rows
- Create materialized views for complex aggregations
Insert Failures
Error: Too many parts
Fix:
-- Merge small parts
OPTIMIZE TABLE keyword_snapshots;
-- Or increase merge settings
ALTER TABLE keyword_snapshots
MODIFY SETTING parts_to_throw_insert = 3000;
File Locations
- Client library: src/lib/clickhouse.js
- Schema definitions: src/lib/clickhouse.js → validateSchema()
- Queue consumer: src/queue/clickhouse-ingestion.js
- Migration tool: src/endpoints/migrate-to-clickhouse.js
See Also
- Data Architecture - Storage layer responsibilities
- Schema Reference - Complete schema documentation
- Diagnostics - Health checks and monitoring
- ClickHouse Official Docs