ClickHouse Integration

Complete guide to ClickHouse setup, schema, and usage patterns for analytics storage.


Overview

ClickHouse is a columnar database optimized for analytical queries and time-series data. RankFabric uses ClickHouse for:

  • Keyword metrics snapshots - Historical search volume, CPC, competition
  • SERP tracking results - Daily ranking positions and changes
  • App analytics - App ranking history and metrics over time
  • Trend analysis - Monthly search patterns and seasonal data

Why ClickHouse?

  • Extremely fast analytical queries
  • Efficient compression for time-series data
  • Append-only writes (no updates needed)
  • Handles billions of rows easily

Configuration

Environment Variables

Set in wrangler.toml:

[vars]
ENABLE_CLICKHOUSE = "true"
CLICKHOUSE_AUTO_INIT = "true"

Secrets

Store connection details securely:

echo "clickhouse.example.com" | wrangler secret put CLICKHOUSE_HOST
echo "default" | wrangler secret put CLICKHOUSE_USER
echo "your-password" | wrangler secret put CLICKHOUSE_PASSWORD
echo "rankfabric" | wrangler secret put CLICKHOUSE_DATABASE

Recommended Provider: ClickHouse Cloud


Schema

Keyword Snapshots

Historical keyword metrics from DataForSEO.

CREATE TABLE keyword_snapshots (
    keyword String,
    project_id String,
    organization_id String,
    search_volume UInt32,
    difficulty UInt8,
    cpc Float32,
    competition Float32,
    intent String,
    trend Array(UInt32),
    monthly_searches Array(Tuple(month UInt8, year UInt16, volume UInt32)),
    captured_at DateTime,
    location_code UInt32,
    language_code String,
    source String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(captured_at)
ORDER BY (organization_id, project_id, keyword, captured_at)
SETTINGS index_granularity = 8192;

Field Descriptions:

  • keyword: Keyword text (lowercase, normalized)
  • project_id: Associated project (from Base44)
  • organization_id: Customer/tenant ID
  • search_volume: Monthly search volume
  • difficulty: SEO difficulty (0-100)
  • cpc: Cost-per-click in USD
  • competition: Paid search competition (0-1)
  • intent: Search intent (informational, commercial, transactional)
  • trend: Last 12 months of search volume
  • monthly_searches: Detailed 72-month history
  • captured_at: When metric was captured
  • location_code: DataForSEO location code
  • language_code: Language (e.g., en)
  • source: Data source (dataforseo, manual)

Example Query:

Get keyword trend over time:

SELECT
    keyword,
    toDate(captured_at) AS date,
    search_volume,
    cpc
FROM keyword_snapshots
WHERE project_id = 'proj_123'
  AND keyword = 'seo tools'
ORDER BY date DESC
LIMIT 30;

SERP Runs

SERP tracking results and position history.

CREATE TABLE serp_runs (
    run_id String,
    keyword String,
    project_id String,
    organization_id String,
    url String,
    domain String,
    position UInt16,
    previous_position UInt16,
    position_change Int16,
    title String,
    snippet String,
    tracked_at DateTime,
    location_code UInt32,
    language_code String,
    device String,
    serp_features Array(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(tracked_at)
ORDER BY (organization_id, project_id, keyword, tracked_at)
SETTINGS index_granularity = 8192;

Field Descriptions:

  • run_id: Unique run identifier
  • keyword: Tracked keyword
  • url: Ranking URL
  • domain: Extracted domain
  • position: Current position (1-100)
  • previous_position: Position from last run
  • position_change: Change since last run (+/-)
  • title: Page title in SERP
  • snippet: Meta description snippet
  • tracked_at: When ranking was checked
  • device: desktop, mobile, tablet
  • serp_features: Features shown (e.g., ["featured_snippet", "people_also_ask"])

Example Query:

Get ranking history for a domain:

SELECT
    keyword,
    toDate(tracked_at) AS date,
    position,
    position_change
FROM serp_runs
WHERE project_id = 'proj_123'
  AND domain = 'example.com'
  AND tracked_at > now() - INTERVAL 30 DAY
ORDER BY date DESC, position ASC;

App Analytics

App ranking history and metrics.

CREATE TABLE app_analytics (
    app_id String,
    platform String,
    category_id String,
    rank UInt16,
    shelf_name String,
    rating Float32,
    rating_count UInt32,
    download_count_exact UInt64,
    price Float32,
    captured_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(captured_at)
ORDER BY (platform, app_id, captured_at)
SETTINGS index_granularity = 8192;

Field Descriptions:

  • app_id: App identifier
  • platform: google_play or apple
  • category_id: Category where ranked
  • rank: Position in category
  • shelf_name: Chart type (e.g., "Top Free")
  • rating: Average rating (0-5)
  • rating_count: Number of ratings
  • download_count_exact: Exact download count (Google Play)
  • price: App price
  • captured_at: When data was captured

Example Query:

Track app ranking over time:

SELECT
    toDate(captured_at) AS date,
    category_id,
    rank,
    rating,
    rating_count
FROM app_analytics
WHERE app_id = 'com.instagram.android'
  AND platform = 'google_play'
  AND captured_at > now() - INTERVAL 90 DAY
ORDER BY date DESC, rank ASC;

Ingestion Queue

ClickHouse writes are batched through Cloudflare Queues for efficiency.

Queue Architecture

Worker → clickhouse-ingestion Queue → Batch Consumer → ClickHouse

Queue: clickhouse-ingestion

Batch Size: 100 records

Timeout: 30 seconds
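
Assuming the standard Cloudflare Queues consumer settings, the wrangler.toml binding for these values would look roughly like this (only the queue name comes from this doc; the rest mirrors the batch size and timeout above):

[[queues.consumers]]
queue = "clickhouse-ingestion"
max_batch_size = 100      # records per delivered batch
max_batch_timeout = 30    # seconds to wait before flushing a partial batch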

Batch Processing

The consumer batches multiple records into a single INSERT:

async function insertBatch(records, env) {
  const ch = createClickHouseClient(env);
  // formatRecord serializes one record into a quoted, escaped VALUES tuple
  const values = records.map(formatRecord).join(',');

  await ch.insert(`
    INSERT INTO keyword_snapshots
    (keyword, project_id, search_volume, ...)
    VALUES ${values}
  `);
}

Benefits:

  • Reduces ClickHouse connections
  • Improves write throughput
  • Lower CPU usage
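
For context, a minimal consumer entry point wired to insertBatch could look like the sketch below. The queue() handler signature is Cloudflare's standard Queues consumer API; the file layout and the insertBatch wiring are assumptions about this codebase.

// Hypothetical consumer (cf. src/queue/clickhouse-ingestion.js)
export default {
  async queue(batch, env) {
    // Each queued message body is one record bound for ClickHouse
    const records = batch.messages.map((msg) => msg.body);
    await insertBatch(records, env);
    // Returning normally acks the whole batch; throwing retries it
  }
};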

Client Library

Create Client

import { createClickHouseClient } from './lib/clickhouse.js';

const ch = createClickHouseClient(env);
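
The real client lives in src/lib/clickhouse.js. Purely as an illustration of the surface area, a minimal client over ClickHouse's HTTP interface could look like the sketch below; the port, result parsing, and error handling are assumptions, not the shipped code.

// Minimal sketch of a ClickHouse HTTP client -- NOT the shipped implementation
export function createClickHouseClient(env) {
  const base = `https://${env.CLICKHOUSE_HOST}:8443/`;
  const headers = {
    'X-ClickHouse-User': env.CLICKHOUSE_USER,
    'X-ClickHouse-Key': env.CLICKHOUSE_PASSWORD,
  };

  async function exec(sql, params = {}) {
    const url = new URL(base);
    url.searchParams.set('database', env.CLICKHOUSE_DATABASE);
    // ClickHouse binds param_<name> to {name:Type} placeholders in the SQL
    for (const [name, value] of Object.entries(params)) {
      url.searchParams.set(`param_${name}`, String(value));
    }
    const res = await fetch(url, { method: 'POST', headers, body: sql });
    if (!res.ok) throw new Error(`ClickHouse ${res.status}: ${await res.text()}`);
    return res.text();
  }

  return {
    // Rows come back as newline-delimited JSON objects
    query: async (sql, params) => {
      const text = await exec(`${sql} FORMAT JSONEachRow`, params);
      return text.trim() ? text.trim().split('\n').map(JSON.parse) : [];
    },
    insert: (sql) => exec(sql),
  };
}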

Execute Query

const results = await ch.query(`
  SELECT keyword, search_volume
  FROM keyword_snapshots
  WHERE project_id = {project_id:String}
  LIMIT 10
`, {
  project_id: 'proj_123'
});

Insert Data

await ch.insert(`
  INSERT INTO keyword_snapshots (keyword, search_volume, captured_at)
  VALUES ('seo tools', 27100, now())
`);

Batch Insert

const records = [
  { keyword: 'seo tools', volume: 27100 },
  { keyword: 'keyword research', volume: 8100 }
];

// NOTE: interpolated strings must be SQL-escaped; raw interpolation
// like this is only safe for trusted, pre-sanitized keyword values
const values = records.map(r =>
  `('${r.keyword}', ${r.volume}, now())`
).join(',');

await ch.insert(`
  INSERT INTO keyword_snapshots (keyword, search_volume, captured_at)
  VALUES ${values}
`);
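
Where the client can send a raw statement with inline rows, ClickHouse's JSONEachRow input format avoids manual quoting altogether. The syntax below is standard ClickHouse; whether ch.insert passes it through unchanged is an assumption, and the dates are placeholder values:

INSERT INTO keyword_snapshots (keyword, search_volume, captured_at)
FORMAT JSONEachRow
{"keyword": "seo tools", "search_volume": 27100, "captured_at": "2025-01-15 00:00:00"}
{"keyword": "keyword research", "search_volume": 8100, "captured_at": "2025-01-15 00:00:00"}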

Schema Initialization

Auto-initialization

If CLICKHOUSE_AUTO_INIT=true, schema initializes on first request:

// In src/index.js
if (env.ENABLE_CLICKHOUSE === 'true' && env.CLICKHOUSE_AUTO_INIT === 'true') {
  const initialized = await env.DFS_RUNS.get('clickhouse-initialized');
  if (!initialized) {
    const ch = createClickHouseClient(env);
    await ch.validateSchema();
    await env.DFS_RUNS.put('clickhouse-initialized', 'true');
  }
}

Manual Initialization

curl https://your-worker.workers.dev/init/clickhouse

Creates:

  • keyword_snapshots table
  • serp_runs table
  • app_analytics table

Query Patterns

Time-Series Analysis

Get daily keyword volumes:

SELECT
    toDate(captured_at) AS date,
    avg(search_volume) AS avg_volume,
    max(search_volume) AS peak_volume
FROM keyword_snapshots
WHERE keyword = 'seo tools'
  AND captured_at > now() - INTERVAL 90 DAY
GROUP BY date
ORDER BY date DESC;

Aggregations

Top keywords by search volume:

SELECT
    keyword,
    max(search_volume) AS volume,
    avg(cpc) AS avg_cpc
FROM keyword_snapshots
WHERE project_id = 'proj_123'
GROUP BY keyword
ORDER BY volume DESC
LIMIT 50;

Window Functions

Calculate position changes:

SELECT
    keyword,
    toDate(tracked_at) AS date,
    position,
    lagInFrame(position) OVER (
        PARTITION BY keyword
        ORDER BY tracked_at
    ) AS previous_position
FROM serp_runs
WHERE project_id = 'proj_123'
ORDER BY keyword, date DESC;

Array Operations

Filter by SERP feature using the array function has():

SELECT keyword, position
FROM serp_runs
WHERE has(serp_features, 'featured_snippet')
  AND project_id = 'proj_123'
ORDER BY position ASC;

Performance Optimization

Partitioning

Tables are partitioned by month:

PARTITION BY toYYYYMM(captured_at)

Benefits:

  • Faster queries on recent data
  • Efficient data retention (drop old partitions)
  • Parallel processing

Example: drop one expired monthly partition (repeat for each month past retention):

ALTER TABLE keyword_snapshots
DROP PARTITION '202311';
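
To see which monthly partitions exist before dropping any, query system.parts:

SELECT partition, sum(rows) AS rows
FROM system.parts
WHERE table = 'keyword_snapshots'
  AND active
GROUP BY partition
ORDER BY partition;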

Indexing

Primary key order matters:

ORDER BY (organization_id, project_id, keyword, captured_at)

Query optimization:

-- Fast (uses primary key)
SELECT * FROM keyword_snapshots
WHERE organization_id = 'org_123'
AND project_id = 'proj_456';

-- Slower (scans all data)
SELECT * FROM keyword_snapshots
WHERE keyword = 'seo tools';

Materialized Views

Create pre-aggregated views for common queries. A plain SummingMergeTree would sum every numeric column when parts merge, corrupting averages and maxima, so use AggregatingMergeTree with -State combinators:

CREATE MATERIALIZED VIEW keyword_daily_stats
ENGINE = AggregatingMergeTree()
ORDER BY (project_id, keyword, date)
AS SELECT
    project_id,
    keyword,
    toDate(captured_at) AS date,
    avgState(search_volume) AS avg_volume_state,
    maxState(search_volume) AS max_volume_state
FROM keyword_snapshots
GROUP BY project_id, keyword, date;
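
Reading from the view then requires the matching -Merge combinators:

SELECT
    keyword,
    date,
    avgMerge(avg_volume_state) AS avg_volume,
    maxMerge(max_volume_state) AS max_volume
FROM keyword_daily_stats
WHERE project_id = 'proj_123'
GROUP BY keyword, date
ORDER BY date DESC;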

Migration from D1

Migrate existing D1 data to ClickHouse:

curl -X POST https://your-worker.workers.dev/api/admin/migrate-to-clickhouse

Process:

  1. Read data from D1 tables
  2. Transform to ClickHouse format
  3. Batch insert to ClickHouse
  4. Verify row counts
  5. (Optional) Drop D1 tables

Endpoint: src/endpoints/migrate-to-clickhouse.js


Monitoring & Health

Test Connection

curl https://your-worker.workers.dev/test/clickhouse

Response:

{
  "status": "ok",
  "clickhouse_connected": true,
  "server_version": "23.8.2.7",
  "database": "rankfabric"
}

Query Logs

ClickHouse query log:

SELECT
    query_start_time,
    query_duration_ms,
    query,
    exception
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 1000
ORDER BY query_start_time DESC
LIMIT 10;

Table Sizes

Check table sizes:

SELECT
    table,
    formatReadableSize(sum(bytes)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE database = 'rankfabric'
  AND active
GROUP BY table;

Cost Optimization

ClickHouse Cloud Pricing

Approximate costs:

  • Development: $0.25/hour (~$180/month)
  • Production: $1/hour (~$720/month)
  • Storage: $0.10/GB/month

Reduce Costs

  1. Use partitioning to drop old data:

    ALTER TABLE keyword_snapshots DROP PARTITION '202301';
  2. Optimize compression:

    OPTIMIZE TABLE keyword_snapshots FINAL;
  3. Use sampling for exploratory queries (note: SAMPLE requires a SAMPLE BY clause in the table definition, which the schemas above do not include):

    SELECT * FROM keyword_snapshots SAMPLE 0.1
    WHERE project_id = 'proj_123';
  4. Right-size your cluster:

    • Start with smallest instance
    • Scale up based on query performance

Backup & Retention

Backup Strategy

ClickHouse Cloud automatically backs up data daily.

Manual backup:

clickhouse-client --query "SELECT * FROM keyword_snapshots FORMAT CSV" > backup.csv
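
Recent ClickHouse releases (22.8+) also ship a native BACKUP statement. Availability and destination configuration vary by deployment, and the bucket and credentials below are placeholders:

BACKUP TABLE keyword_snapshots
TO S3('https://bucket.s3.amazonaws.com/backups/keyword_snapshots', 'access_key_id', 'secret_access_key');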

Retention Policy

Recommended retention:

  • Keyword snapshots: 2 years
  • SERP runs: 1 year
  • App analytics: Indefinite (minimal size)

Implement with TTL (expired rows are physically removed during background merges):

ALTER TABLE keyword_snapshots 
MODIFY TTL captured_at + INTERVAL 2 YEAR;

Troubleshooting

Connection Errors

Error: Connection refused

Fix:

  1. Verify ClickHouse host is accessible
  2. Check firewall rules
  3. Verify credentials

Slow Queries

Symptom: Queries take more than 10 seconds

Fix:

  1. Add WHERE clauses for organization_id/project_id
  2. Use DATE filters to reduce scanned rows
  3. Create materialized views for complex aggregations

Insert Failures

Error: Too many parts

Fix:

-- Merge small parts
OPTIMIZE TABLE keyword_snapshots;

-- Or raise the threshold at which inserts are rejected
ALTER TABLE keyword_snapshots
MODIFY SETTING parts_to_throw_insert = 3000;

The durable fix is fewer, larger inserts (for example, a bigger queue batch size), since every INSERT creates at least one new part.

File Locations

  • Client library: src/lib/clickhouse.js
  • Schema definitions: validateSchema() in src/lib/clickhouse.js
  • Queue consumer: src/queue/clickhouse-ingestion.js
  • Migration tool: src/endpoints/migrate-to-clickhouse.js
