Keyword Research System
AI-powered website analysis, category detection, and keyword harvesting pipeline using DataForSEO Labs and LLM enrichment.
Overview
The keyword research system analyzes any website and generates categorized keyword lists with competitive metrics. It combines domain-level keyword discovery (DataForSEO Labs) with AI-generated category-specific keywords.
Key Features:
- Website content analysis and business type detection
- AI-recommended categories based on site content
- Domain keyword seed from DataForSEO
- Category-specific keyword generation
- Intent classification (commercial, informational, transactional, navigational)
- Brand keyword detection
- Monthly search volume and trend data
- Persistence to Base44 (entities) and ClickHouse (analytics)
Core Workflows
1. Stateless Analysis (Fast)
Quick website analysis without persistence.
Endpoint: POST /api/analyze-site
{
"url": "https://example.com"
}
Response:
{
"success": true,
"url": "https://example.com",
"keywords": [
{
"keyword": "project management software",
"search_volume": 12000,
"competition": 0.87,
"category_id": 12015
}
],
"categories": [
{ "id": 12015, "path": "/Business & Industrial/Business Services" }
]
}
Use Case: Onboarding UIs showing instant preview before full harvest.
2. Full Harvest Pipeline (Comprehensive)
Complete workflow with LLM enrichment, category confirmation, and persistence.
sequenceDiagram
participant Client
participant Worker
participant KV
participant R2
participant LLM
participant Queue
participant Base44
participant ClickHouse
Client->>Worker: POST /run {url, project_id}
Worker->>KV: Check idempotency
Worker->>Worker: Fetch HTML (Regex/Instant Pages)
Worker->>R2: Store raw HTML
Worker->>LLM: Extract business type + categories
Worker->>KV: Save run state
Worker-->>Client: {run_id, status: awaiting_category_confirmation}
Client->>Worker: POST /run/{id}/confirm-categories
Worker->>Queue: Enqueue harvest_keywords
Queue->>Worker: Process harvest
Worker->>DataForSEO: Get domain seed keywords
Worker->>LLM: Generate category keywords
Worker->>Worker: Merge, dedupe, enrich
Worker->>Base44: Upsert keywords + relationships
Worker->>ClickHouse: Queue metrics inserts
Worker->>KV: Update status: complete
Client->>Worker: GET /run/{id}/status
Worker-->>Client: {status: complete, harvest: {...}}
Step 1: Initiate Run
Endpoint: POST /run
{
"url": "https://example.com",
"project_id": "proj_123",
"user_id": "user_456"
}
Response:
{
"run_id": "run_abc123",
"status": "awaiting_category_confirmation",
"enrichment": {
"business_type": "SaaS Platform",
"focus": "Project management and team collaboration",
"recommended_categories": [
{
"id": 12015,
"path": "/Business & Industrial/Business Services",
"confidence": 0.92
}
]
}
}
Step 2: Confirm Categories
User reviews and confirms (or modifies) recommended categories.
Endpoint: POST /run/RUN_ID/confirm-categories
{
"categories": [
{
"id": 12015,
"confidence": 0.95,
"assigned_by": "user"
}
],
"locations": [
{ "location_code": 2840, "location_name": "United States" }
],
"app_names": ["Example App", "Example Suite"]
}
Response:
{
"success": true,
"run_id": "run_abc123",
"harvest_queued": true
}
Step 3: Poll Status
Endpoint: GET /run/RUN_ID/status
{
"run_id": "run_abc123",
"status": "complete",
"harvest": {
"total_keywords": 487,
"sources": {
"domain_seed": 123,
"harvest_ai": 364
},
"by_category": {
"12015": 487
}
},
"errors": []
}
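A client might poll this endpoint with a capped exponential backoff, as in the minimal sketch below. The helper names, the delay values, and the injected `fetchStatus` function are assumptions made for illustration; `complete` is the only terminal status this document shows, so other terminal states (if any) would need their own handling.

```typescript
// Hypothetical client-side polling helper; names and timings are assumptions.
interface RunStatus {
  run_id: string;
  status: string;
  errors?: string[];
}

// Injected so the caller decides how the HTTP request is made.
type StatusFetcher = (runId: string) => Promise<RunStatus>;

// Delay before the next poll: doubles with each attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 15000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

async function pollUntilComplete(
  runId: string,
  fetchStatus: StatusFetcher,
  maxAttempts = 20,
): Promise<RunStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await fetchStatus(runId);
    // "complete" is the only terminal status shown in this document.
    if (run.status === "complete") return run;
    await new Promise<void>((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
  throw new Error(`Run ${runId} did not complete after ${maxAttempts} polls`);
}

// Example wiring (base URL is a placeholder):
// pollUntilComplete("run_abc123", (id) =>
//   fetch(`https://your-worker.workers.dev/run/${id}/status`).then((r) => r.json()));
```

Injecting the fetcher keeps the retry logic independent of the HTTP client and easy to exercise with a stub.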
3. Keyword Suggestions (Autocomplete)
Get keyword suggestions with category paths.
Endpoint: POST /api/keywords/suggestions
{
"keyword": "project man",
"location_code": 2840,
"limit": 10
}
Response:
{
"suggestions": [
{
"keyword": "project management software",
"search_volume": 12000,
"dataforseo_category_paths": [
"/Business & Industrial/Business Services"
]
}
]
}
Caching: Results cached in KV for 24h per keyword + location.
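The sketch below shows one way a cache key for a keyword + location pair could be derived. The `suggest:` prefix and the key layout are hypothetical, since the actual key format is not documented here; only the 24-hour lifetime comes from the text above.

```typescript
// 24 hours in seconds, matching the cache lifetime described above.
const SUGGESTION_TTL_SECONDS = 24 * 60 * 60;

// Hypothetical key layout: the real key format is not documented.
function suggestionCacheKey(keyword: string, locationCode: number): string {
  // Collapse case and whitespace so equivalent queries share one cache entry.
  const normalized = keyword.trim().toLowerCase().replace(/\s+/g, " ");
  return `suggest:${locationCode}:${normalized}`;
}

// In a Worker, the TTL would typically be passed to KV on write, e.g.
// (binding name is a placeholder):
// await env.SOME_KV.put(key, JSON.stringify(result), {
//   expirationTtl: SUGGESTION_TTL_SECONDS,
// });
```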
Data Models
KV Storage (DFS_RUNS)
Key: run:RUN_ID
{
"runId": "run_abc123",
"url": "https://example.com",
"project_id": "proj_123",
"user_id": "user_456",
"status": "complete",
"enrichment": {...},
"confirmed_categories": [...],
"harvest": {...},
"created_at": 1699999999999,
"updated_at": 1700000000000
}
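For reference, the record above could be typed roughly as follows. This is a sketch only: the elided fields (`enrichment`, `confirmed_categories`, `harvest`) are left as `unknown` because their full shape is not given here, and `runKey` and `withStatus` are hypothetical helper names.

```typescript
// Shape of the KV run record, inferred from the example above.
interface RunState {
  runId: string;
  url: string;
  project_id: string;
  user_id: string;
  status: string; // e.g. "awaiting_category_confirmation" or "complete"
  enrichment?: unknown;
  confirmed_categories?: unknown;
  harvest?: unknown;
  created_at: number; // epoch milliseconds
  updated_at: number; // epoch milliseconds
}

// Builds the KV key for a run, following the "run:RUN_ID" pattern.
function runKey(runId: string): string {
  return `run:${runId}`;
}

// Returns a copy with a new status and a refreshed updated_at timestamp.
function withStatus(state: RunState, status: string, now: number = Date.now()): RunState {
  return { ...state, status, updated_at: now };
}
```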
R2 Storage (DFS_RAW_PAYLOADS)
Key: payloads/RUN_ID.html
Raw HTML from Instant Pages or regex scraper.
Base44 Entities
Keyword:
{
text: "project management software",
normalized_text: "project management software",
original_keyword_text: "Project Management Software",
sources: ["domain_seed", "harvest_ai"],
primary_intent: "commercial",
secondary_intents: ["informational"],
brand_flag: false,
dataforseo_category_paths: ["/Business & Industrial/Business Services"],
latest_search_volume: 12000,
latest_competition: 0.87,
latest_cpc: 8.45,
latest_trend: "up",
updated_at: "2024-01-15T..."
}
Relationships:
- ProjectKeyword - Links keywords to projects with metadata
- KeywordCategory - Links keywords to categories with confidence + source
ClickHouse Tables
keyword_snapshots:
CREATE TABLE keyword_snapshots (
snapshot_id UUID,
keyword_id String,
keyword_text String,
search_volume UInt32,
competition Float32,
cpc Float32,
trend String,
snapshot_date Date,
created_at DateTime
) ENGINE = MergeTree()
ORDER BY (keyword_id, snapshot_date);
monthly_keyword_searches:
CREATE TABLE monthly_keyword_searches (
keyword_id String,
year UInt16,
month UInt8,
search_volume UInt32,
created_at DateTime
) ENGINE = MergeTree()
ORDER BY (keyword_id, year, month);
Keyword Processing
Merge & Dedupe Logic
- Domain seed keywords (from DataForSEO Labs keywords_for_site)
  - Source: domain_seed
  - Includes search volume, competition, CPC
- AI category keywords (generated per confirmed category)
  - Source: harvest_ai
  - Enriched with DataForSEO metrics if available
- Normalization:
  - Lowercase
  - Trim whitespace
  - Remove duplicates (case-insensitive)
- Combined sources:
  - If same keyword appears in both → sources: ["domain_seed", "harvest_ai"]
  - Preserve original text from first occurrence
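The merge rules above can be sketched as a single pass over both lists. This is an illustrative sketch, not the pipeline's actual code: the function and type names are hypothetical, metrics are omitted, and it assumes domain seed keywords are processed first so that their casing counts as the "first occurrence".

```typescript
type KeywordSource = "domain_seed" | "harvest_ai";

interface MergedKeyword {
  normalized_text: string;
  original_keyword_text: string;
  sources: KeywordSource[];
}

// Lowercase and trim, as described under "Normalization".
function normalizeKeyword(text: string): string {
  return text.trim().toLowerCase();
}

function mergeKeywords(domainSeed: string[], harvestAi: string[]): MergedKeyword[] {
  const merged = new Map<string, MergedKeyword>();

  const add = (text: string, source: KeywordSource): void => {
    const key = normalizeKeyword(text);
    if (key === "") return; // skip blank entries
    const existing = merged.get(key);
    if (existing === undefined) {
      // First occurrence: keep the original text as received.
      merged.set(key, { normalized_text: key, original_keyword_text: text, sources: [source] });
    } else if (existing.sources.indexOf(source) === -1) {
      // Duplicate from the other source: record the source, keep the first text.
      existing.sources.push(source);
    }
  };

  for (const text of domainSeed) add(text, "domain_seed");
  for (const text of harvestAi) add(text, "harvest_ai");
  return Array.from(merged.values());
}
```

Keying the map by normalized text makes deduplication case-insensitive while leaving the first-seen spelling available for display.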
Intent Classification
Uses DataForSEO keyword_info intent data:
- commercial - Buyer intent ("buy", "price", "review")
- informational - Learning intent ("how to", "what is", "guide")
- transactional - Action intent ("download", "sign up", "free trial")
- navigational - Brand/destination ("facebook login", "gmail")
Storage:
- primary_intent - Highest confidence intent
- secondary_intents - Array of other intents with confidence > 0.3
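As a rough illustration of the storage rule, the sketch below picks the primary and secondary intents from a map of per-intent confidence scores. The input shape (`IntentScores`) and the function name are assumptions; the DataForSEO response would first need to be mapped into this form.

```typescript
type Intent = "commercial" | "informational" | "transactional" | "navigational";

// Assumed input: one confidence score (0..1) per intent that was returned.
type IntentScores = Partial<Record<Intent, number>>;

interface IntentResult {
  primary_intent: Intent | null;
  secondary_intents: Intent[];
}

function classifyIntents(scores: IntentScores, threshold = 0.3): IntentResult {
  // Rank intents from highest to lowest confidence.
  const ranked = (Object.keys(scores) as Intent[])
    .map((intent) => ({ intent, score: scores[intent] ?? 0 }))
    .sort((a, b) => b.score - a.score);

  const top = ranked[0];
  if (top === undefined) return { primary_intent: null, secondary_intents: [] };

  return {
    primary_intent: top.intent,
    // Every other intent strictly above the threshold is kept as secondary.
    secondary_intents: ranked
      .slice(1)
      .filter((entry) => entry.score > threshold)
      .map((entry) => entry.intent),
  };
}
```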
Brand Detection
Keyword is flagged as brand_flag: true if:
- Contains confirmed app names (from /confirm-categories payload)
- Contains domain name variations
- DataForSEO intent includes high navigational score
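The three conditions might be combined as in the sketch below. Several details are assumptions rather than documented behavior: the conditions are treated as any-of, "domain name variations" is approximated by matching the first domain label with punctuation removed, and the 0.7 navigational threshold is a placeholder for "high".

```typescript
interface BrandSignals {
  appNames: string[]; // from the /confirm-categories payload
  domain: string; // e.g. "example.com"
  navigationalScore?: number; // assumed 0..1 confidence
}

// Keep only lowercase letters and digits, so "acme-corp" matches "acme corp".
function compact(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]/g, "");
}

function isBrandKeyword(keyword: string, signals: BrandSignals, navThreshold = 0.7): boolean {
  const text = keyword.toLowerCase();

  // 1. Keyword contains a confirmed app name.
  const hasAppName = signals.appNames.some((appName) => {
    const needle = appName.trim().toLowerCase();
    return needle !== "" && text.indexOf(needle) !== -1;
  });
  if (hasAppName) return true;

  // 2. Keyword contains the domain's first label (simplified "variation" check).
  const label = signals.domain.toLowerCase().replace(/^www\./, "").split(".")[0] ?? "";
  const labelCompact = compact(label);
  if (labelCompact !== "" && compact(keyword).indexOf(labelCompact) !== -1) return true;

  // 3. Navigational score is high (threshold is an assumption).
  return (signals.navigationalScore ?? 0) >= navThreshold;
}
```

Substring matching like this can over-match short brand names, so a production check would likely need word-boundary handling.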
Configuration
Environment Variables
[vars]
DATAFORSEO_LABS_ENDPOINT = "https://api.dataforseo.com/v3/dataforseo_labs/google/keywords_for_site/live"
DATAFORSEO_LABS_LIMIT = "100"
DATAFORSEO_LABS_MAX_REQUESTS = "1"
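Because these values are declared as strings, the Worker has to convert the numeric ones before use. A minimal sketch of that conversion follows; the interface and function names are hypothetical, and it assumes both limits must be positive integers.

```typescript
// Subset of the Worker environment shown above; all values arrive as strings.
interface LabsEnv {
  DATAFORSEO_LABS_ENDPOINT: string;
  DATAFORSEO_LABS_LIMIT: string;
  DATAFORSEO_LABS_MAX_REQUESTS: string;
}

interface LabsConfig {
  endpoint: string;
  limit: number;
  maxRequests: number;
}

// Converts a string variable to a positive integer or fails loudly.
function parsePositiveInt(varName: string, raw: string): number {
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${varName} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function readLabsConfig(env: LabsEnv): LabsConfig {
  return {
    endpoint: env.DATAFORSEO_LABS_ENDPOINT,
    limit: parsePositiveInt("DATAFORSEO_LABS_LIMIT", env.DATAFORSEO_LABS_LIMIT),
    maxRequests: parsePositiveInt("DATAFORSEO_LABS_MAX_REQUESTS", env.DATAFORSEO_LABS_MAX_REQUESTS),
  };
}
```

Failing at startup on a malformed value is usually easier to diagnose than a silent `NaN` reaching the DataForSEO request.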
Secrets (via wrangler secret put)
wrangler secret put DATAFORSEO_LOGIN
wrangler secret put DATAFORSEO_PASSWORD
wrangler secret put BASE44_API_URL
wrangler secret put BASE44_JWT_SECRET
wrangler secret put CLICKHOUSE_HOST
wrangler secret put CLICKHOUSE_USER
wrangler secret put CLICKHOUSE_PASSWORD
wrangler secret put CLICKHOUSE_DATABASE
API Reference
POST /run
Initialize keyword harvest run.
Request:
{
"url": "https://example.com",
"project_id": "proj_123",
"user_id": "user_456"
}
Response (202 Accepted):
{
"run_id": "run_abc123",
"status": "awaiting_category_confirmation",
"enrichment": {
"business_type": "SaaS Platform",
"focus": "Project management",
"recommended_categories": [...]
}
}
POST /run/RUN_ID/confirm-categories
Confirm categories and trigger harvest.
Request:
{
"categories": [{...}],
"locations": [{...}],
"app_names": [...]
}
Response (200 OK):
{
"success": true,
"harvest_queued": true
}
GET /run/RUN_ID/status
Poll harvest status.
Response:
{
"run_id": "run_abc123",
"status": "complete",
"harvest": {
"total_keywords": 487,
"sources": {...},
"by_category": {...}
}
}
POST /api/analyze-site
Stateless quick analysis.
Request:
{
"url": "https://example.com"
}
Response:
{
"success": true,
"keywords": [...],
"categories": [...]
}
POST /api/keywords/suggestions
Get keyword autocomplete suggestions.
Request:
{
"keyword": "project man",
"location_code": 2840,
"limit": 10
}
Response:
{
"suggestions": [...]
}
React Integration
See React Client Guide for:
- Onboarding UI flow
- Category confirmation UX
- Status polling patterns
- Keyword display components
Troubleshooting
Run stuck in "awaiting_category_confirmation"
Check: KV run state
wrangler kv:key get --namespace-id=87e701aa4cb241e8a5732ac3d5835c4e "run:run_abc123"
Fix: User must POST to /run/RUN_ID/confirm-categories
Harvest completes but keywords not in Base44
Check: ClickHouse ingestion queue
curl https://your-worker.workers.dev/test/clickhouse
Fix: Base44 upsert failures are logged; check worker logs for auth errors.
DataForSEO quota exceeded
Check: Budget tracking
wrangler kv:key get --namespace-id=786b7e405123458e9e0f1341cb5c094b "budget:daily"
Fix: Adjust DATAFORSEO_LABS_LIMIT or increase DataForSEO account quota.
Performance & Limits
- Domain seed: 1 DataForSEO Labs request per run (~$0.02)
- Category keywords: 1 LLM generation per category
- Enrichment: 1 DataForSEO keyword_info request per 100 keywords
- Queue processing: Max 20 concurrent harvest jobs
- ClickHouse batching: 100 rows per insert (reduces request count)
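Both the enrichment and ClickHouse limits above come down to splitting a list into groups of 100. A generic helper for that might look like the sketch below; the function and constant names are illustrative.

```typescript
// Batch sizes taken from the limits listed above.
const KEYWORD_INFO_BATCH_SIZE = 100;
const CLICKHOUSE_BATCH_SIZE = 100;

// Splits items into consecutive batches of at most `size` elements.
function chunkItems<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}
```

For the 487-keyword run shown earlier, this yields five batches: four of 100 and one of 87.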
Estimated costs per run:
- DataForSEO: $0.02 - $0.10 (depending on keyword volume)
- Workers AI: ~$0.01 per category
- ClickHouse: Free tier or ~$0.001 per 1000 rows
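These figures can be combined into a rough per-run range. The sketch below only does that arithmetic with the numbers listed above; it is not a billing model, and the function name and the choice to treat the ClickHouse free tier as the lower bound are assumptions.

```typescript
interface CostRange {
  min: number; // USD
  max: number; // USD
}

// Rough per-run estimate built from the figures listed above.
function estimateRunCost(categoryCount: number, clickhouseRows: number): CostRange {
  const dataforseoMin = 0.02;
  const dataforseoMax = 0.1;
  const workersAi = 0.01 * categoryCount; // ~$0.01 per category
  const clickhousePaid = 0.001 * (clickhouseRows / 1000); // ~$0.001 per 1000 rows

  return {
    // Lower bound assumes the ClickHouse free tier applies.
    min: dataforseoMin + workersAi,
    max: dataforseoMax + workersAi + clickhousePaid,
  };
}
```

For example, a run with 3 categories and 487 rows comes out at roughly $0.05 to $0.13.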