Keyword Research System

AI-powered website analysis, category detection, and keyword harvesting pipeline using DataForSEO Labs and LLM enrichment.


Overview

The keyword research system analyzes any website and generates categorized keyword lists with competitive metrics. It combines domain-level keyword discovery (DataForSEO Labs) with AI-generated category-specific keywords.

Key Features:

  • Website content analysis and business type detection
  • AI-recommended categories based on site content
  • Domain seed keywords from DataForSEO Labs
  • Category-specific keyword generation
  • Intent classification (commercial, informational, transactional, navigational)
  • Brand keyword detection
  • Monthly search volume and trend data
  • Persistence to Base44 (entities) and ClickHouse (analytics)

Core Workflows

1. Stateless Analysis (Fast)

Quick website analysis without persistence.

Endpoint: POST /api/analyze-site

{
"url": "https://example.com"
}

Response:

{
"success": true,
"url": "https://example.com",
"keywords": [
{
"keyword": "project management software",
"search_volume": 12000,
"competition": 0.87,
"category_id": 12015
}
],
"categories": [
{ "id": 12015, "path": "/Business & Industrial/Business Services" }
]
}

Use case: onboarding UIs that show an instant preview before the full harvest runs.


2. Full Harvest Pipeline (Comprehensive)

Complete workflow with LLM enrichment, category confirmation, and persistence.

sequenceDiagram
participant Client
participant Worker
participant KV
participant R2
participant LLM
participant Queue
participant Base44
participant ClickHouse
participant DataForSEO

Client->>Worker: POST /run {url, project_id}
Worker->>KV: Check idempotency
Worker->>Worker: Fetch HTML (Regex/Instant Pages)
Worker->>R2: Store raw HTML
Worker->>LLM: Extract business type + categories
Worker->>KV: Save run state
Worker-->>Client: {run_id, status: awaiting_category_confirmation}

Client->>Worker: POST /run/{id}/confirm-categories
Worker->>Queue: Enqueue harvest_keywords
Queue->>Worker: Process harvest
Worker->>DataForSEO: Get domain seed keywords
Worker->>LLM: Generate category keywords
Worker->>Worker: Merge, dedupe, enrich
Worker->>Base44: Upsert keywords + relationships
Worker->>ClickHouse: Queue metrics inserts
Worker->>KV: Update status: complete

Client->>Worker: GET /run/{id}/status
Worker-->>Client: {status: complete, harvest: {...}}

Step 1: Initiate Run

Endpoint: POST /run

{
"url": "https://example.com",
"project_id": "proj_123",
"user_id": "user_456"
}

Response:

{
"run_id": "run_abc123",
"status": "awaiting_category_confirmation",
"enrichment": {
"business_type": "SaaS Platform",
"focus": "Project management and team collaboration",
"recommended_categories": [
{
"id": 12015,
"path": "/Business & Industrial/Business Services",
"confidence": 0.92
}
]
}
}

Step 2: Confirm Categories

The user reviews the recommended categories and confirms or modifies them.

Endpoint: POST /run/RUN_ID/confirm-categories

{
"categories": [
{
"id": 12015,
"confidence": 0.95,
"assigned_by": "user"
}
],
"locations": [
{ "location_code": 2840, "location_name": "United States" }
],
"app_names": ["Example App", "Example Suite"]
}

Response:

{
"success": true,
"run_id": "run_abc123",
"harvest_queued": true
}

Step 3: Poll Status

Endpoint: GET /run/RUN_ID/status

{
"run_id": "run_abc123",
"status": "complete",
"harvest": {
"total_keywords": 487,
"sources": {
"domain_seed": 123,
"harvest_ai": 364
},
"by_category": {
"12015": 487
}
},
"errors": []
}
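
The polling step can be sketched as a small loop with capped exponential backoff. This is a minimal sketch rather than the project's client: pollRunStatus, nextDelayMs, the injected fetchStatus function, and the delay values are assumptions made for illustration. Only the status values awaiting_category_confirmation and complete come from the examples above.

```typescript
// Hypothetical polling helper; names and delay values are illustrative assumptions.
interface RunStatus {
  run_id: string;
  status: string; // e.g. "awaiting_category_confirmation" or "complete"
  errors?: string[];
}

// Capped exponential backoff: 1s, 2s, 4s, ... never more than maxMs.
function nextDelayMs(attempt: number, baseMs = 1000, maxMs = 15000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// fetchStatus is injected so the loop can wrap GET /run/RUN_ID/status or a test stub.
async function pollRunStatus(
  fetchStatus: () => Promise<RunStatus>,
  maxAttempts = 10,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<RunStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await fetchStatus();
    if (state.status === "complete") return state;
    await sleep(nextDelayMs(attempt));
  }
  throw new Error(`Run did not complete after ${maxAttempts} attempts`);
}
```

A caller would pass a function that fetches and parses the status endpoint for one run. Injecting it keeps the loop independent of the HTTP client in use.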

3. Keyword Suggestions (Autocomplete)

Get keyword suggestions with category paths.

Endpoint: POST /api/keywords/suggestions

{
"keyword": "project man",
"location_code": 2840,
"limit": 10
}

Response:

{
"suggestions": [
{
"keyword": "project management software",
"search_volume": 12000,
"dataforseo_category_paths": [
"/Business & Industrial/Business Services"
]
}
]
}

Caching: Results are cached in KV for 24 hours, keyed by keyword and location.
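
One way to derive a per-keyword, per-location cache key is sketched below. The key layout (suggest:LOCATION:KEYWORD) and the helper name suggestionCacheKey are assumptions, not the worker's actual scheme. The 24-hour TTL is the only value taken from the text.

```typescript
// Hypothetical cache-key helper; the key layout is an assumption.
const SUGGESTION_TTL_SECONDS = 24 * 60 * 60; // 24 hours, as stated above

function suggestionCacheKey(keyword: string, locationCode: number): string {
  // Normalize so "Project Man " and "project man" share one cache entry.
  const normalized = keyword.trim().toLowerCase().replace(/\s+/g, " ");
  return `suggest:${locationCode}:${normalized}`;
}
```

With Workers KV, the TTL would typically be passed as the expirationTtl option of put, for example put(key, value, { expirationTtl: SUGGESTION_TTL_SECONDS }).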


Data Models

KV Storage (DFS_RUNS)

Key: run:RUN_ID

{
"runId": "run_abc123",
"url": "https://example.com",
"project_id": "proj_123",
"user_id": "user_456",
"status": "complete",
"enrichment": {...},
"confirmed_categories": [...],
"harvest": {...},
"created_at": 1699999999999,
"updated_at": 1700000000000
}

R2 Storage (DFS_RAW_PAYLOADS)

Key: payloads/RUN_ID.html

Raw HTML from Instant Pages or regex scraper.
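
The two key layouts above can be captured in small helpers so that the Worker and any debugging scripts build keys the same way. The function names are hypothetical. The key formats are the ones documented in this section.

```typescript
// Key builders for the documented layouts; function names are illustrative.
function runKvKey(runId: string): string {
  return `run:${runId}`; // DFS_RUNS key
}

function rawPayloadKey(runId: string): string {
  return `payloads/${runId}.html`; // DFS_RAW_PAYLOADS key
}
```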

Base44 Entities

Keyword:

{
text: "project management software",
normalized_text: "project management software",
original_keyword_text: "Project Management Software",
sources: ["domain_seed", "harvest_ai"],
primary_intent: "commercial",
secondary_intents: ["informational"],
brand_flag: false,
dataforseo_category_paths: ["/Business & Industrial/Business Services"],
latest_search_volume: 12000,
latest_competition: 0.87,
latest_cpc: 8.45,
latest_trend: "up",
updated_at: "2024-01-15T..."
}

Relationships:

  • ProjectKeyword - Links keywords to projects with metadata
  • KeywordCategory - Links keywords to categories with confidence + source

ClickHouse Tables

keyword_snapshots:

CREATE TABLE keyword_snapshots (
snapshot_id UUID,
keyword_id String,
keyword_text String,
search_volume UInt32,
competition Float32,
cpc Float32,
trend String,
snapshot_date Date,
created_at DateTime
) ENGINE = MergeTree()
ORDER BY (keyword_id, snapshot_date);

monthly_keyword_searches:

CREATE TABLE monthly_keyword_searches (
keyword_id String,
year UInt16,
month UInt8,
search_volume UInt32,
created_at DateTime
) ENGINE = MergeTree()
ORDER BY (keyword_id, year, month);
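
As an illustration of how a Keyword entity might map onto a keyword_snapshots row, here is a minimal sketch. The toSnapshotRow function and the id field on the keyword are assumptions (the entity example above does not show an ID). The column names follow the CREATE TABLE statement. The snapshot ID is passed in so the caller can decide how UUIDs are generated.

```typescript
// Hypothetical mapper from Keyword metrics to a keyword_snapshots row.
interface KeywordMetrics {
  id: string; // assumed identifier; not shown in the entity example
  text: string;
  latest_search_volume: number;
  latest_competition: number;
  latest_cpc: number;
  latest_trend: string;
}

interface KeywordSnapshotRow {
  snapshot_id: string;
  keyword_id: string;
  keyword_text: string;
  search_volume: number;
  competition: number;
  cpc: number;
  trend: string;
  snapshot_date: string; // YYYY-MM-DD (UTC)
  created_at: string; // YYYY-MM-DD HH:MM:SS (UTC)
}

function toSnapshotRow(kw: KeywordMetrics, snapshotId: string, now: Date): KeywordSnapshotRow {
  const iso = now.toISOString(); // e.g. 2024-01-15T10:30:00.000Z
  return {
    snapshot_id: snapshotId,
    keyword_id: kw.id,
    keyword_text: kw.text,
    search_volume: kw.latest_search_volume,
    competition: kw.latest_competition,
    cpc: kw.latest_cpc,
    trend: kw.latest_trend,
    snapshot_date: iso.slice(0, 10),
    created_at: iso.slice(0, 19).replace("T", " "),
  };
}
```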

Keyword Processing

Merge & Dedupe Logic

  1. Domain seed keywords (from DataForSEO Labs keywords_for_site)

    • Source: domain_seed
    • Includes search volume, competition, CPC
  2. AI category keywords (generated per confirmed category)

    • Source: harvest_ai
    • Enriched with DataForSEO metrics if available
  3. Normalization:

    • Lowercase
    • Trim whitespace
    • Remove duplicates (case-insensitive)
  4. Combined sources:

    • If the same keyword appears in both sources, set sources: ["domain_seed", "harvest_ai"]
    • Preserve original text from first occurrence
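
The merge steps above can be sketched as follows. This is a minimal illustration, not the worker's code: mergeKeywords and its types are hypothetical, and metric enrichment is left out. Domain seed keywords are processed first, so their text wins as the "first occurrence".

```typescript
// Hypothetical merge/dedupe sketch; metrics enrichment is omitted.
type KeywordSource = "domain_seed" | "harvest_ai";

interface MergedKeyword {
  normalized_text: string;
  original_keyword_text: string;
  sources: KeywordSource[];
}

function mergeKeywords(domainSeed: string[], harvestAi: string[]): MergedKeyword[] {
  const merged = new Map<string, MergedKeyword>();

  const add = (text: string, source: KeywordSource): void => {
    const normalized = text.trim().toLowerCase(); // lowercase + trim
    if (normalized === "") return;
    const existing = merged.get(normalized);
    if (!existing) {
      // First occurrence: keep the original casing (trimmed) alongside the normalized form.
      merged.set(normalized, {
        normalized_text: normalized,
        original_keyword_text: text.trim(),
        sources: [source],
      });
    } else if (existing.sources.indexOf(source) === -1) {
      existing.sources.push(source); // same keyword seen in the other source
    }
  };

  domainSeed.forEach((text) => add(text, "domain_seed"));
  harvestAi.forEach((text) => add(text, "harvest_ai"));
  return Array.from(merged.values());
}
```

Keying the map on the normalized text makes the dedupe case-insensitive without a separate pass.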

Intent Classification

Uses DataForSEO keyword_info intent data:

  • commercial - Buyer intent ("buy", "price", "review")
  • informational - Learning intent ("how to", "what is", "guide")
  • transactional - Action intent ("download", "sign up", "free trial")
  • navigational - Brand/destination ("facebook login", "gmail")

Storage:

  • primary_intent - Highest confidence intent
  • secondary_intents - Array of other intents with confidence > 0.3
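
The storage rule can be expressed as a short function. This sketch assumes the intent data arrives as a list of intent/confidence pairs. The pickIntents name and the IntentScore shape are hypothetical. The 0.3 threshold is the one stated above.

```typescript
// Hypothetical helper; input shape is an assumption about the intent data.
type Intent = "commercial" | "informational" | "transactional" | "navigational";

interface IntentScore {
  intent: Intent;
  confidence: number;
}

function pickIntents(
  scores: IntentScore[],
  secondaryThreshold = 0.3,
): { primary_intent: Intent | null; secondary_intents: Intent[] } {
  // Sort a copy by confidence, highest first.
  const ranked = scores.slice().sort((a, b) => b.confidence - a.confidence);
  const primary = ranked[0];
  if (!primary) return { primary_intent: null, secondary_intents: [] };
  return {
    primary_intent: primary.intent,
    secondary_intents: ranked
      .slice(1)
      .filter((s) => s.confidence > secondaryThreshold)
      .map((s) => s.intent),
  };
}
```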

Brand Detection

A keyword is flagged with brand_flag: true if:

  • It contains a confirmed app name (from the /confirm-categories payload)
  • It contains a variation of the domain name
  • Its DataForSEO intent data includes a high navigational score
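
A minimal sketch of these checks follows. Several details are assumptions, not documented behavior: any single condition is treated as sufficient, a "domain name variation" is approximated by the first domain label with punctuation stripped, and "high" navigational score is taken to mean 0.7 or above. The isBrandKeyword name is hypothetical.

```typescript
// Hypothetical brand check; the threshold and matching rules are assumptions.
function isBrandKeyword(
  keyword: string,
  appNames: string[],
  domain: string,
  navigationalScore = 0,
  navigationalThreshold = 0.7, // assumed cutoff for a "high" score
): boolean {
  const kw = keyword.toLowerCase();

  // 1. Contains a confirmed app name.
  const matchesApp = appNames.some((name) => {
    const n = name.trim().toLowerCase();
    return n !== "" && kw.indexOf(n) !== -1;
  });
  if (matchesApp) return true;

  // 2. Contains the domain label, e.g. "mysite" from "www.my-site.com".
  const firstLabel = domain.toLowerCase().replace(/^www\./, "").split(".")[0] || "";
  const label = firstLabel.replace(/[^a-z0-9]/g, "");
  const compactKeyword = kw.replace(/[^a-z0-9]/g, "");
  if (label !== "" && compactKeyword.indexOf(label) !== -1) return true;

  // 3. High navigational intent score.
  return navigationalScore >= navigationalThreshold;
}
```

Short domain labels can produce false positives with substring matching, so a production check would likely need word-boundary or token-level comparison.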

Configuration

Environment Variables

[vars]
DATAFORSEO_LABS_ENDPOINT = "https://api.dataforseo.com/v3/dataforseo_labs/google/keywords_for_site/live"
DATAFORSEO_LABS_LIMIT = "100"
DATAFORSEO_LABS_MAX_REQUESTS = "1"
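
Because the values above are quoted, they reach the Worker as strings and need to be parsed before use. The sketch below shows one way to do that. parseLabsConfig, parsePositiveInt, and the LabsConfig shape are hypothetical names, not part of the worker.

```typescript
// Hypothetical config parser for the [vars] shown above.
interface LabsEnv {
  DATAFORSEO_LABS_ENDPOINT: string;
  DATAFORSEO_LABS_LIMIT: string;
  DATAFORSEO_LABS_MAX_REQUESTS: string;
}

interface LabsConfig {
  endpoint: string;
  limit: number;
  maxRequests: number;
}

function parsePositiveInt(name: string, raw: string): number {
  const trimmed = raw.trim();
  // Accept only plain decimal digits so values like "100abc" or "1.5" are rejected.
  if (!/^\d+$/.test(trimmed) || Number(trimmed) <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return Number(trimmed);
}

function parseLabsConfig(env: LabsEnv): LabsConfig {
  return {
    endpoint: env.DATAFORSEO_LABS_ENDPOINT,
    limit: parsePositiveInt("DATAFORSEO_LABS_LIMIT", env.DATAFORSEO_LABS_LIMIT),
    maxRequests: parsePositiveInt("DATAFORSEO_LABS_MAX_REQUESTS", env.DATAFORSEO_LABS_MAX_REQUESTS),
  };
}
```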

Secrets (via wrangler secret put)

wrangler secret put DATAFORSEO_LOGIN
wrangler secret put DATAFORSEO_PASSWORD
wrangler secret put BASE44_API_URL
wrangler secret put BASE44_JWT_SECRET
wrangler secret put CLICKHOUSE_HOST
wrangler secret put CLICKHOUSE_USER
wrangler secret put CLICKHOUSE_PASSWORD
wrangler secret put CLICKHOUSE_DATABASE

API Reference

POST /run

Initialize a keyword harvest run.

Request:

{
"url": "https://example.com",
"project_id": "proj_123",
"user_id": "user_456"
}

Response (202 Accepted):

{
"run_id": "run_abc123",
"status": "awaiting_category_confirmation",
"enrichment": {
"business_type": "SaaS Platform",
"focus": "Project management",
"recommended_categories": [...]
}
}

POST /run/RUN_ID/confirm-categories

Confirm categories and trigger harvest.

Request:

{
"categories": [{...}],
"locations": [{...}],
"app_names": [...]
}

Response (200 OK):

{
"success": true,
"harvest_queued": true
}

GET /run/RUN_ID/status

Poll harvest status.

Response:

{
"run_id": "run_abc123",
"status": "complete",
"harvest": {
"total_keywords": 487,
"sources": {...},
"by_category": {...}
}
}

POST /api/analyze-site

Stateless quick analysis.

Request:

{
"url": "https://example.com"
}

Response:

{
"success": true,
"keywords": [...],
"categories": [...]
}

POST /api/keywords/suggestions

Get keyword autocomplete suggestions.

Request:

{
"keyword": "project man",
"location_code": 2840,
"limit": 10
}

Response:

{
"suggestions": [...]
}

React Integration

See React Client Guide for:

  • Onboarding UI flow
  • Category confirmation UX
  • Status polling patterns
  • Keyword display components

Troubleshooting

Run stuck in "awaiting_category_confirmation"

Check: KV run state

wrangler kv:key get --namespace-id=87e701aa4cb241e8a5732ac3d5835c4e "run:run_abc123"

Fix: This status is expected until the user confirms categories. The user must POST to /run/RUN_ID/confirm-categories.

Harvest completes but keywords not in Base44

Check: ClickHouse ingestion queue

curl https://your-worker.workers.dev/test/clickhouse

Fix: Base44 upsert failures are logged; check worker logs for auth errors.

DataForSEO quota exceeded

Check: Budget tracking

wrangler kv:key get --namespace-id=786b7e405123458e9e0f1341cb5c094b "budget:daily"

Fix: Adjust DATAFORSEO_LABS_LIMIT or increase DataForSEO account quota.


Performance & Limits

  • Domain seed: 1 DataForSEO Labs request per run (~$0.02)
  • Category keywords: 1 LLM generation per category
  • Enrichment: 1 DataForSEO keyword_info request per 100 keywords
  • Queue processing: Max 20 concurrent harvest jobs
  • ClickHouse batching: 100 rows per insert (reduces request count)
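
Both the 100-keyword enrichment requests and the 100-row ClickHouse inserts come down to splitting a list into fixed-size batches. A generic helper for that might look like the sketch below. The chunk name is illustrative and is not taken from the worker code.

```typescript
// Generic batching helper (illustrative): splits items into arrays of at most `size`.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

For the 487 keywords in the status example, chunk(keywords, 100) yields five batches, the last holding 87 items.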

Estimated costs per run:

  • DataForSEO: $0.02 - $0.10 (depending on keyword volume)
  • Workers AI: ~$0.01 per category
  • ClickHouse: Free tier or ~$0.001 per 1000 rows
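
The figures above can be combined into a rough per-run estimate. This is a back-of-the-envelope sketch with a hypothetical estimateRunCostUsd function. It assumes the paid ClickHouse rate rather than the free tier and treats the DataForSEO range as fixed lower and upper bounds.

```typescript
// Rough cost range per run, using the estimates listed above (USD).
function estimateRunCostUsd(
  categoryCount: number,
  clickhouseRows: number,
): { low: number; high: number } {
  const workersAi = 0.01 * categoryCount; // ~$0.01 per category
  const clickhouse = 0.001 * (clickhouseRows / 1000); // ~$0.001 per 1000 rows
  return {
    low: 0.02 + workersAi + clickhouse, // DataForSEO lower bound
    high: 0.1 + workersAi + clickhouse, // DataForSEO upper bound
  };
}
```

For example, a run with 3 confirmed categories and 1000 inserted rows comes to roughly $0.05 to $0.13.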