Keyword Classification System
Multi-dimensional keyword intelligence using a 4-stage classification pipeline.
Overview
The keyword classification system provides:
- Multi-Dimensional Classification - 20+ dimensions covering intent, journey, behavior, and more
- 4-Stage Pipeline - Rules → Vectorize → Brand Lookup → LLM (progressive enhancement)
- Self-Learning - High-confidence LLM results feed back into Vectorize
- Implicit Triggers - Keywords with classification_completed_at IS NULL get processed automatically
Key Capabilities:
- Journey moment tracking (12 granular stages vs 5 broad funnel stages)
- Expertise level and buyer behavior profiling
- Brand mention detection
- Query specificity (head/torso/long-tail)
- Content decay prediction
- Commercial value indexing
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                     Keyword Classification                      │
├─────────────────────────────────────────────────────────────────┤
│  Classification Triggers:                                       │
│  - New keywords from ranked_keywords endpoint                   │
│  - New keywords from SERP tracking                              │
│  - Keywords with classification_completed_at IS NULL            │
│                                                                 │
│  4-Stage Pipeline:                                              │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐  │
│  │  Stage 1  │──▶│  Stage 2  │──▶│  Stage 3  │──▶│  Stage 4  │  │
│  │   Rules   │   │ Vectorize │   │   Brand   │   │    LLM    │  │
│  │  (free)   │   │  (cheap)  │   │  (free)   │   │(expensive)│  │
│  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘   └─────┬─────┘  │
│        └───────────────┴───────┬───────┴───────────────┘        │
│                                │                                │
│                      Final Classification                       │
│                                │                                │
│                       ┌────────▼────────┐                       │
│                       │  Learning Loop  │                       │
│                       │ (High confidence│                       │
│                       │  → Vectorize)   │                       │
│                       └─────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘
Classification Dimensions
v2 Refined Dimensions (Recommended)
These dimensions are split out of the compound v1 dimensions for better filtering and analysis.
Journey Tracking
| Dimension | Values | Description |
|---|---|---|
| journey_moment | 12 values | Specific point in buyer journey |
| journey_direction | entering, progressing, stalled, regressing, exiting | Movement through journey |
Journey Moments:
- Awareness: problem_unaware, problem_aware, solution_curious
- Consideration: category_exploring, feature_comparing, option_evaluating
- Decision: provider_selecting, purchase_ready, purchase_completing
- Post-Purchase: onboarding, optimizing, advocating
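The grouping above can be written as a lookup table, which is handy when rolling the 12 moments back up into broad phases for reporting. This is only a sketch: JOURNEY_PHASES and phaseForMoment are hypothetical names, not exports of the classifier modules.

```javascript
// Hypothetical lookup: phase -> journey_moment values, as listed above.
const JOURNEY_PHASES = {
  awareness: ["problem_unaware", "problem_aware", "solution_curious"],
  consideration: ["category_exploring", "feature_comparing", "option_evaluating"],
  decision: ["provider_selecting", "purchase_ready", "purchase_completing"],
  post_purchase: ["onboarding", "optimizing", "advocating"],
};

// Returns the phase a journey_moment belongs to, or null if it is unknown.
function phaseForMoment(moment) {
  for (const [phase, moments] of Object.entries(JOURNEY_PHASES)) {
    if (moments.includes(moment)) return phase;
  }
  return null;
}

console.log(phaseForMoment("purchase_ready")); // decision
```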
Searcher Profile
| Dimension | Values | Description |
|---|---|---|
| expertise_level | novice, intermediate, advanced, expert | Knowledge level |
| buyer_behavior | impulsive, methodical, price_sensitive, quality_focused, brand_loyal, research_heavy | Purchase approach |
| role_context | individual_consumer, small_business_owner, enterprise_buyer, developer, marketer, student, professional, hobbyist | Who is searching |
Content & Demand
| Dimension | Values | Description |
|---|---|---|
| demand_pattern | evergreen, seasonal_recurring, event_driven, trending_spike, declining | When demand occurs |
| content_decay | timeless, slow_decay, moderate_decay, fast_decay, real_time | How quickly content goes stale |
| query_specificity | head, torso, long_tail, ultra_long_tail | Query length/specificity |
Brand Detection
| Dimension | Values | Description |
|---|---|---|
| has_brand_mention | boolean | Does keyword contain a brand name |
| brand_mentioned | text | The actual brand name if detected |
v1 Dimensions (Backwards Compatible)
These are still populated but the v2 dimensions provide more granular filtering.
| Dimension | Description |
|---|---|
| funnel_stage | awareness, consideration, decision, retention, advocacy |
| intent_type | informational, navigational, commercial_investigation, transactional, local, support |
| persona | researcher_beginner, practitioner, buyer_decision_maker, etc. |
| keyword_pattern | question, comparison, superlative, problem, use_case, etc. |
| ranking_difficulty | very_easy, easy, medium, hard, very_hard |
| trend_velocity | rising_fast, rising_slow, stable, declining, volatile |
| seasonality | evergreen, seasonal, event_driven |
| competitive_density | low, medium, high, dominated |
| commercial_value_index | numeric (volume × CPC) |
| sentiment_intent | positive, neutral, negative, urgent |
| content_format_intent | listicle, comparison_review, how_to_guide, video_preferred, etc. |
| serp_intent_signals | array of SERP features present |
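As a worked example of commercial_value_index, which the table describes as volume × CPC: the sketch below assumes the inputs are the search_volume and cpc fields of a keyword_info object and that missing values count as zero. The function name is hypothetical.

```javascript
// Hypothetical helper: commercial value index as search volume times CPC.
function commercialValueIndex(keywordInfo) {
  const volume = Number(keywordInfo?.search_volume) || 0; // missing -> 0 (assumption)
  const cpc = Number(keywordInfo?.cpc) || 0;
  return volume * cpc;
}

console.log(commercialValueIndex({ search_volume: 2900, cpc: 12.5 })); // 36250
```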
Pipeline Stages
Stage 1: Rules Engine (Free)
Handles ~60% of classifications using regex patterns and DataForSEO signals.
Classifies:
- keyword_pattern (regex-based)
- ranking_difficulty (from KD score)
- trend_velocity (from monthly_searches)
- demand_pattern (from search trends)
- query_specificity (from token count + volume)
- intent_type (from DataForSEO + patterns)
- sentiment_intent (from keyword text)
- content_format_intent (from patterns + SERP)
- competitive_density (from competition + CPC)
Files:
- src/lib/keyword-classifier-rules.js
- src/lib/keyword-classification-constants.js
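To illustrate the kind of rule Stage 1 applies, here is a minimal sketch of regex-based keyword_pattern matching and token-count-based query_specificity. The patterns, thresholds, and function names are illustrative assumptions, not the contents of the rules or constants files, and the sketch ignores search volume.

```javascript
// Hypothetical pattern table; rules are checked in order, first match wins.
const PATTERN_RULES = [
  { pattern: "question", regex: /^(how|what|why|when|where|who|which|can|does|is)\b/i },
  { pattern: "comparison", regex: /\b(vs\.?|versus|compared to|alternatives? to)\b/i },
  { pattern: "superlative", regex: /\b(best|top|cheapest|fastest)\b/i },
];

// Returns the first matching keyword_pattern value, or null if no rule fires.
function matchKeywordPattern(keyword) {
  const rule = PATTERN_RULES.find((r) => r.regex.test(keyword));
  return rule ? rule.pattern : null;
}

// Token-count thresholds are assumptions chosen for illustration only.
function querySpecificity(keyword) {
  const tokens = keyword.trim().split(/\s+/).filter(Boolean).length;
  if (tokens <= 1) return "head";
  if (tokens <= 3) return "torso";
  if (tokens <= 7) return "long_tail";
  return "ultra_long_tail";
}

console.log(matchKeywordPattern("slack vs teams comparison")); // comparison
console.log(querySpecificity("how to use slack")); // long_tail
```

A keyword that matches no rule returns null here; in the pipeline that would be the cue to defer the dimension to a later stage.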
Stage 2: Vectorize (Cheap, ~$0.00001/query)
Uses Cloudflare Vectorize for semantic similarity matching.
How it works:
- Convert keyword + metadata to feature text
- Embed using Workers AI (bge-base-en-v1.5)
- Find similar keywords in Vectorize index
- Vote on classification from top-k matches
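The voting step can be sketched as a similarity-weighted majority vote over the top-k matches. This assumes each match has the shape { score, metadata } and that the stored metadata carries the labels of the example keyword. The function name and the weighting scheme are assumptions, not the module's actual logic.

```javascript
// Weighted vote: each match contributes its similarity score to its label.
function voteOnDimension(matches, dimension) {
  const totals = new Map();
  let totalWeight = 0;
  for (const match of matches) {
    const label = match.metadata?.[dimension];
    if (!label || match.score <= 0) continue; // ignore unlabeled or dissimilar matches
    totals.set(label, (totals.get(label) || 0) + match.score);
    totalWeight += match.score;
  }
  if (totalWeight === 0) return null; // nothing to vote on

  let winner = null;
  for (const [label, weight] of totals) {
    if (!winner || weight > winner.weight) winner = { label, weight };
  }
  // Confidence is the winner's share of the total weight, on a 0-100 scale.
  return { value: winner.label, confidence: Math.round((winner.weight / totalWeight) * 100) };
}

const matches = [
  { score: 0.9, metadata: { journey_moment: "option_evaluating" } },
  { score: 0.8, metadata: { journey_moment: "option_evaluating" } },
  { score: 0.3, metadata: { journey_moment: "feature_comparing" } },
];
console.log(voteOnDimension(matches, "journey_moment"));
// { value: 'option_evaluating', confidence: 85 }
```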
Classifies:
- expertise_level
- buyer_behavior
- role_context
- journey_moment
- topic_entity_type
- use_case_type
Files:
src/lib/keyword-classifier-vectorize.js
Self-Learning: High-confidence LLM classifications (>80%) are added to the Vectorize index, so similar keywords can later be resolved in Stage 2 without an LLM call.
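The threshold check behind the learning loop might look like the sketch below. It assumes confidence is on a 0-100 scale and that each dimension result records which stage produced it in a source field. That field and the function name are assumptions for illustration.

```javascript
const LEARNING_THRESHOLD = 80; // the ">80%" rule, on a 0-100 confidence scale

// Hypothetical helper: keep only LLM-produced dimensions that clear the threshold.
function selectLearnableDimensions(classification) {
  const learnable = {};
  for (const [dimension, result] of Object.entries(classification)) {
    if (result.source === "llm" && result.confidence > LEARNING_THRESHOLD) {
      learnable[dimension] = result.value;
    }
  }
  return learnable;
}

console.log(
  selectLearnableDimensions({
    expertise_level: { value: "novice", confidence: 92, source: "llm" },
    buyer_behavior: { value: "methodical", confidence: 80, source: "llm" },
    query_specificity: { value: "long_tail", confidence: 95, source: "rules" },
  })
);
// { expertise_level: 'novice' }
```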
Stage 3: Brand Lookup (Free)
Checks if keyword mentions a known brand from the brands table.
How it works:
- Tokenize keyword
- Try exact brand name matches
- Try prefix matches with LIKE
- Return brand_id and confidence
Classifies:
- has_brand_mention
- brand_mentioned
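The lookup steps can be sketched in memory as follows. The real stage queries the brands table, so the BRANDS array, the confidence values, and the minimum token length are assumptions, and startsWith stands in for a LIKE 'token%' prefix query.

```javascript
// In-memory stand-in for the brands table (hypothetical rows).
const BRANDS = [
  { id: 1, name: "Slack" },
  { id: 2, name: "Microsoft Teams" },
];

function findBrandMention(keyword, brands = BRANDS) {
  const tokens = keyword.toLowerCase().split(/\s+/).filter(Boolean);

  // Candidate phrases: every contiguous token run, longest first,
  // so multi-word brand names are tried before single words.
  const phrases = [];
  for (let len = tokens.length; len >= 1; len--) {
    for (let start = 0; start + len <= tokens.length; start++) {
      phrases.push(tokens.slice(start, start + len).join(" "));
    }
  }

  // Pass 1: exact brand name match.
  for (const phrase of phrases) {
    const brand = brands.find((b) => b.name.toLowerCase() === phrase);
    if (brand) return { brand_id: brand.id, brand_mentioned: brand.name, confidence: 95 };
  }

  // Pass 2: prefix match on single tokens.
  for (const token of tokens) {
    if (token.length < 4) continue; // skip short tokens to limit false positives
    const brand = brands.find((b) => b.name.toLowerCase().startsWith(token));
    if (brand) return { brand_id: brand.id, brand_mentioned: brand.name, confidence: 70 };
  }
  return null;
}

console.log(findBrandMention("slack vs teams comparison"));
// { brand_id: 1, brand_mentioned: 'Slack', confidence: 95 }
```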
Stage 4: LLM (Expensive, ~$0.0001/query)
Workers AI fallback for semantic dimensions that rules/vectorize can't handle.
Classifies (when needed):
- expertise_level
- buyer_behavior
- journey_moment
- topic_entity_type
- use_case_type
Files:
- src/lib/keyword-classifier-llm.js (to be implemented)
Database Schema
Keywords Table Classification Columns
-- v1 dimensions (migration 0093)
ALTER TABLE keywords ADD COLUMN classification_funnel_stage TEXT;
ALTER TABLE keywords ADD COLUMN classification_funnel_stage_confidence REAL;
ALTER TABLE keywords ADD COLUMN classification_intent_type TEXT;
ALTER TABLE keywords ADD COLUMN classification_keyword_pattern TEXT;
ALTER TABLE keywords ADD COLUMN classification_ranking_difficulty TEXT;
ALTER TABLE keywords ADD COLUMN classification_trend_velocity TEXT;
ALTER TABLE keywords ADD COLUMN classification_seasonality TEXT;
ALTER TABLE keywords ADD COLUMN classification_commercial_value_index REAL;
ALTER TABLE keywords ADD COLUMN classification_competitive_density TEXT;
ALTER TABLE keywords ADD COLUMN classification_sentiment_intent TEXT;
ALTER TABLE keywords ADD COLUMN classification_content_format_intent TEXT;
ALTER TABLE keywords ADD COLUMN classification_serp_intent_signals TEXT; -- JSON array
-- v2 refined dimensions (migration 0094)
ALTER TABLE keywords ADD COLUMN classification_demand_pattern TEXT;
ALTER TABLE keywords ADD COLUMN classification_content_decay TEXT;
ALTER TABLE keywords ADD COLUMN classification_expertise_level TEXT;
ALTER TABLE keywords ADD COLUMN classification_buyer_behavior TEXT;
ALTER TABLE keywords ADD COLUMN classification_role_context TEXT;
ALTER TABLE keywords ADD COLUMN classification_journey_moment TEXT;
ALTER TABLE keywords ADD COLUMN classification_journey_direction TEXT;
ALTER TABLE keywords ADD COLUMN classification_query_specificity TEXT;
ALTER TABLE keywords ADD COLUMN classification_has_brand_mention INTEGER DEFAULT 0;
ALTER TABLE keywords ADD COLUMN classification_brand_mentioned TEXT;
-- Metadata
ALTER TABLE keywords ADD COLUMN classification_version INTEGER DEFAULT 2;
ALTER TABLE keywords ADD COLUMN classification_source TEXT; -- 'rules', 'vectorize', 'llm'
ALTER TABLE keywords ADD COLUMN classification_completed_at INTEGER;
ALTER TABLE keywords ADD COLUMN classification_needs_llm INTEGER DEFAULT 0;
Indexes
CREATE INDEX idx_keywords_journey_moment ON keywords(classification_journey_moment);
CREATE INDEX idx_keywords_expertise_level ON keywords(classification_expertise_level);
CREATE INDEX idx_keywords_buyer_behavior ON keywords(classification_buyer_behavior);
CREATE INDEX idx_keywords_query_specificity ON keywords(classification_query_specificity);
CREATE INDEX idx_keywords_has_brand_mention ON keywords(classification_has_brand_mention);
CREATE INDEX idx_keywords_needs_llm ON keywords(classification_needs_llm)
  WHERE classification_needs_llm = 1;
Usage
Classify a Single Keyword
import { classifyKeyword } from "./lib/keyword-classifier.js";
const keywordData = {
  keyword: "best project management software for small business",
  keyword_info: {
    search_volume: 2900,
    cpc: 12.50,
    competition: 0.85,
  },
  keyword_properties: {
    keyword_difficulty: 65,
  },
  search_intent_info: {
    main_intent: "commercial",
  },
};
const result = await classifyKeyword(keywordData, {}, env);
console.log(result.classification.journey_moment);
// { value: "option_evaluating", confidence: 85, signals: ["pattern=option_evaluating"] }
console.log(result.classification.query_specificity);
// { value: "long_tail", confidence: 85, signals: ["tokens=7"] }
Batch Classification
import { classifyKeywordBatch, getClassificationStats } from "./lib/keyword-classifier.js";
const keywords = [
  { keyword: "how to use slack" },
  { keyword: "slack vs teams comparison" },
  { keyword: "buy slack enterprise" },
];
const results = await classifyKeywordBatch(keywords, {}, env);
const stats = getClassificationStats(results);
console.log(stats);
// {
//   total: 3,
//   successful: 3,
//   by_journey_moment: { onboarding: 1, feature_comparing: 1, purchase_ready: 1 },
//   average_confidence: 82
// }
Get Unclassified Keywords
import { getUnclassifiedKeywords, classifyKeyword, saveKeywordClassification } from "./lib/keyword-classifier.js";
// Get keywords needing classification
const keywords = await getUnclassifiedKeywords(env, 100);
for (const kw of keywords) {
  const result = await classifyKeyword(kw, {}, env);
  await saveKeywordClassification(kw.id, result.classification, env);
}
Classification Triggers
Keywords are classified automatically through three paths:
- ranked_keywords endpoint - New keywords returned by the DataForSEO ranked keywords endpoint
- SERP tracking - New keywords discovered in SERP results
- Consumer queue - Processes keywords where classification_completed_at IS NULL
Implicit vs Explicit Triggering
Implicit (recommended):
- Data sources just insert keywords normally
- Consumer watches for classification_completed_at IS NULL
- No need for sources to know about classification
Explicit:
- Call classifyKeyword() directly after inserting
- Use when you need immediate classification
Vectorize Index Setup
Create Index
wrangler vectorize create keyword-classifier \
  --dimensions=768 \
  --metric=cosine
Bind in wrangler.toml
[[vectorize]]
binding = "VECTORIZE_KEYWORDS"
index_name = "keyword-classifier"
Initialize with Seed Examples
import { initializeSeedExamples } from "./lib/keyword-classifier-vectorize.js";
const seedExamples = [
  {
    keyword: "how to start a blog",
    classifications: {
      journey_moment: "solution_curious",
      expertise_level: "novice",
      role_context: "hobbyist",
    },
  },
  // ... more examples
];
await initializeSeedExamples(seedExamples, env);
Cost Analysis
| Stage | Cost per Query | Coverage |
|---|---|---|
| Rules | $0 | ~60% of dimensions |
| Vectorize | ~$0.00001 | ~25% of dimensions |
| Brand Lookup | $0 | 1 dimension |
| LLM | ~$0.0001 | ~15% (fallback) |
Typical cost per keyword: $0.00001 - $0.0001 depending on complexity.
At scale (100k keywords):
- Rules only: $0
- Rules + Vectorize: ~$1
- Full pipeline (10% LLM): ~$2 ($1 Vectorize + 10,000 LLM calls × $0.0001)
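These estimates can be reproduced from the per-query rates in the table. The sketch below assumes every keyword incurs one Vectorize query and that only a chosen share of keywords falls through to the LLM. The function name and the llmShare parameter are hypothetical.

```javascript
// Per-query rates taken from the cost table above.
const VECTORIZE_COST = 0.00001;
const LLM_COST = 0.0001;

// llmShare: fraction of keywords that reach the LLM stage (an estimate you supply).
function estimatePipelineCost(keywordCount, llmShare = 0) {
  const vectorize = keywordCount * VECTORIZE_COST;
  const llm = keywordCount * llmShare * LLM_COST;
  return Math.round((vectorize + llm) * 100) / 100; // round to cents
}

console.log(estimatePipelineCost(100000)); // 1  (Rules + Vectorize only)
```

Passing a second argument, for example estimatePipelineCost(100000, 0.15), adds the LLM fallback share to the estimate.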
Frontend Filtering
Example API filtering by classification:
-- Get commercial keywords in decision phase
SELECT * FROM keywords
WHERE classification_journey_moment IN ('provider_selecting', 'purchase_ready', 'purchase_completing')
  AND classification_intent_type = 'transactional'
  AND classification_commercial_value_index > 100
ORDER BY classification_commercial_value_index DESC;
-- Get beginner informational content
SELECT * FROM keywords
WHERE classification_expertise_level = 'novice'
  AND classification_intent_type = 'informational'
  AND classification_content_decay IN ('timeless', 'slow_decay');
-- Get brand mentions for competitive analysis
SELECT * FROM keywords
WHERE classification_has_brand_mention = 1
  AND classification_brand_mentioned != 'OurBrand';
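When filters come from a frontend, the WHERE clause is usually built dynamically. The sketch below shows one way to do that with bound parameters and a column whitelist. The function name and the whitelist contents are assumptions, and only equality and IN filters are covered.

```javascript
// Whitelist of filterable columns, taken from the schema above.
const FILTERABLE = new Set([
  "classification_journey_moment",
  "classification_expertise_level",
  "classification_intent_type",
  "classification_query_specificity",
  "classification_has_brand_mention",
]);

// Builds a parameterized query; values are never interpolated into the SQL string.
function buildKeywordFilterQuery(filters) {
  const clauses = [];
  const params = [];
  for (const [column, value] of Object.entries(filters)) {
    if (!FILTERABLE.has(column)) throw new Error(`Unsupported filter: ${column}`);
    if (Array.isArray(value)) {
      if (value.length === 0) continue; // skip empty IN lists
      clauses.push(`${column} IN (${value.map(() => "?").join(", ")})`);
      params.push(...value);
    } else {
      clauses.push(`${column} = ?`);
      params.push(value);
    }
  }
  const where = clauses.length ? ` WHERE ${clauses.join(" AND ")}` : "";
  return { sql: `SELECT * FROM keywords${where}`, params };
}

const query = buildKeywordFilterQuery({
  classification_journey_moment: ["provider_selecting", "purchase_ready"],
  classification_intent_type: "transactional",
});
console.log(query.sql);
console.log(query.params);
```

If the keywords table lives in Cloudflare D1 behind a binding named DB (both are assumptions here), the result could be run with env.DB.prepare(query.sql).bind(...query.params).all().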
Troubleshooting
Low Confidence Classifications
Check: Rules engine output
import { classifyKeywordWithRules } from "./lib/keyword-classifier-rules.js";
const result = classifyKeywordWithRules(keywordData);
console.log(result.dimensions_needing_llm);
Vectorize Not Finding Matches
Check: Index has seed examples
import { getIndexStats } from "./lib/keyword-classifier-vectorize.js";
const stats = await getIndexStats(env);
console.log(stats); // { dimensions: 768, count: 150 }
Fix: Initialize with more seed examples for better coverage.
Brand Detection Missing Known Brands
Check: Brand exists in brands table
SELECT * FROM brands WHERE LOWER(name) LIKE '%slack%';
Fix: Ensure brand is in brands table with correct name.
Files Reference
| File | Purpose |
|---|---|
| src/lib/keyword-classifier.js | Main orchestrator (4-stage pipeline) |
| src/lib/keyword-classifier-rules.js | Stage 1: Rules engine |
| src/lib/keyword-classifier-vectorize.js | Stage 2: Vectorize similarity |
| src/lib/keyword-classification-constants.js | Enums and pattern rules |
| migrations/0093_keyword_classification_columns.sql | v1 schema |
| migrations/0094_refine_keyword_dimensions.sql | v2 schema |