Keyword Classification System

Multi-dimensional keyword intelligence using a 4-stage classification pipeline.


Overview

The keyword classification system provides:

  • Multi-Dimensional Classification - 20+ dimensions covering intent, journey, behavior, and more
  • 4-Stage Pipeline - Rules → Vectorize → Brand Lookup → LLM (progressive enhancement)
  • Self-Learning - High-confidence LLM results feed back into Vectorize
  • Implicit Triggers - Keywords with classification_completed_at IS NULL get processed automatically

Key Capabilities:

  • Journey moment tracking (12 granular stages vs 5 broad funnel stages)
  • Expertise level and buyer behavior profiling
  • Brand mention detection
  • Query specificity (head/torso/long-tail)
  • Content decay prediction
  • Commercial value indexing

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Keyword Classification                      │
├─────────────────────────────────────────────────────────────────┤
│  Classification Triggers:                                       │
│    - New keywords from ranked_keywords endpoint                 │
│    - New keywords from SERP tracking                            │
│    - Keywords with classification_completed_at IS NULL          │
│                                                                 │
│  4-Stage Pipeline:                                              │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐  │
│  │  Stage 1  │──▶│  Stage 2  │──▶│  Stage 3  │──▶│  Stage 4  │  │
│  │   Rules   │   │ Vectorize │   │   Brand   │   │    LLM    │  │
│  │  (free)   │   │  (cheap)  │   │  (free)   │   │(expensive)│  │
│  └───────────┘   └───────────┘   └───────────┘   └───────────┘  │
│        │               │               │               │        │
│        └───────────────┴───────┬───────┴───────────────┘        │
│                                │                                │
│                      Final Classification                       │
│                                │                                │
│                       ┌────────▼────────┐                       │
│                       │  Learning Loop  │                       │
│                       │ (High confidence│                       │
│                       │  → Vectorize)   │                       │
│                       └─────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘
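
The progressive-enhancement flow in the diagram can be sketched as a loop over stages ordered from cheapest to most expensive, where each stage only fills dimensions that earlier stages left open. This is a minimal sketch, not the actual orchestrator in keyword-classifier.js: `runPipeline`, the stage object shape (`name`, `run`), and the `unresolved` field are all hypothetical names chosen for illustration.

```javascript
// Hypothetical orchestrator sketch. Each stage is { name, run }, where
// run(keywordData, missing, soFar) resolves to
// { dimension: { value, confidence } } for the dimensions it could classify.
async function runPipeline(keywordData, stages, requiredDimensions) {
  const classification = {};
  const sources = [];

  for (const stage of stages) {
    // Dimensions that no earlier (cheaper) stage has filled in yet.
    const missing = requiredDimensions.filter((d) => !(d in classification));
    if (missing.length === 0) break; // nothing left, skip the costlier stages

    const result = await stage.run(keywordData, missing, classification);
    let contributed = false;
    for (const [dimension, entry] of Object.entries(result)) {
      if (!(dimension in classification)) {
        classification[dimension] = entry; // earlier stages take precedence
        contributed = true;
      }
    }
    if (contributed) sources.push(stage.name);
  }

  // Anything still open after the last stage is reported, not guessed.
  const unresolved = requiredDimensions.filter((d) => !(d in classification));
  return { classification, sources, unresolved };
}
```

Ordering the stages by cost means the LLM stage is only reached when cheaper stages leave something unresolved, which is where the savings in the cost table come from.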

Classification Dimensions

The v2 dimensions below were split out of compound v1 dimensions to make filtering and analysis easier.

Journey Tracking

| Dimension | Values | Description |
|---|---|---|
| journey_moment | 12 values | Specific point in buyer journey |
| journey_direction | entering, progressing, stalled, regressing, exiting | Movement through journey |

Journey Moments:

  • Awareness: problem_unaware, problem_aware, solution_curious
  • Consideration: category_exploring, feature_comparing, option_evaluating
  • Decision: provider_selecting, purchase_ready, purchase_completing
  • Post-Purchase: onboarding, optimizing, advocating
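
To roll the 12 moments back up into their four groups (for example, for a coarse dashboard filter), a lookup table is enough. The sketch below assumes lowercase phase keys such as `post_purchase`; those keys and the `phaseForMoment` helper are illustrative, not names from the codebase.

```javascript
// The four groups and their moments, exactly as listed above.
const JOURNEY_PHASES = {
  awareness: ["problem_unaware", "problem_aware", "solution_curious"],
  consideration: ["category_exploring", "feature_comparing", "option_evaluating"],
  decision: ["provider_selecting", "purchase_ready", "purchase_completing"],
  post_purchase: ["onboarding", "optimizing", "advocating"],
};

// Reverse index built once: moment -> phase.
const PHASE_BY_MOMENT = new Map(
  Object.entries(JOURNEY_PHASES).flatMap(([phase, moments]) =>
    moments.map((moment) => [moment, phase])
  )
);

// Returns the phase for a journey_moment, or null for unknown values.
function phaseForMoment(moment) {
  return PHASE_BY_MOMENT.get(moment) ?? null;
}
```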

Searcher Profile

| Dimension | Values | Description |
|---|---|---|
| expertise_level | novice, intermediate, advanced, expert | Knowledge level |
| buyer_behavior | impulsive, methodical, price_sensitive, quality_focused, brand_loyal, research_heavy | Purchase approach |
| role_context | individual_consumer, small_business_owner, enterprise_buyer, developer, marketer, student, professional, hobbyist | Who is searching |

Content & Demand

| Dimension | Values | Description |
|---|---|---|
| demand_pattern | evergreen, seasonal_recurring, event_driven, trending_spike, declining | When demand occurs |
| content_decay | timeless, slow_decay, moderate_decay, fast_decay, real_time | How quickly content goes stale |
| query_specificity | head, torso, long_tail, ultra_long_tail | Query length/specificity |

Brand Detection

| Dimension | Values | Description |
|---|---|---|
| has_brand_mention | boolean | Does keyword contain a brand name |
| brand_mentioned | text | The actual brand name if detected |

v1 Dimensions (Backwards Compatible)

These are still populated but the v2 dimensions provide more granular filtering.

| Dimension | Values |
|---|---|
| funnel_stage | awareness, consideration, decision, retention, advocacy |
| intent_type | informational, navigational, commercial_investigation, transactional, local, support |
| persona | researcher_beginner, practitioner, buyer_decision_maker, etc. |
| keyword_pattern | question, comparison, superlative, problem, use_case, etc. |
| ranking_difficulty | very_easy, easy, medium, hard, very_hard |
| trend_velocity | rising_fast, rising_slow, stable, declining, volatile |
| seasonality | evergreen, seasonal, event_driven |
| competitive_density | low, medium, high, dominated |
| commercial_value_index | numeric (volume × CPC) |
| sentiment_intent | positive, neutral, negative, urgent |
| content_format_intent | listicle, comparison_review, how_to_guide, video_preferred, etc. |
| serp_intent_signals | array of SERP features present |
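
As a quick worked example of commercial_value_index (volume × CPC), here is a small sketch. The function name and the choice to return null for missing inputs are assumptions made for illustration.

```javascript
// commercial_value_index as described in the table: search volume times CPC.
function commercialValueIndex(searchVolume, cpc) {
  // Missing or non-numeric inputs yield null rather than a misleading 0.
  if (!Number.isFinite(searchVolume) || !Number.isFinite(cpc)) return null;
  return searchVolume * cpc;
}

// 2,900 monthly searches at a $12.50 CPC:
console.log(commercialValueIndex(2900, 12.5)); // 36250
```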

Pipeline Stages

Stage 1: Rules Engine (Free)

Handles ~60% of classifications using regex patterns and DataForSEO signals.

Classifies:

  • keyword_pattern (regex-based)
  • ranking_difficulty (from KD score)
  • trend_velocity (from monthly_searches)
  • demand_pattern (from search trends)
  • query_specificity (from token count + volume)
  • intent_type (from DataForSEO + patterns)
  • sentiment_intent (from keyword text)
  • content_format_intent (from patterns + SERP)
  • competitive_density (from competition + CPC)

Files:

  • src/lib/keyword-classifier-rules.js
  • src/lib/keyword-classification-constants.js
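
To give a feel for what rule-based classification of this kind looks like, here is a minimal sketch of three of the dimensions listed above. Every regex, threshold, and function name below is an illustrative assumption; the real rules live in the files listed above and may differ.

```javascript
// Illustrative rules only: every regex and threshold here is an assumption.
const PATTERN_RULES = [
  { pattern: "comparison", regex: /\b(vs|versus|compared to)\b/i },
  { pattern: "question", regex: /^(how|what|why|when|where|who|which|can|does|is)\b/i },
  { pattern: "superlative", regex: /\b(best|top|cheapest|fastest)\b/i },
];

// keyword_pattern: first matching regex wins.
function classifyKeywordPattern(keyword) {
  const rule = PATTERN_RULES.find((r) => r.regex.test(keyword));
  return rule ? rule.pattern : null; // null: leave the dimension to a later stage
}

// ranking_difficulty: bucket a 0-100 keyword difficulty (KD) score.
function classifyRankingDifficulty(kd) {
  if (!Number.isFinite(kd)) return null;
  if (kd < 20) return "very_easy";
  if (kd < 40) return "easy";
  if (kd < 60) return "medium";
  if (kd < 80) return "hard";
  return "very_hard";
}

// query_specificity: token count decides the bucket, and very high volume
// pulls two-word queries up to "head".
function classifyQuerySpecificity(keyword, searchVolume = 0) {
  const tokens = keyword.trim().split(/\s+/).filter(Boolean).length;
  let value;
  if (tokens <= 1 || (tokens === 2 && searchVolume >= 10000)) value = "head";
  else if (tokens <= 3) value = "torso";
  else if (tokens <= 7) value = "long_tail";
  else value = "ultra_long_tail";
  return { value, signals: [`tokens=${tokens}`] };
}
```

Returning null instead of a guess keeps the rules stage honest: an unmatched dimension stays open for Vectorize or the LLM.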

Stage 2: Vectorize (Cheap, ~$0.00001/query)

Uses Cloudflare Vectorize for semantic similarity matching.

How it works:

  1. Convert keyword + metadata to feature text
  2. Embed using Workers AI (bge-base-en-v1.5)
  3. Find similar keywords in Vectorize index
  4. Vote on classification from top-k matches
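
Step 4 can be sketched as a similarity-weighted vote. The `matches` argument below mirrors the `matches` array of a Vectorize query result (each entry has a `score` and optional `metadata`); storing each dimension's label in `metadata`, the `minScore` cutoff of 0.7, and the function name are assumptions for illustration.

```javascript
// Weighted top-k vote for one dimension.
// matches: [{ id, score, metadata: { journey_moment: "...", ... } }, ...]
function voteOnDimension(matches, dimension, minScore = 0.7) {
  const tally = new Map();
  let total = 0;

  for (const match of matches) {
    const label = match.metadata?.[dimension];
    // Ignore neighbours that are too dissimilar or lack a label for this dimension.
    if (!label || match.score < minScore) continue;
    tally.set(label, (tally.get(label) ?? 0) + match.score);
    total += match.score;
  }
  if (total === 0) return null; // no usable neighbours: defer to a later stage

  // Winner is the label with the largest summed similarity.
  const [value, weight] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];
  return { value, confidence: Math.round((weight / total) * 100) };
}
```

Weighting by score rather than counting votes lets one very close neighbour outweigh several marginal ones.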

Classifies:

  • expertise_level
  • buyer_behavior
  • role_context
  • journey_moment
  • topic_entity_type
  • use_case_type

Files:

  • src/lib/keyword-classifier-vectorize.js

Self-Learning: LLM classifications with confidence above 80% are added to the Vectorize index, so similar keywords can later be resolved in Stage 2 without an LLM call.
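
The feedback step could look roughly like the sketch below, which builds the record to write back to the index. It assumes each classified dimension carries a per-dimension `source` field and that labels are stored as vector metadata; both are illustrative assumptions, as is the function name.

```javascript
const LEARNING_THRESHOLD = 80; // confidence cutoff described in the text

// Builds a vector record ({ id, values, metadata }) containing only the
// LLM-produced labels that clear the threshold. Returns null if none qualify.
function buildLearningVector(keywordId, embedding, classification) {
  const metadata = {};
  for (const [dimension, entry] of Object.entries(classification)) {
    if (entry.source === "llm" && entry.confidence > LEARNING_THRESHOLD) {
      metadata[dimension] = entry.value;
    }
  }
  if (Object.keys(metadata).length === 0) return null;
  return { id: String(keywordId), values: embedding, metadata };
}
```

With the VECTORIZE_KEYWORDS binding configured later in this document, a non-null record would then be written with something like `await env.VECTORIZE_KEYWORDS.upsert([record])`.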

Stage 3: Brand Lookup (Free)

Checks if keyword mentions a known brand from the brands table.

How it works:

  1. Tokenize keyword
  2. Try exact brand name matches
  3. Try prefix matches with LIKE
  4. Return brand_id and confidence

Classifies:

  • has_brand_mention
  • brand_mentioned
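
The lookup steps above can be sketched in memory as follows. Here `brands` stands in for rows of the brands table, `startsWith` plays the role of the SQL `LIKE 'token%'` prefix match, and the confidence values (95 and 70) and minimum token length are made-up illustrative numbers.

```javascript
// brands: [{ id, name }, ...], a stand-in for rows of the brands table.
function findBrandMention(keyword, brands) {
  // Step 1: tokenize (ASCII-only split, a simplification).
  const tokens = keyword.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  // Single tokens plus adjacent pairs, so two-word brand names can match exactly.
  const phrases = [
    ...tokens,
    ...tokens.slice(0, -1).map((token, i) => `${token} ${tokens[i + 1]}`),
  ];

  // Step 2: exact brand name match.
  for (const brand of brands) {
    if (phrases.includes(brand.name.toLowerCase())) {
      return { has_brand_mention: true, brand_id: brand.id, brand_mentioned: brand.name, confidence: 95 };
    }
  }

  // Step 3: prefix match, the in-memory analogue of name LIKE 'token%'.
  // Short tokens are skipped to avoid matching on words like "to" or "for".
  for (const token of tokens) {
    if (token.length < 4) continue;
    const brand = brands.find((b) => b.name.toLowerCase().startsWith(token));
    if (brand) {
      return { has_brand_mention: true, brand_id: brand.id, brand_mentioned: brand.name, confidence: 70 };
    }
  }

  // Step 4: nothing matched.
  return { has_brand_mention: false, brand_id: null, brand_mentioned: null, confidence: 0 };
}
```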

Stage 4: LLM (Expensive, ~$0.0001/query)

Workers AI fallback for semantic dimensions that rules/vectorize can't handle.

Classifies (when needed):

  • expertise_level
  • buyer_behavior
  • journey_moment
  • topic_entity_type
  • use_case_type

Files:

  • src/lib/keyword-classifier-llm.js (to be implemented)
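
Since this stage is not implemented yet, the following is only a sketch of one piece it will likely need: validating the model's reply before trusting it. It assumes the model is asked to reply with JSON of the form `{ "dimension": { "value": ..., "confidence": ... } }`; that reply format, the function name, and the subset of dimensions shown are assumptions.

```javascript
// Allowed values, copied from the dimension tables (subset for brevity).
const ALLOWED_VALUES = {
  expertise_level: ["novice", "intermediate", "advanced", "expert"],
  buyer_behavior: [
    "impulsive", "methodical", "price_sensitive",
    "quality_focused", "brand_loyal", "research_heavy",
  ],
};

// Extracts and validates a classification from raw model output.
// Unknown dimensions and out-of-enum values are dropped, not repaired.
function parseLlmClassification(rawText) {
  // Models often wrap JSON in prose, so take the outermost {...} span.
  const start = rawText.indexOf("{");
  const end = rawText.lastIndexOf("}");
  if (start === -1 || end <= start) return {};

  let parsed;
  try {
    parsed = JSON.parse(rawText.slice(start, end + 1));
  } catch {
    return {}; // malformed JSON: treat as "no answer"
  }

  const accepted = {};
  for (const [dimension, allowed] of Object.entries(ALLOWED_VALUES)) {
    const entry = parsed[dimension];
    if (entry && allowed.includes(entry.value) && Number.isFinite(entry.confidence)) {
      // Clamp confidence to the 0-100 scale used elsewhere in this document.
      const confidence = Math.max(0, Math.min(100, entry.confidence));
      accepted[dimension] = { value: entry.value, confidence };
    }
  }
  return accepted;
}
```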

Database Schema

Keywords Table Classification Columns

-- v1 dimensions (migration 0093)
ALTER TABLE keywords ADD COLUMN classification_funnel_stage TEXT;
ALTER TABLE keywords ADD COLUMN classification_funnel_stage_confidence REAL;
ALTER TABLE keywords ADD COLUMN classification_intent_type TEXT;
ALTER TABLE keywords ADD COLUMN classification_keyword_pattern TEXT;
ALTER TABLE keywords ADD COLUMN classification_ranking_difficulty TEXT;
ALTER TABLE keywords ADD COLUMN classification_trend_velocity TEXT;
ALTER TABLE keywords ADD COLUMN classification_seasonality TEXT;
ALTER TABLE keywords ADD COLUMN classification_commercial_value_index REAL;
ALTER TABLE keywords ADD COLUMN classification_competitive_density TEXT;
ALTER TABLE keywords ADD COLUMN classification_sentiment_intent TEXT;
ALTER TABLE keywords ADD COLUMN classification_content_format_intent TEXT;
ALTER TABLE keywords ADD COLUMN classification_serp_intent_signals TEXT; -- JSON array

-- v2 refined dimensions (migration 0094)
ALTER TABLE keywords ADD COLUMN classification_demand_pattern TEXT;
ALTER TABLE keywords ADD COLUMN classification_content_decay TEXT;
ALTER TABLE keywords ADD COLUMN classification_expertise_level TEXT;
ALTER TABLE keywords ADD COLUMN classification_buyer_behavior TEXT;
ALTER TABLE keywords ADD COLUMN classification_role_context TEXT;
ALTER TABLE keywords ADD COLUMN classification_journey_moment TEXT;
ALTER TABLE keywords ADD COLUMN classification_journey_direction TEXT;
ALTER TABLE keywords ADD COLUMN classification_query_specificity TEXT;
ALTER TABLE keywords ADD COLUMN classification_has_brand_mention INTEGER DEFAULT 0;
ALTER TABLE keywords ADD COLUMN classification_brand_mentioned TEXT;

-- Metadata
ALTER TABLE keywords ADD COLUMN classification_version INTEGER DEFAULT 2;
ALTER TABLE keywords ADD COLUMN classification_source TEXT; -- 'rules', 'vectorize', 'llm'
ALTER TABLE keywords ADD COLUMN classification_completed_at INTEGER;
ALTER TABLE keywords ADD COLUMN classification_needs_llm INTEGER DEFAULT 0;

Indexes

CREATE INDEX idx_keywords_journey_moment ON keywords(classification_journey_moment);
CREATE INDEX idx_keywords_expertise_level ON keywords(classification_expertise_level);
CREATE INDEX idx_keywords_buyer_behavior ON keywords(classification_buyer_behavior);
CREATE INDEX idx_keywords_query_specificity ON keywords(classification_query_specificity);
CREATE INDEX idx_keywords_has_brand_mention ON keywords(classification_has_brand_mention);
CREATE INDEX idx_keywords_needs_llm ON keywords(classification_needs_llm)
WHERE classification_needs_llm = 1;
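
To illustrate how a classification result maps onto these columns, here is a sketch of building a parameterized UPDATE. It is not the actual saveKeywordClassification implementation: the function name, the column whitelist subset, the `id` primary-key column, and storing classification_completed_at as Unix seconds are all assumptions.

```javascript
// Dimensions allowed to be written (subset of the schema above).
const CLASSIFICATION_COLUMNS = new Set([
  "journey_moment", "journey_direction", "expertise_level", "buyer_behavior",
  "role_context", "query_specificity", "has_brand_mention", "brand_mentioned",
]);

// Returns { sql, params } for a prepared statement; never interpolates values.
function buildClassificationUpdate(keywordId, classification, source, now = Date.now()) {
  const assignments = [];
  const params = [];

  for (const [dimension, entry] of Object.entries(classification)) {
    if (!CLASSIFICATION_COLUMNS.has(dimension)) continue; // ignore unknown dimensions
    assignments.push(`classification_${dimension} = ?`);
    // Booleans are stored as INTEGER 0/1, matching the schema.
    params.push(dimension === "has_brand_mention" ? (entry.value ? 1 : 0) : entry.value);
  }

  // Setting completed_at removes the row from the "IS NULL" work queue.
  assignments.push("classification_source = ?", "classification_completed_at = ?");
  params.push(source, Math.floor(now / 1000));

  params.push(keywordId);
  return { sql: `UPDATE keywords SET ${assignments.join(", ")} WHERE id = ?`, params };
}
```

Because column names come only from the whitelist and values only travel as bound parameters, the generated statement is safe to pass to a prepared-statement API.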

Usage

Classify a Single Keyword

import { classifyKeyword } from "./lib/keyword-classifier.js";

const keywordData = {
  keyword: "best project management software for small business",
  keyword_info: {
    search_volume: 2900,
    cpc: 12.50,
    competition: 0.85,
  },
  keyword_properties: {
    keyword_difficulty: 65,
  },
  search_intent_info: {
    main_intent: "commercial",
  },
};

const result = await classifyKeyword(keywordData, {}, env);

console.log(result.classification.journey_moment);
// { value: "option_evaluating", confidence: 85, signals: ["pattern=option_evaluating"] }

console.log(result.classification.query_specificity);
// { value: "long_tail", confidence: 85, signals: ["tokens=7"] }

Batch Classification

import { classifyKeywordBatch, getClassificationStats } from "./lib/keyword-classifier.js";

const keywords = [
  { keyword: "how to use slack" },
  { keyword: "slack vs teams comparison" },
  { keyword: "buy slack enterprise" },
];

const results = await classifyKeywordBatch(keywords, {}, env);
const stats = getClassificationStats(results);

console.log(stats);
// {
//   total: 3,
//   successful: 3,
//   by_journey_moment: { onboarding: 1, feature_comparing: 1, purchase_ready: 1 },
//   average_confidence: 82
// }

Get Unclassified Keywords

import { getUnclassifiedKeywords, classifyKeyword, saveKeywordClassification } from "./lib/keyword-classifier.js";

// Get keywords needing classification
const keywords = await getUnclassifiedKeywords(env, 100);

for (const kw of keywords) {
  const result = await classifyKeyword(kw, {}, env);
  await saveKeywordClassification(kw.id, result.classification, env);
}

Classification Triggers

Keywords get classified automatically when:

  1. ranked_keywords endpoint - New keywords returned by the DataForSEO ranked keywords endpoint
  2. SERP tracking - New keywords discovered in SERP results
  3. Consumer queue - Processes keywords where classification_completed_at IS NULL

Implicit vs Explicit Triggering

Implicit (recommended):

  • Data sources just insert keywords normally
  • Consumer watches for classification_completed_at IS NULL
  • No need for sources to know about classification

Explicit:

  • Call classifyKeyword() directly after inserting
  • Use when you need immediate classification

Vectorize Index Setup

Create Index

wrangler vectorize create keyword-classifier \
  --dimensions=768 \
  --metric=cosine

Bind in wrangler.toml

[[vectorize]]
binding = "VECTORIZE_KEYWORDS"
index_name = "keyword-classifier"

Initialize with Seed Examples

import { initializeSeedExamples } from "./lib/keyword-classifier-vectorize.js";

const seedExamples = [
  {
    keyword: "how to start a blog",
    classifications: {
      journey_moment: "solution_curious",
      expertise_level: "novice",
      role_context: "hobbyist",
    },
  },
  // ... more examples
];

await initializeSeedExamples(seedExamples, env);

Cost Analysis

| Stage | Cost per Query | Coverage |
|---|---|---|
| Rules | $0 | ~60% of dimensions |
| Vectorize | ~$0.00001 | ~25% of dimensions |
| Brand Lookup | $0 | 1 dimension |
| LLM | ~$0.0001 | ~15% (fallback) |

Typical cost per keyword: $0.00001 - $0.0001 depending on complexity.

At scale (100k keywords):

  • Rules only: $0
  • Rules + Vectorize: ~$1
  • Full pipeline (10% LLM): ~$2 (~$1 Vectorize for all 100k keywords + ~$1 LLM for 10k keywords at ~$0.0001 each)
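
To re-run these estimates with your own volumes and rates, a small estimator is enough. This sketch assumes every keyword passes through Vectorize and only a fixed share falls through to the LLM; the function and parameter names are illustrative.

```javascript
// Estimates pipeline cost in dollars. Rules and brand lookup are free,
// so only Vectorize and LLM usage contribute.
function estimatePipelineCost({ keywords, vectorizeRate, llmRate, llmShare }) {
  const vectorize = keywords * vectorizeRate;  // every keyword is embedded and queried
  const llm = keywords * llmShare * llmRate;   // only the fallback share reaches the LLM
  return { vectorize, llm, total: vectorize + llm };
}

// Per-query rates taken from the table above, with 10% of keywords needing the LLM.
const estimate = estimatePipelineCost({
  keywords: 100000,
  vectorizeRate: 0.00001,
  llmRate: 0.0001,
  llmShare: 0.1,
});
console.log(estimate.total.toFixed(2));
```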

Frontend Filtering

Example API filtering by classification:

-- Get high-value transactional keywords in the decision phase
SELECT * FROM keywords
WHERE classification_journey_moment IN ('provider_selecting', 'purchase_ready', 'purchase_completing')
  AND classification_intent_type = 'transactional'
  AND classification_commercial_value_index > 100
ORDER BY classification_commercial_value_index DESC;

-- Get beginner informational content
SELECT * FROM keywords
WHERE classification_expertise_level = 'novice'
  AND classification_intent_type = 'informational'
  AND classification_content_decay IN ('timeless', 'slow_decay');

-- Get brand mentions for competitive analysis
SELECT * FROM keywords
WHERE classification_has_brand_mention = 1
  AND classification_brand_mentioned != 'OurBrand';
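
An API layer serving filters like these would typically build the WHERE clause from request parameters. The sketch below shows one way to do that safely, with a column whitelist and bound parameters; the function name, the filter object shape, and the whitelist contents are assumptions, not the actual API code.

```javascript
// Only whitelisted dimensions may appear in generated SQL.
const FILTERABLE = new Set([
  "journey_moment", "expertise_level", "intent_type",
  "content_decay", "query_specificity", "buyer_behavior",
]);

// filters: { dimension: value | [values] }. Returns { sql, params }.
function buildKeywordFilterQuery(filters) {
  const clauses = [];
  const params = [];

  for (const [dimension, values] of Object.entries(filters)) {
    if (!FILTERABLE.has(dimension)) {
      throw new Error(`Unsupported filter: ${dimension}`);
    }
    const list = Array.isArray(values) ? values : [values];
    if (list.length === 0) continue; // empty filter: no constraint
    // One placeholder per value; the values themselves are never interpolated.
    clauses.push(`classification_${dimension} IN (${list.map(() => "?").join(", ")})`);
    params.push(...list);
  }

  const where = clauses.length > 0 ? ` WHERE ${clauses.join(" AND ")}` : "";
  return { sql: `SELECT * FROM keywords${where}`, params };
}
```

For example, `buildKeywordFilterQuery({ expertise_level: "novice", content_decay: ["timeless", "slow_decay"] })` yields a statement with three placeholders and the three values in `params`, ready for a prepared-statement call.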

Troubleshooting

Low Confidence Classifications

Check: Rules engine output

import { classifyKeywordWithRules } from "./lib/keyword-classifier-rules.js";
const result = classifyKeywordWithRules(keywordData);
console.log(result.dimensions_needing_llm);

Vectorize Not Finding Matches

Check: Index has seed examples

import { getIndexStats } from "./lib/keyword-classifier-vectorize.js";
const stats = await getIndexStats(env);
console.log(stats); // { dimensions: 768, count: 150 }

Fix: Initialize with more seed examples for better coverage.

Brand Detection Missing Known Brands

Check: Brand exists in brands table

SELECT * FROM brands WHERE LOWER(name) LIKE '%slack%';

Fix: Ensure brand is in brands table with correct name.


Files Reference

| File | Purpose |
|---|---|
| src/lib/keyword-classifier.js | Main orchestrator (4-stage pipeline) |
| src/lib/keyword-classifier-rules.js | Stage 1: Rules engine |
| src/lib/keyword-classifier-vectorize.js | Stage 2: Vectorize similarity |
| src/lib/keyword-classification-constants.js | Enums and pattern rules |
| migrations/0093_keyword_classification_columns.sql | v1 schema |
| migrations/0094_refine_keyword_dimensions.sql | v2 schema |