
Classification Pipeline Master Plan

Executive Summary

This document tracks the implementation of critical improvements to our three classification pipelines (Domain, URL, Keyword).

STATUS: Sprints 1-5 COMPLETE (December 2024)


Implementation Status

| Sprint | Phase | Status | Notes |
| --- | --- | --- | --- |
| Sprint 1 | Remove Bubble-Up, Domain-First | COMPLETE | Removed maybeUpdateDomainClassification(), added domain-first enforcement |
| Sprint 2 | Learning Improvements | COMPLETE | All pipelines now consistently feed Vectorize |
| Sprint 3 | Negative Learning | COMPLETE | classification_corrections table, correction_patterns, API endpoints |
| Sprint 4 | Admin Console | COMPLETE | Enterprise-grade UI with DataTable, KPIs, charts |
| Sprint 5 | Polish & Monitor | COMPLETE | Documentation, mermaid diagrams updated |

Completed Work

Sprint 1: Clean Foundation (COMPLETE)

1.1 Removed Domain Bubble-Up

  • Removed maybeUpdateDomainClassification() from backlink-classify-consumer.js
  • Removed both call sites (lines 624 and 936)
  • Removed unused DOMAIN_AGGREGATION_* constants

1.2 Enforce Domain Classification Before URL

  • Added domain classification check in ensureUrl() in url-management.js
  • URLs cannot be classified without their domain being classified first
  • If domain not classified, URL classification waits or triggers domain classification
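
The gate above can be sketched as follows. The function and parameter names here are illustrative stand-ins (the real check lives inside ensureUrl() in url-management.js, and the enqueue callbacks would be queue bindings, not plain functions):

```javascript
// Sketch of the domain-first gate: a URL is only queued for classification
// once its domain is classified; otherwise domain classification is
// triggered and the URL is deferred.
function ensureUrlClassification(url, domain, { enqueueDomain, enqueueUrl }) {
  if (!domain.classified) {
    // Domain must be classified first: trigger it and defer the URL.
    enqueueDomain(domain.name);
    return { status: 'deferred', reason: 'domain-not-classified' };
  }
  enqueueUrl(url);
  return { status: 'queued' };
}
```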

Sprint 2: Learning Improvements (COMPLETE)

2.1 Domain Pipeline Learning

  • Moved learning to Stage 6 for ALL classification sources (not just LLM)
  • Imports centralized from classification-config.js:
    • DOMAIN_LEARNING_MIN_CONFIDENCE (80%)
    • shouldTriggerLearning(confidence, 'domain')
  • Rules engine matches at 95%+ now feed Vectorize
  • LLM results at 80%+ feed Vectorize

2.2 URL Pipeline Learning

  • Standardized learning thresholds via classification-config.js
  • Uses LEARNING_MIN_CONFIDENCE (65%) consistently
  • Added shouldTriggerLearning() checks in learnFromClassification()
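
A minimal sketch of the centralized thresholds in classification-config.js. The constant names and the shouldTriggerLearning(confidence, pipeline) signature follow this document; the exact module shape and the keyword constant name are assumptions:

```javascript
// Per-pipeline minimum confidence for feeding results back into Vectorize.
const LEARNING_MIN_CONFIDENCE = 65;         // URL pipeline
const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;  // domain pipeline
const KEYWORD_LEARNING_MIN_CONFIDENCE = 70; // keyword pipeline (assumed name)

const LEARNING_THRESHOLDS = {
  url: LEARNING_MIN_CONFIDENCE,
  domain: DOMAIN_LEARNING_MIN_CONFIDENCE,
  keyword: KEYWORD_LEARNING_MIN_CONFIDENCE,
};

// Only high-confidence results trigger self-learning.
function shouldTriggerLearning(confidence, pipeline) {
  const min = LEARNING_THRESHOLDS[pipeline];
  if (min === undefined) throw new Error(`unknown pipeline: ${pipeline}`);
  return confidence >= min;
}
```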

2.3 Keyword Pipeline

  • Already well-designed, learning at 70% threshold
  • No changes needed, serves as model for other pipelines

Sprint 3: Negative Learning (COMPLETE)

3.1 Database Schema

Created migrations/0119_classification_corrections.sql:

-- Track classification corrections/feedback
CREATE TABLE classification_corrections (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL, -- 'domain', 'url', 'keyword'
  entity_id INTEGER NOT NULL,
  original_dimension TEXT NOT NULL,
  original_value TEXT,
  corrected_value TEXT NOT NULL,
  confidence_before INTEGER,
  notes TEXT,
  created_by TEXT DEFAULT 'system',
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  processed_at INTEGER,
  UNIQUE(entity_type, entity_id, original_dimension, corrected_value)
);

-- Track correction patterns for generating rules
CREATE TABLE correction_patterns (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL,
  dimension TEXT NOT NULL,
  from_value TEXT,
  to_value TEXT NOT NULL,
  count INTEGER DEFAULT 1,
  suggested_rule TEXT,
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  last_seen_at INTEGER NOT NULL DEFAULT (unixepoch()),
  UNIQUE(entity_type, dimension, from_value, to_value)
);
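
The correction_patterns UNIQUE constraint means each (entity_type, dimension, from_value, to_value) combination is a single row with a running count. This in-memory sketch mirrors that behavior (the real implementation would presumably use an INSERT ... ON CONFLICT upsert against D1):

```javascript
// Aggregate corrections into one pattern row per unique combination,
// incrementing count on repeats -- the in-memory analogue of the table's
// UNIQUE(entity_type, dimension, from_value, to_value) upsert.
function recordPattern(patterns, { entityType, dimension, fromValue, toValue }) {
  const key = [entityType, dimension, fromValue ?? '', toValue].join('\u0000');
  const existing = patterns.get(key);
  if (existing) {
    existing.count += 1;
    existing.lastSeenAt = Date.now();
    return existing;
  }
  const row = { entityType, dimension, fromValue, toValue, count: 1, lastSeenAt: Date.now() };
  patterns.set(key, row);
  return row;
}
```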

3.2 Corrections Module

Created src/lib/classification-corrections.js:

  • recordCorrection() - Save correction and update entity
  • getPendingCorrections() - Get unprocessed corrections
  • getSuggestedRules() - Analyze patterns for auto-rules
  • getCorrectionHistory() - View correction history
  • getCorrectionStats() - Dashboard statistics
  • processPendingCorrections() - Batch process for Vectorize feedback
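
A sketch of the idea behind getSuggestedRules(): once the same correction pattern has enough support, propose a rule for the rules engine. The minimum count of 3 and the rule-string format are assumptions, not from the source:

```javascript
// Turn frequently repeated correction patterns into rule suggestions.
// Input rows use the correction_patterns column names (snake_case).
function suggestRules(patterns, minCount = 3) {
  return patterns
    .filter((p) => p.count >= minCount)
    .map((p) => ({
      entityType: p.entity_type,
      dimension: p.dimension,
      rule: `${p.dimension}: reclassify ${p.from_value ?? 'any'} -> ${p.to_value}`,
      support: p.count, // how many corrections back this suggestion
    }));
}
```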

3.3 Admin Endpoints

Added to src/endpoints/admin-classifier.js:

  • POST /api/admin/classifier/corrections - Submit correction
  • GET /api/admin/classifier/corrections/stats - Dashboard stats
  • GET /api/admin/classifier/corrections/history/:type/:id - Entity history
  • GET /api/admin/classifier/corrections/patterns - Suggested rules
  • POST /api/admin/classifier/corrections/learn - Process pending

Sprint 4: Admin Console (COMPLETE)

4.1 Enterprise Design System

Complete rewrite of console/css/style.css:

  • CSS custom properties for theming
  • Dark theme with professional color palette
  • KPI cards with trends and colors
  • Progress bars and badges
  • Toast notifications
  • Responsive grid layouts

4.2 Reusable Components

Created console/js/components.js:

  • DataTable class with:
    • Sorting (click headers)
    • Pagination (configurable page sizes)
    • Search (debounced)
    • Custom cell renderers
    • Loading/empty states
  • Toast notification system
  • Format helpers (number, percent, currency, date, relative)
  • confidenceBadge() - Color-coded confidence display
  • classificationBadge() - Classification value badges
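
A sketch of confidenceBadge()'s color bucketing. The bucket boundaries (80/65) are assumptions chosen to echo the pipeline thresholds, and the markup is illustrative; the real component lives in console/js/components.js:

```javascript
// Map a confidence score to a color-coded badge: high / medium / low.
function confidenceBadge(confidence) {
  let level;
  if (confidence >= 80) level = 'high';
  else if (confidence >= 65) level = 'medium';
  else level = 'low';
  return `<span class="badge badge-${level}">${confidence}%</span>`;
}
```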

4.3 Page Upgrades

Main Dashboard (console/index.html):

  • New sidebar with logo and nav sections
  • System KPIs with trend indicators
  • Entity cards with progress bars
  • Health table with status badges
  • Distribution charts

URLs Page (console/pages/urls.html):

  • Tabbed interface (Overview, Distributions, Recent, Needs Review)
  • DataTable with full sorting/pagination
  • Source filter dropdown
  • Confidence distribution charts

Domains Page (console/pages/domains.html):

  • Tabbed interface
  • Confidence by dimension table
  • Domain type distribution charts
  • Classify unclassified button

Keywords Page (console/pages/keywords.html):

  • 11 dimension charts
  • Brand and location stats
  • Journey/Intent confidence KPIs
  • Random sample loader

Costs Page (console/pages/costs.html):

  • Budget tracking with alerts
  • Daily cost line chart
  • Service breakdown pie charts
  • Cost per request efficiency
  • Export CSV button

Queues Page (console/pages/queues.html):

  • Queue cards grid with status badges
  • Processing rate chart
  • Failed messages table
  • Retry all / Retry individual buttons
  • Real-time status indicators

Corrections Page (console/pages/corrections.html):

  • NEW page for negative learning
  • Correction stats KPIs
  • Pattern analysis
  • Add correction form
  • Correction history table

Architecture Diagrams

Classification Pipeline Flow

flowchart TB
subgraph Entry["ENTRY POINTS"]
API["API Request<br/>/keywords, /urls, /backlinks"]
CRON["Cron Jobs<br/>Daily subscriptions"]
WEBHOOK["Webhooks<br/>DataForSEO callbacks"]
end

subgraph Gates["GATEKEEPERS"]
ED["ensureDomain()"]
EU["ensureUrl()"]
end

subgraph Queues["CLASSIFICATION QUEUES"]
DQ[["DOMAIN_CLASSIFY_QUEUE"]]
UQ[["URL_CLASSIFY_QUEUE"]]
KQ[["KEYWORD_CLASSIFY_QUEUE"]]
LQ[["LLM_VERIFY_QUEUE"]]
end

subgraph Pipelines["CLASSIFICATION PIPELINES"]
DP["Domain Pipeline<br/>7 stages"]
UP["URL Pipeline<br/>6 stages"]
KP["Keyword Pipeline<br/>5 stages"]
end

subgraph Learning["SELF-LEARNING"]
VD["Vectorize<br/>domain-classifier<br/>>= 80%"]
VU["Vectorize<br/>backlink-classifier<br/>>= 65%"]
VK["Vectorize<br/>keyword-classifier<br/>>= 70%"]
end

subgraph Storage["D1 STORAGE"]
DOMAINS[("domains")]
URLS[("urls")]
KEYWORDS[("keywords")]
CORRECTIONS[("classification_corrections")]
end

subgraph Admin["ADMIN CONSOLE"]
DASH["Dashboard<br/>KPIs & Charts"]
REVIEW["Review Queue<br/>Low confidence"]
CORR["Corrections<br/>Negative learning"]
end

API --> ED
CRON --> ED
WEBHOOK --> EU

ED -->|"not classified"| DQ
EU -->|"domain first"| ED
EU -->|"then classify"| UQ

DQ --> DP
UQ --> UP
KQ --> KP

DP -->|">= 80%"| VD
UP -->|">= 65%"| VU
KP -->|">= 70%"| VK

DP --> DOMAINS
UP --> URLS
KP --> KEYWORDS

UP -->|"< 50%"| LQ
LQ --> UP

DOMAINS --> DASH
URLS --> DASH
KEYWORDS --> DASH

DASH --> REVIEW
REVIEW --> CORR
CORR --> CORRECTIONS
CORRECTIONS -->|"patterns"| VD
CORRECTIONS -->|"patterns"| VU
CORRECTIONS -->|"patterns"| VK

style DP fill:#e8f5e9
style UP fill:#e8f5e9
style KP fill:#e8f5e9
style VD fill:#e0f7fa
style VU fill:#e0f7fa
style VK fill:#e0f7fa
style CORRECTIONS fill:#fff3e0

Domain Classification Pipeline Detail

flowchart LR
subgraph FREE["FREE STAGES"]
S0["0. Cache<br/>D1 lookup"]
S1["1. Rules<br/>7,100+ domains"]
S15["1.5 Google Ads<br/>Category hints"]
S2["2. Vectorize<br/>Semantic similarity"]
S3["3. Low-Noise<br/>HEAD + 8KB GET"]
end

subgraph PAID["PAID STAGES"]
S4["4. Instant Pages<br/>$0.000125"]
S45["4.5 Patterns<br/>Domain regex"]
S5["5. LLM<br/>~$0.0001"]
end

S6["6. Store & Learn<br/>Save + Vectorize"]

S0 -->|"MISS"| S1
S1 -->|"< 70%"| S15
S15 -->|"< 70%"| S2
S2 -->|"< 70%"| S3
S3 -->|"< 70%"| S4
S4 -->|"< 70%"| S45
S45 -->|"< 70%"| S5
S5 --> S6

S0 -->|">= 60%"| DONE["Done"]
S1 -->|">= 80%"| S6
S2 -->|">= 80%"| S6
S3 -->|">= 70%"| S6
S4 -->|">= 70%"| S6
S45 -->|">= 70%"| S6

style S0 fill:#c8e6c9
style S1 fill:#c8e6c9
style S15 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S45 fill:#c8e6c9
style S5 fill:#ffcdd2
style S6 fill:#e0f7fa
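
The escalation logic in the diagram above can be sketched as a cost-ordered cascade: run each stage in turn and accept the first result that clears that stage's acceptance threshold. Stage names and thresholds follow the diagram; the stage objects and run() interface are illustrative assumptions:

```javascript
// Run stages cheapest-first; accept the first sufficiently confident
// result, otherwise escalate to the next (possibly paid) stage.
function classifyDomain(domain, stages) {
  for (const stage of stages) {
    const result = stage.run(domain);
    if (result && result.confidence >= stage.acceptAt) {
      return { ...result, stage: stage.name };
    }
    // Below threshold: fall through to the next stage.
  }
  return null; // exhausted all stages without a confident answer
}
```

For example, a rules hit at 70% would escalate past the 80% rules threshold and let Vectorize answer instead.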

Admin Console Architecture

flowchart TB
subgraph Console["ADMIN CONSOLE (Cloudflare Pages)"]
INDEX["index.html<br/>Dashboard"]
DOMAINS["domains.html<br/>Domain classification"]
URLS["urls.html<br/>URL classification"]
KEYWORDS["keywords.html<br/>Keyword classification"]
COSTS["costs.html<br/>Cost tracking"]
QUEUES["queues.html<br/>Queue status"]
CORR["corrections.html<br/>Negative learning"]
end

subgraph Components["REUSABLE COMPONENTS"]
DT["DataTable<br/>Sort/Page/Search"]
TOAST["Toast<br/>Notifications"]
FMT["Format<br/>Number/Date/Currency"]
BADGE["Badges<br/>Confidence/Classification"]
end

subgraph API["WORKER API"]
STATS["/api/admin/classifier/*"]
QUEUE["/api/admin/queues/*"]
COST["/api/admin/costs/*"]
CORR_API["/api/admin/classifier/corrections/*"]
end

subgraph Storage["D1 DATABASE"]
D[("domains")]
U[("urls")]
K[("keywords")]
CC[("classification_corrections")]
CP[("correction_patterns")]
end

INDEX --> DT
DOMAINS --> DT
URLS --> DT
KEYWORDS --> DT
QUEUES --> DT
CORR --> DT

INDEX --> STATS
DOMAINS --> STATS
URLS --> STATS
KEYWORDS --> STATS
COSTS --> COST
QUEUES --> QUEUE
CORR --> CORR_API

STATS --> D
STATS --> U
STATS --> K
CORR_API --> CC
CORR_API --> CP

style INDEX fill:#e3f2fd
style DOMAINS fill:#e8f5e9
style URLS fill:#fff3e0
style KEYWORDS fill:#f3e5f5
style COSTS fill:#fce4ec
style QUEUES fill:#e0f7fa
style CORR fill:#fff8e1

Success Metrics

| Metric | Target | Current |
| --- | --- | --- |
| Domain Classification Coverage | 100% before URLs | COMPLETE |
| Learning Rate (Vectorize) | > 60% of high-conf | COMPLETE |
| Correction System | Implemented | COMPLETE |
| Admin Console | Enterprise-grade | COMPLETE |
| Review Queue | < 100 items | MONITORING |

Files Modified/Created

Phase 1 (Remove Bubble-Up)

  • src/queue/backlink-classify-consumer.js - Removed bubble-up function
  • src/lib/url-management.js - Domain-first enforcement

Phase 2 (Learning)

  • src/lib/domain-classifier.js - Learning for all stages
  • src/lib/url-classifier.js - Standardized learning thresholds
  • src/lib/classification-config.js - Centralized thresholds

Phase 3 (Negative Learning)

  • migrations/0119_classification_corrections.sql - New tables
  • src/lib/classification-corrections.js - NEW module
  • src/endpoints/admin-classifier.js - Correction endpoints

Phase 4 (Console)

  • console/css/style.css - Enterprise design system
  • console/js/components.js - DataTable, Toast, Format
  • console/js/api.js - Correction API methods
  • console/index.html - Complete redesign
  • console/pages/urls.html - Enterprise upgrade
  • console/pages/domains.html - Enterprise upgrade
  • console/pages/keywords.html - Enterprise upgrade
  • console/pages/costs.html - Enterprise upgrade
  • console/pages/queues.html - Enterprise upgrade
  • console/pages/corrections.html - NEW page

Future Considerations

Index Versioning (Deferred)

  • Weekly snapshots of Vectorize to R2
  • Version metadata for rollback
  • Not urgent, can implement if quality issues arise

Cross-Pipeline Learning (Deferred)

  • Keyword brand mentions → domain verification
  • URL classification → domain reinforcement
  • Always via LLM with human oversight
  • Queue for review, never auto-update

LLM Model Selection (Deferred)

  • Different models for different confidence levels
  • Claude for low-confidence review
  • Llama 3.3 70B for standard classification

Maintenance Notes

Adding New Classification Dimensions

  1. Update pipeline stage in appropriate classifier
  2. Add rules in classifier-rules-engine.js
  3. Update Vectorize metadata schema
  4. Add column to D1 table
  5. Update admin console charts/tables

Monitoring Classification Quality

  1. Check admin dashboard KPIs daily
  2. Review low-confidence queue weekly
  3. Analyze correction patterns monthly
  4. Update rules engine based on patterns

Deploying Changes

  1. Run migrations: wrangler d1 execute RANKFABRIC_DB --file=migrations/XXXX.sql
  2. Deploy worker: wrangler deploy
  3. Deploy console: wrangler pages deploy console/
  4. Verify via admin console