Classification Pipeline Master Plan

Executive Summary

This document tracks the implementation of critical improvements to our three classification pipelines (Domain, URL, Keyword).

STATUS: Sprints 1-4 COMPLETE (December 2024)


Implementation Status

| Sprint | Phase | Status | Notes |
|---|---|---|---|
| Sprint 1 | Remove Bubble-Up, Domain-First | COMPLETE | Removed maybeUpdateDomainClassification(), added domain-first enforcement |
| Sprint 2 | Learning Improvements | COMPLETE | All pipelines now consistently feed Vectorize |
| Sprint 3 | Negative Learning | COMPLETE | classification_corrections table, correction_patterns, API endpoints |
| Sprint 4 | Admin Console | COMPLETE | Enterprise-grade UI with DataTable, KPIs, charts |
| Sprint 5 | Polish & Monitor | COMPLETE | Documentation, mermaid diagrams updated |

Completed Work

Sprint 1: Clean Foundation (COMPLETE)

1.1 Removed Domain Bubble-Up

  • Removed maybeUpdateDomainClassification() from backlink-classify-consumer.js
  • Removed its call sites at lines 624 and 936
  • Removed unused DOMAIN_AGGREGATION_* constants

1.2 Enforce Domain Classification Before URL

  • Added domain classification check in ensureUrl() in url-management.js
  • URLs cannot be classified without their domain being classified first
  • If the domain is not yet classified, URL classification waits or triggers domain classification
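
The domain-first rule can be sketched roughly as follows. This is an illustration only, not the actual ensureUrl() code: the `domainStore` Map, the `classified` flag, and the `enqueueDomainClassification` callback are hypothetical stand-ins for the real storage and queue.

```javascript
// Hypothetical sketch of domain-first enforcement; not the real ensureUrl().
// `domainStore` stands in for the domains table and `enqueueDomainClassification`
// for whatever mechanism triggers domain classification.
function checkDomainFirst(url, domainStore, enqueueDomainClassification) {
  const hostname = new URL(url).hostname;
  const domain = domainStore.get(hostname);

  if (domain && domain.classified) {
    // Domain is classified: URL classification may proceed.
    return { ready: true, hostname };
  }

  // Domain missing or unclassified: trigger domain classification, defer the URL.
  enqueueDomainClassification(hostname);
  return { ready: false, hostname };
}

const domainStore = new Map([["example.com", { classified: true }]]);
const queued = [];
const enqueue = (hostname) => queued.push(hostname);

const ok = checkDomainFirst("https://example.com/pricing", domainStore, enqueue);
const deferred = checkDomainFirst("https://unknown.test/page", domainStore, enqueue);
```

In this sketch the caller decides what "waits" means (retry later, requeue, and so on); the check itself only reports readiness and requests domain classification.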

Sprint 2: Learning Improvements (COMPLETE)

2.1 Domain Pipeline Learning

  • Moved learning to Stage 6 for ALL classification sources (not just LLM)
  • Imports centralized from classification-config.js:
    • DOMAIN_LEARNING_MIN_CONFIDENCE (80%)
    • shouldTriggerLearning(confidence, 'domain')
  • Rules engine matches at 95%+ now feed Vectorize
  • LLM results at 80%+ feed Vectorize

2.2 URL Pipeline Learning

  • Standardized learning thresholds via classification-config.js
  • Uses LEARNING_MIN_CONFIDENCE (65%) consistently
  • Added shouldTriggerLearning() checks in learnFromClassification()

2.3 Keyword Pipeline

  • Already well designed; learns at a 70% confidence threshold
  • No changes needed; serves as the model for the other pipelines
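
The thresholds from sections 2.1 to 2.3 can be pictured as a single lookup. The sketch below is an assumption about how classification-config.js might look, not its actual contents: `KEYWORD_LEARNING_MIN_CONFIDENCE` is a hypothetical name, and confidence is assumed to be an integer percentage (0-100).

```javascript
// Assumed layout; only DOMAIN_LEARNING_MIN_CONFIDENCE, LEARNING_MIN_CONFIDENCE
// and shouldTriggerLearning(confidence, pipeline) are named in this document.
const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;
const LEARNING_MIN_CONFIDENCE = 65; // URL pipeline
const KEYWORD_LEARNING_MIN_CONFIDENCE = 70; // hypothetical constant name

const THRESHOLDS = {
  domain: DOMAIN_LEARNING_MIN_CONFIDENCE,
  url: LEARNING_MIN_CONFIDENCE,
  keyword: KEYWORD_LEARNING_MIN_CONFIDENCE,
};

// Returns true when a result is confident enough to feed Vectorize.
function shouldTriggerLearning(confidence, pipeline) {
  const min = THRESHOLDS[pipeline];
  if (min === undefined) {
    throw new Error(`Unknown pipeline: ${pipeline}`);
  }
  return confidence >= min;
}
```

Under these assumptions a rules-engine match at 95% and an LLM result at 80% both pass the domain check, while a URL result at 64% does not trigger learning.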

Sprint 3: Negative Learning (COMPLETE)

3.1 Database Schema

Created migrations/0119_classification_corrections.sql:

-- Track classification corrections/feedback
CREATE TABLE classification_corrections (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_type TEXT NOT NULL, -- 'domain', 'url', 'keyword'
    entity_id INTEGER NOT NULL,
    original_dimension TEXT NOT NULL,
    original_value TEXT,
    corrected_value TEXT NOT NULL,
    confidence_before INTEGER,
    notes TEXT,
    created_by TEXT DEFAULT 'system',
    created_at INTEGER NOT NULL DEFAULT (unixepoch()),
    processed_at INTEGER,
    UNIQUE(entity_type, entity_id, original_dimension, corrected_value)
);

-- Track correction patterns for generating rules
CREATE TABLE correction_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_type TEXT NOT NULL,
    dimension TEXT NOT NULL,
    from_value TEXT,
    to_value TEXT NOT NULL,
    count INTEGER DEFAULT 1,
    suggested_rule TEXT,
    created_at INTEGER NOT NULL DEFAULT (unixepoch()),
    last_seen_at INTEGER NOT NULL DEFAULT (unixepoch()),
    -- Note: SQLite treats NULLs as distinct in UNIQUE constraints,
    -- so rows with a NULL from_value never conflict with each other.
    UNIQUE(entity_type, dimension, from_value, to_value)
);

3.2 Corrections Module

Created src/lib/classification-corrections.js:

  • recordCorrection() - Save correction and update entity
  • getPendingCorrections() - Get unprocessed corrections
  • getSuggestedRules() - Analyze patterns for auto-rules
  • getCorrectionHistory() - View correction history
  • getCorrectionStats() - Dashboard statistics
  • processPendingCorrections() - Batch process for Vectorize feedback
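
How corrections roll up into patterns can be sketched in memory. The code below is not the module's implementation: `recordPattern`, `suggestRules`, the Map-based store, and the `minCount` cutoff are hypothetical, and only the grouping key mirrors the UNIQUE constraint on correction_patterns.

```javascript
// In-memory stand-in for the correction_patterns table (hypothetical).
// The key mirrors UNIQUE(entity_type, dimension, from_value, to_value).
function recordPattern(patterns, correction, now) {
  const { entityType, dimension, fromValue, toValue } = correction;
  const key = JSON.stringify([entityType, dimension, fromValue, toValue]);
  const existing = patterns.get(key);

  if (existing) {
    // Same pattern seen again: bump the counter and the last-seen timestamp.
    existing.count += 1;
    existing.lastSeenAt = now;
    return existing;
  }

  const created = {
    entityType, dimension, fromValue, toValue,
    count: 1, createdAt: now, lastSeenAt: now,
  };
  patterns.set(key, created);
  return created;
}

// Patterns seen at least `minCount` times become rule candidates, most frequent first.
function suggestRules(patterns, minCount) {
  return [...patterns.values()]
    .filter((p) => p.count >= minCount)
    .sort((a, b) => b.count - a.count);
}

const patterns = new Map();
const blogToNews = { entityType: "domain", dimension: "type", fromValue: "blog", toValue: "news" };
recordPattern(patterns, blogToNews, 1000);
recordPattern(patterns, blogToNews, 2000);
recordPattern(patterns, blogToNews, 3000);
recordPattern(patterns, { ...blogToNews, toValue: "forum" }, 4000);

const candidates = suggestRules(patterns, 2);
```

Unlike the SQL constraint, this in-memory key treats a null fromValue as an ordinary value, so repeated corrections from "no value" are counted together.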

3.3 Admin Endpoints

Added to src/endpoints/admin-classifier.js:

  • POST /api/admin/classifier/corrections - Submit correction
  • GET /api/admin/classifier/corrections/stats - Dashboard stats
  • GET /api/admin/classifier/corrections/history/:type/:id - Entity history
  • GET /api/admin/classifier/corrections/patterns - Suggested rules
  • POST /api/admin/classifier/corrections/learn - Process pending
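
A correction submission might be assembled as shown below. The request body shape is an assumption: the field names simply mirror the classification_corrections columns, and `buildCorrectionRequest`, the base URL, and the sample values are hypothetical. Authentication is omitted because this document does not describe it.

```javascript
// Hypothetical helper; body fields are assumed to mirror the table columns.
function buildCorrectionRequest(baseUrl, correction) {
  const required = ["entity_type", "entity_id", "original_dimension", "corrected_value"];
  for (const field of required) {
    if (correction[field] === undefined || correction[field] === null) {
      throw new Error(`Missing field: ${field}`);
    }
  }
  return {
    url: new URL("/api/admin/classifier/corrections", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(correction),
    },
  };
}

const request = buildCorrectionRequest("https://admin.example.com", {
  entity_type: "domain",
  entity_id: 42,
  original_dimension: "type",
  original_value: "blog",
  corrected_value: "news",
  notes: "Publishes daily reporting",
});

// To send it: await fetch(request.url, request.init);
```

Keeping request construction separate from the fetch call makes the payload easy to inspect or unit-test without a running worker.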

Sprint 4: Admin Console (COMPLETE)

4.1 Enterprise Design System

Complete rewrite of console/css/style.css:

  • CSS custom properties for theming
  • Dark theme with professional color palette
  • KPI cards with trends and colors
  • Progress bars and badges
  • Toast notifications
  • Responsive grid layouts

4.2 Reusable Components

Created console/js/components.js:

  • DataTable class with:
    • Sorting (click headers)
    • Pagination (configurable page sizes)
    • Search (debounced)
    • Custom cell renderers
    • Loading/empty states
  • Toast notification system
  • Format helpers (number, percent, currency, date, relative)
  • confidenceBadge() - Color-coded confidence display
  • classificationBadge() - Classification value badges

4.3 Page Upgrades

Main Dashboard (console/index.html):

  • New sidebar with logo and nav sections
  • System KPIs with trend indicators
  • Entity cards with progress bars
  • Health table with status badges
  • Distribution charts

URLs Page (console/pages/urls.html):

  • Tabbed interface (Overview, Distributions, Recent, Needs Review)
  • DataTable with full sorting/pagination
  • Source filter dropdown
  • Confidence distribution charts

Domains Page (console/pages/domains.html):

  • Tabbed interface
  • Confidence by dimension table
  • Domain type distribution charts
  • Classify unclassified button

Keywords Page (console/pages/keywords.html):

  • 11 dimension charts
  • Brand and location stats
  • Journey/Intent confidence KPIs
  • Random sample loader

Costs Page (console/pages/costs.html):

  • Budget tracking with alerts
  • Daily cost line chart
  • Service breakdown pie charts
  • Cost per request efficiency
  • Export CSV button

Queues Page (console/pages/queues.html):

  • Queue cards grid with status badges
  • Processing rate chart
  • Failed messages table
  • Retry all / Retry individual buttons
  • Real-time status indicators

Corrections Page (console/pages/corrections.html):

  • NEW page for negative learning
  • Correction stats KPIs
  • Pattern analysis
  • Add correction form
  • Correction history table

Architecture Diagrams

Classification Pipeline Flow

Domain Classification Pipeline Detail

Admin Console Architecture


Success Metrics

| Metric | Target | Current |
|---|---|---|
| Domain Classification Coverage | 100% before URLs | COMPLETE |
| Learning Rate (Vectorize) | > 60% of high-confidence results | COMPLETE |
| Correction System | Implemented | COMPLETE |
| Admin Console | Enterprise-grade | COMPLETE |
| Review Queue | < 100 items | MONITORING |

Files Modified/Created

Phase 1 (Remove Bubble-Up)

  • src/queue/backlink-classify-consumer.js - Removed bubble-up function
  • src/lib/url-management.js - Domain-first enforcement

Phase 2 (Learning)

  • src/lib/domain-classifier.js - Learning for all stages
  • src/lib/url-classifier.js - Standardized learning thresholds
  • src/lib/classification-config.js - Centralized thresholds

Phase 3 (Negative Learning)

  • migrations/0119_classification_corrections.sql - New tables
  • src/lib/classification-corrections.js - NEW module
  • src/endpoints/admin-classifier.js - Correction endpoints

Phase 4 (Console)

  • console/css/style.css - Enterprise design system
  • console/js/components.js - DataTable, Toast, Format
  • console/js/api.js - Correction API methods
  • console/index.html - Complete redesign
  • console/pages/urls.html - Enterprise upgrade
  • console/pages/domains.html - Enterprise upgrade
  • console/pages/keywords.html - Enterprise upgrade
  • console/pages/costs.html - Enterprise upgrade
  • console/pages/queues.html - Enterprise upgrade
  • console/pages/corrections.html - NEW page

Future Considerations

Index Versioning (Deferred)

  • Weekly snapshots of Vectorize to R2
  • Version metadata for rollback
  • Not urgent; can be implemented if quality issues arise

Cross-Pipeline Learning (Deferred)

  • Keyword brand mentions → domain verification
  • URL classification → domain reinforcement
  • Always via LLM with human oversight
  • Queue for review, never auto-update

LLM Model Selection (Deferred)

  • Different models for different confidence levels
  • Claude for low-confidence review
  • Llama 3.3 70B for standard classification

Maintenance Notes

Adding New Classification Dimensions

  1. Update pipeline stage in appropriate classifier
  2. Add rules in classifier-rules-engine.js
  3. Update Vectorize metadata schema
  4. Add column to D1 table
  5. Update admin console charts/tables

Monitoring Classification Quality

  1. Check admin dashboard KPIs daily
  2. Review low-confidence queue weekly
  3. Analyze correction patterns monthly
  4. Update rules engine based on patterns

Deploying Changes

  1. Run migrations: wrangler d1 execute RANKFABRIC_DB --file=migrations/XXXX.sql
  2. Deploy worker: wrangler deploy
  3. Deploy console: wrangler pages deploy console/
  4. Verify via admin console