
Classification Pipeline Master Plan

Executive Summary

This document tracks the implementation of critical improvements to our three classification pipelines (Domain, URL, Keyword).

STATUS: Sprints 1-5 COMPLETE (December 2024)


Implementation Status

| Sprint | Phase | Status | Notes |
| --- | --- | --- | --- |
| Sprint 1 | Remove Bubble-Up, Domain-First | COMPLETE | Removed maybeUpdateDomainClassification(), added domain-first enforcement |
| Sprint 2 | Learning Improvements | COMPLETE | All pipelines now consistently feed Vectorize |
| Sprint 3 | Negative Learning | COMPLETE | classification_corrections table, correction_patterns, API endpoints |
| Sprint 4 | Admin Console | COMPLETE | Enterprise-grade UI with DataTable, KPIs, charts |
| Sprint 5 | Polish & Monitor | COMPLETE | Documentation, mermaid diagrams updated |

Completed Work

Sprint 1: Clean Foundation (COMPLETE)

1.1 Removed Domain Bubble-Up

  • Removed maybeUpdateDomainClassification() from backlink-classify-consumer.js
  • Removed both call sites (lines 624 and 936)
  • Removed unused DOMAIN_AGGREGATION_* constants

1.2 Enforce Domain Classification Before URL

  • Added domain classification check in ensureUrl() in url-management.js
  • URLs cannot be classified without their domain being classified first
  • If domain not classified, URL classification waits or triggers domain classification
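
The gate above can be sketched as follows. The function and parameter names here are illustrative stand-ins (the real check lives inside ensureUrl() in url-management.js, and the enqueue callbacks would be queue bindings, not plain functions):

```javascript
// Sketch of the domain-first gate: a URL is only queued for classification
// once its domain is classified; otherwise domain classification is
// triggered and the URL is deferred.
function ensureUrlClassification(url, domain, { enqueueDomain, enqueueUrl }) {
  if (!domain.classified) {
    // Domain must be classified first: trigger it and defer the URL.
    enqueueDomain(domain.name);
    return { status: 'deferred', reason: 'domain-not-classified' };
  }
  enqueueUrl(url);
  return { status: 'queued' };
}
```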

Sprint 2: Learning Improvements (COMPLETE)

2.1 Domain Pipeline Learning

  • Moved learning to Stage 6 for ALL classification sources (not just LLM)
  • Imports centralized from classification-config.js:
    • DOMAIN_LEARNING_MIN_CONFIDENCE (80%)
    • shouldTriggerLearning(confidence, 'domain')
  • Rules engine matches at 95%+ now feed Vectorize
  • LLM results at 80%+ feed Vectorize

2.2 URL Pipeline Learning

  • Standardized learning thresholds via classification-config.js
  • Uses LEARNING_MIN_CONFIDENCE (65%) consistently
  • Added shouldTriggerLearning() checks in learnFromClassification()
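
A minimal sketch of the centralized thresholds in classification-config.js. The constant names and the shouldTriggerLearning(confidence, pipeline) signature follow this document; the exact module shape and the keyword constant name are assumptions:

```javascript
// Per-pipeline minimum confidence for feeding results back into Vectorize.
const LEARNING_MIN_CONFIDENCE = 65;         // URL pipeline
const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;  // domain pipeline
const KEYWORD_LEARNING_MIN_CONFIDENCE = 70; // keyword pipeline (assumed name)

const LEARNING_THRESHOLDS = {
  url: LEARNING_MIN_CONFIDENCE,
  domain: DOMAIN_LEARNING_MIN_CONFIDENCE,
  keyword: KEYWORD_LEARNING_MIN_CONFIDENCE,
};

// Only high-confidence results trigger self-learning.
function shouldTriggerLearning(confidence, pipeline) {
  const min = LEARNING_THRESHOLDS[pipeline];
  if (min === undefined) throw new Error(`unknown pipeline: ${pipeline}`);
  return confidence >= min;
}
```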

2.3 Keyword Pipeline

  • Already well-designed, learning at 70% threshold
  • No changes needed, serves as model for other pipelines

Sprint 3: Negative Learning (COMPLETE)

3.1 Database Schema

Created migrations/0119_classification_corrections.sql:

-- Track classification corrections/feedback
CREATE TABLE classification_corrections (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL, -- 'domain', 'url', 'keyword'
  entity_id INTEGER NOT NULL,
  original_dimension TEXT NOT NULL,
  original_value TEXT,
  corrected_value TEXT NOT NULL,
  confidence_before INTEGER,
  notes TEXT,
  created_by TEXT DEFAULT 'system',
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  processed_at INTEGER,
  UNIQUE(entity_type, entity_id, original_dimension, corrected_value)
);

-- Track correction patterns for generating rules
CREATE TABLE correction_patterns (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL,
  dimension TEXT NOT NULL,
  from_value TEXT,
  to_value TEXT NOT NULL,
  count INTEGER DEFAULT 1,
  suggested_rule TEXT,
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  last_seen_at INTEGER NOT NULL DEFAULT (unixepoch()),
  UNIQUE(entity_type, dimension, from_value, to_value)
);
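
The correction_patterns UNIQUE constraint means each (entity_type, dimension, from_value, to_value) combination is a single row with a running count. This in-memory sketch mirrors that behavior (the real implementation would presumably use an INSERT ... ON CONFLICT upsert against D1):

```javascript
// Aggregate corrections into one pattern row per unique combination,
// incrementing count on repeats -- the in-memory analogue of the table's
// UNIQUE(entity_type, dimension, from_value, to_value) upsert.
function recordPattern(patterns, { entityType, dimension, fromValue, toValue }) {
  const key = [entityType, dimension, fromValue ?? '', toValue].join('\u0000');
  const existing = patterns.get(key);
  if (existing) {
    existing.count += 1;
    existing.lastSeenAt = Date.now();
    return existing;
  }
  const row = { entityType, dimension, fromValue, toValue, count: 1, lastSeenAt: Date.now() };
  patterns.set(key, row);
  return row;
}
```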

3.2 Corrections Module

Created src/lib/classification-corrections.js:

  • recordCorrection() - Save correction and update entity
  • getPendingCorrections() - Get unprocessed corrections
  • getSuggestedRules() - Analyze patterns for auto-rules
  • getCorrectionHistory() - View correction history
  • getCorrectionStats() - Dashboard statistics
  • processPendingCorrections() - Batch process for Vectorize feedback
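
A sketch of the idea behind getSuggestedRules(): once the same correction pattern has enough support, propose a rule for the rules engine. The minimum count of 3 and the rule-string format are assumptions, not from the source:

```javascript
// Turn frequently repeated correction patterns into rule suggestions.
// Input rows use the correction_patterns column names (snake_case).
function suggestRules(patterns, minCount = 3) {
  return patterns
    .filter((p) => p.count >= minCount)
    .map((p) => ({
      entityType: p.entity_type,
      dimension: p.dimension,
      rule: `${p.dimension}: reclassify ${p.from_value ?? 'any'} -> ${p.to_value}`,
      support: p.count, // how many corrections back this suggestion
    }));
}
```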

3.3 Admin Endpoints

Added to src/endpoints/admin-classifier.js:

  • POST /api/admin/classifier/corrections - Submit correction
  • GET /api/admin/classifier/corrections/stats - Dashboard stats
  • GET /api/admin/classifier/corrections/history/:type/:id - Entity history
  • GET /api/admin/classifier/corrections/patterns - Suggested rules
  • POST /api/admin/classifier/corrections/learn - Process pending

Sprint 4: Admin Console (COMPLETE)

4.1 Enterprise Design System

Complete rewrite of console/css/style.css:

  • CSS custom properties for theming
  • Dark theme with professional color palette
  • KPI cards with trends and colors
  • Progress bars and badges
  • Toast notifications
  • Responsive grid layouts

4.2 Reusable Components

Created console/js/components.js:

  • DataTable class with:
    • Sorting (click headers)
    • Pagination (configurable page sizes)
    • Search (debounced)
    • Custom cell renderers
    • Loading/empty states
  • Toast notification system
  • Format helpers (number, percent, currency, date, relative)
  • confidenceBadge() - Color-coded confidence display
  • classificationBadge() - Classification value badges
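
A sketch of confidenceBadge()'s color bucketing. The bucket boundaries (80/65) are assumptions chosen to echo the pipeline thresholds, and the markup is illustrative; the real component lives in console/js/components.js:

```javascript
// Map a confidence score to a color-coded badge: high / medium / low.
function confidenceBadge(confidence) {
  let level;
  if (confidence >= 80) level = 'high';
  else if (confidence >= 65) level = 'medium';
  else level = 'low';
  return `<span class="badge badge-${level}">${confidence}%</span>`;
}
```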

4.3 Page Upgrades

Main Dashboard (console/index.html):

  • New sidebar with logo and nav sections
  • System KPIs with trend indicators
  • Entity cards with progress bars
  • Health table with status badges
  • Distribution charts

URLs Page (console/pages/urls.html):

  • Tabbed interface (Overview, Distributions, Recent, Needs Review)
  • DataTable with full sorting/pagination
  • Source filter dropdown
  • Confidence distribution charts

Domains Page (console/pages/domains.html):

  • Tabbed interface
  • Confidence by dimension table
  • Domain type distribution charts
  • Classify unclassified button

Keywords Page (console/pages/keywords.html):

  • 11 dimension charts
  • Brand and location stats
  • Journey/Intent confidence KPIs
  • Random sample loader

Costs Page (console/pages/costs.html):

  • Budget tracking with alerts
  • Daily cost line chart
  • Service breakdown pie charts
  • Cost per request efficiency
  • Export CSV button

Queues Page (console/pages/queues.html):

  • Queue cards grid with status badges
  • Processing rate chart
  • Failed messages table
  • Retry all / Retry individual buttons
  • Real-time status indicators

Corrections Page (console/pages/corrections.html):

  • NEW page for negative learning
  • Correction stats KPIs
  • Pattern analysis
  • Add correction form
  • Correction history table

Architecture Diagrams

Classification Pipeline Flow

flowchart TB
subgraph Entry["ENTRY POINTS"]
API["API Request<br/>/keywords, /urls, /backlinks"]
CRON["Cron Jobs<br/>Daily subscriptions"]
WEBHOOK["Webhooks<br/>DataForSEO callbacks"]
end

subgraph Gates["GATEKEEPERS"]
ED["ensureDomain()"]
EU["ensureUrl()"]
end

subgraph Queues["CLASSIFICATION QUEUES"]
DQ[["DOMAIN_CLASSIFY_QUEUE"]]
UQ[["URL_CLASSIFY_QUEUE"]]
KQ[["KEYWORD_CLASSIFY_QUEUE"]]
LQ[["LLM_VERIFY_QUEUE"]]
end

subgraph Pipelines["CLASSIFICATION PIPELINES"]
DP["Domain Pipeline<br/>7 stages"]
UP["URL Pipeline<br/>6 stages"]
KP["Keyword Pipeline<br/>5 stages"]
end

subgraph Learning["SELF-LEARNING"]
VD["Vectorize<br/>domain-classifier<br/>>= 80%"]
VU["Vectorize<br/>backlink-classifier<br/>>= 65%"]
VK["Vectorize<br/>keyword-classifier<br/>>= 70%"]
end

subgraph Storage["D1 STORAGE"]
DOMAINS[("domains")]
URLS[("urls")]
KEYWORDS[("keywords")]
CORRECTIONS[("classification_corrections")]
end

subgraph Admin["ADMIN CONSOLE"]
DASH["Dashboard<br/>KPIs & Charts"]
REVIEW["Review Queue<br/>Low confidence"]
CORR["Corrections<br/>Negative learning"]
end

API --> ED
CRON --> ED
WEBHOOK --> EU

ED -->|"not classified"| DQ
EU -->|"domain first"| ED
EU -->|"then classify"| UQ

DQ --> DP
UQ --> UP
KQ --> KP

DP -->|">= 80%"| VD
UP -->|">= 65%"| VU
KP -->|">= 70%"| VK

DP --> DOMAINS
UP --> URLS
KP --> KEYWORDS

UP -->|"< 50%"| LQ
LQ --> UP

DOMAINS --> DASH
URLS --> DASH
KEYWORDS --> DASH

DASH --> REVIEW
REVIEW --> CORR
CORR --> CORRECTIONS
CORRECTIONS -->|"patterns"| VD
CORRECTIONS -->|"patterns"| VU
CORRECTIONS -->|"patterns"| VK

style DP fill:#e8f5e9
style UP fill:#e8f5e9
style KP fill:#e8f5e9
style VD fill:#e0f7fa
style VU fill:#e0f7fa
style VK fill:#e0f7fa
style CORRECTIONS fill:#fff3e0

Domain Classification Pipeline Detail

flowchart LR
subgraph FREE["FREE STAGES"]
S0["0. Cache<br/>D1 lookup"]
S1["1. Rules<br/>7,100+ domains"]
S15["1.5 Google Ads<br/>Category hints"]
S2["2. Vectorize<br/>Semantic similarity"]
S3["3. Low-Noise<br/>HEAD + 8KB GET"]
end

subgraph PAID["PAID STAGES"]
S4["4. Instant Pages<br/>$0.000125"]
S45["4.5 Patterns<br/>Domain regex"]
S5["5. LLM<br/>~$0.0001"]
end

S6["6. Store & Learn<br/>Save + Vectorize"]

S0 -->|"MISS"| S1
S1 -->|"< 70%"| S15
S15 -->|"< 70%"| S2
S2 -->|"< 70%"| S3
S3 -->|"< 70%"| S4
S4 -->|"< 70%"| S45
S45 -->|"< 70%"| S5
S5 --> S6

S0 -->|">= 60%"| DONE["Done"]
S1 -->|">= 80%"| S6
S2 -->|">= 80%"| S6
S3 -->|">= 70%"| S6
S4 -->|">= 70%"| S6
S45 -->|">= 70%"| S6

style S0 fill:#c8e6c9
style S1 fill:#c8e6c9
style S15 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#c8e6c9
style S4 fill:#fff3e0
style S45 fill:#c8e6c9
style S5 fill:#ffcdd2
style S6 fill:#e0f7fa
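
The escalation logic in the diagram above can be sketched as a cost-ordered cascade: run each stage in turn and accept the first result that clears that stage's acceptance threshold. Stage names and thresholds follow the diagram; the stage objects and run() interface are illustrative assumptions:

```javascript
// Run stages cheapest-first; accept the first sufficiently confident
// result, otherwise escalate to the next (possibly paid) stage.
function classifyDomain(domain, stages) {
  for (const stage of stages) {
    const result = stage.run(domain);
    if (result && result.confidence >= stage.acceptAt) {
      return { ...result, stage: stage.name };
    }
    // Below threshold: fall through to the next stage.
  }
  return null; // exhausted all stages without a confident answer
}
```

For example, a rules hit at 70% would escalate past the 80% rules threshold and let Vectorize answer instead.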

Admin Console Architecture

flowchart TB
subgraph Console["ADMIN CONSOLE (Cloudflare Pages)"]
INDEX["index.html<br/>Dashboard"]
DOMAINS["domains.html<br/>Domain classification"]
URLS["urls.html<br/>URL classification"]
KEYWORDS["keywords.html<br/>Keyword classification"]
COSTS["costs.html<br/>Cost tracking"]
QUEUES["queues.html<br/>Queue status"]
CORR["corrections.html<br/>Negative learning"]
end

subgraph Components["REUSABLE COMPONENTS"]
DT["DataTable<br/>Sort/Page/Search"]
TOAST["Toast<br/>Notifications"]
FMT["Format<br/>Number/Date/Currency"]
BADGE["Badges<br/>Confidence/Classification"]
end

subgraph API["WORKER API"]
STATS["/api/admin/classifier/*"]
QUEUE["/api/admin/queues/*"]
COST["/api/admin/costs/*"]
CORR_API["/api/admin/classifier/corrections/*"]
end

subgraph Storage["D1 DATABASE"]
D[("domains")]
U[("urls")]
K[("keywords")]
CC[("classification_corrections")]
CP[("correction_patterns")]
end

INDEX --> DT
DOMAINS --> DT
URLS --> DT
KEYWORDS --> DT
QUEUES --> DT
CORR --> DT

INDEX --> STATS
DOMAINS --> STATS
URLS --> STATS
KEYWORDS --> STATS
COSTS --> COST
QUEUES --> QUEUE
CORR --> CORR_API

STATS --> D
STATS --> U
STATS --> K
CORR_API --> CC
CORR_API --> CP

style INDEX fill:#e3f2fd
style DOMAINS fill:#e8f5e9
style URLS fill:#fff3e0
style KEYWORDS fill:#f3e5f5
style COSTS fill:#fce4ec
style QUEUES fill:#e0f7fa
style CORR fill:#fff8e1

Success Metrics

| Metric | Target | Current |
| --- | --- | --- |
| Domain Classification Coverage | 100% before URLs | COMPLETE |
| Learning Rate (Vectorize) | > 60% of high-conf | COMPLETE |
| Correction System | Implemented | COMPLETE |
| Admin Console | Enterprise-grade | COMPLETE |
| Review Queue | < 100 items | MONITORING |

Files Modified/Created

Phase 1 (Remove Bubble-Up)

  • src/queue/backlink-classify-consumer.js - Removed bubble-up function
  • src/lib/url-management.js - Domain-first enforcement

Phase 2 (Learning)

  • src/lib/domain-classifier.js - Learning for all stages
  • src/lib/url-classifier.js - Standardized learning thresholds
  • src/lib/classification-config.js - Centralized thresholds

Phase 3 (Negative Learning)

  • migrations/0119_classification_corrections.sql - New tables
  • src/lib/classification-corrections.js - NEW module
  • src/endpoints/admin-classifier.js - Correction endpoints

Phase 4 (Console)

  • console/css/style.css - Enterprise design system
  • console/js/components.js - DataTable, Toast, Format
  • console/js/api.js - Correction API methods
  • console/index.html - Complete redesign
  • console/pages/urls.html - Enterprise upgrade
  • console/pages/domains.html - Enterprise upgrade
  • console/pages/keywords.html - Enterprise upgrade
  • console/pages/costs.html - Enterprise upgrade
  • console/pages/queues.html - Enterprise upgrade
  • console/pages/corrections.html - NEW page

Future Considerations

Index Versioning (Deferred)

  • Weekly snapshots of Vectorize to R2
  • Version metadata for rollback
  • Not urgent, can implement if quality issues arise

Cross-Pipeline Learning (Deferred)

  • Keyword brand mentions → domain verification
  • URL classification → domain reinforcement
  • Always via LLM with human oversight
  • Queue for review, never auto-update

LLM Model Selection (Deferred)

  • Different models for different confidence levels
  • Claude for low-confidence review
  • Llama 3.3 70B for standard classification

Maintenance Notes

Adding New Classification Dimensions

  1. Update pipeline stage in appropriate classifier
  2. Add rules in classifier-rules-engine.js
  3. Update Vectorize metadata schema
  4. Add column to D1 table
  5. Update admin console charts/tables

Monitoring Classification Quality

  1. Check admin dashboard KPIs daily
  2. Review low-confidence queue weekly
  3. Analyze correction patterns monthly
  4. Update rules engine based on patterns

Deploying Changes

  1. Run migrations: wrangler d1 execute RANKFABRIC_DB --file=migrations/XXXX.sql
  2. Deploy worker: wrangler deploy
  3. Deploy console: wrangler pages deploy console/
  4. Verify via admin console