Classification Pipeline Master Plan
Executive Summary
This document tracks the implementation of critical improvements to our three classification pipelines (Domain, URL, Keyword).
STATUS: Sprints 1-5 COMPLETE (December 2024)
Implementation Status
| Sprint | Phase | Status | Notes |
|---|---|---|---|
| Sprint 1 | Remove Bubble-Up, Domain-First | COMPLETE | Removed maybeUpdateDomainClassification(), added domain-first enforcement |
| Sprint 2 | Learning Improvements | COMPLETE | All pipelines now consistently feed Vectorize |
| Sprint 3 | Negative Learning | COMPLETE | classification_corrections table, correction_patterns, API endpoints |
| Sprint 4 | Admin Console | COMPLETE | Enterprise-grade UI with DataTable, KPIs, charts |
| Sprint 5 | Polish & Monitor | COMPLETE | Documentation, mermaid diagrams updated |
Completed Work
Sprint 1: Clean Foundation (COMPLETE)
1.1 Removed Domain Bubble-Up
- Removed `maybeUpdateDomainClassification()` from `backlink-classify-consumer.js`
- Removed all calls at lines 624 and 936
- Removed unused `DOMAIN_AGGREGATION_*` constants
1.2 Enforce Domain Classification Before URL
- Added domain classification check in `ensureUrl()` in `url-management.js`
- URLs cannot be classified without their domain being classified first
- If the domain is not classified, URL classification waits or triggers domain classification
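As a sketch, the domain-first gate can be reduced to a small decision function. All names here are illustrative, not the actual shape of `ensureUrl()` in `url-management.js`:

```javascript
// Hypothetical sketch of the domain-first gate enforced in ensureUrl().
// Assumes the domain row carries a `classified` flag; names are illustrative.
function urlClassificationAction(domain) {
  if (!domain) {
    // No domain row yet: create and classify the domain before the URL.
    return 'create-and-classify-domain';
  }
  if (!domain.classified) {
    // Domain exists but is unclassified: trigger domain classification
    // and defer the URL until it completes.
    return 'classify-domain-first';
  }
  // Domain is classified: the URL may proceed to classification.
  return 'classify-url';
}
```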
Sprint 2: Learning Improvements (COMPLETE)
2.1 Domain Pipeline Learning
- Moved learning to Stage 6 for ALL classification sources (not just LLM)
- Imports centralized from `classification-config.js`:
  - `DOMAIN_LEARNING_MIN_CONFIDENCE` (80%)
  - `shouldTriggerLearning(confidence, 'domain')`
- Rules engine matches at 95%+ now feed Vectorize
- LLM results at 80%+ feed Vectorize
2.2 URL Pipeline Learning
- Standardized learning thresholds via `classification-config.js`
- Uses `LEARNING_MIN_CONFIDENCE` (65%) consistently
- Added `shouldTriggerLearning()` checks in `learnFromClassification()`
2.3 Keyword Pipeline
- Already well-designed, learning at 70% threshold
- No changes needed, serves as model for other pipelines
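The centralized learning gate referenced across the three sections above could look like the following. This is a sketch assuming the thresholds quoted in this document (80% domain, 65% URL, 70% keyword); the actual export names and shape in `classification-config.js` may differ:

```javascript
// Sketch of a centralized learning gate, using the per-pipeline
// thresholds quoted in this document. Names are illustrative.
const LEARNING_THRESHOLDS = {
  domain: 80,  // DOMAIN_LEARNING_MIN_CONFIDENCE
  url: 65,     // LEARNING_MIN_CONFIDENCE
  keyword: 70,
};

function shouldTriggerLearning(confidence, entityType) {
  const min = LEARNING_THRESHOLDS[entityType];
  if (min === undefined) return false; // unknown pipeline: never learn
  return confidence >= min;
}
```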
Sprint 3: Negative Learning (COMPLETE)
3.1 Database Schema
Created `migrations/0119_classification_corrections.sql`:
```sql
-- Track classification corrections/feedback
CREATE TABLE classification_corrections (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL, -- 'domain', 'url', 'keyword'
  entity_id INTEGER NOT NULL,
  original_dimension TEXT NOT NULL,
  original_value TEXT,
  corrected_value TEXT NOT NULL,
  confidence_before INTEGER,
  notes TEXT,
  created_by TEXT DEFAULT 'system',
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  processed_at INTEGER,
  UNIQUE(entity_type, entity_id, original_dimension, corrected_value)
);

-- Track correction patterns for generating rules
CREATE TABLE correction_patterns (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL,
  dimension TEXT NOT NULL,
  from_value TEXT,
  to_value TEXT NOT NULL,
  count INTEGER DEFAULT 1,
  suggested_rule TEXT,
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  last_seen_at INTEGER NOT NULL DEFAULT (unixepoch()),
  UNIQUE(entity_type, dimension, from_value, to_value)
);
```
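The `UNIQUE(entity_type, dimension, from_value, to_value)` constraint is what lets repeated corrections collapse into counted patterns. An in-memory sketch of that aggregation (illustrative only; the real module presumably does this in D1 with an `INSERT ... ON CONFLICT ... DO UPDATE` statement):

```javascript
// Illustrative in-memory mirror of the correction_patterns upsert:
// one entry per (entity_type, dimension, from_value, to_value), with a
// count that grows each time the same correction recurs.
// Assumes values contain no '|' (only used as a key separator here).
function recordPattern(patterns, { entityType, dimension, fromValue, toValue }) {
  const key = [entityType, dimension, fromValue ?? '', toValue].join('|');
  const existing = patterns.get(key);
  if (existing) {
    existing.count += 1; // mirrors: ON CONFLICT DO UPDATE SET count = count + 1
  } else {
    patterns.set(key, { entityType, dimension, fromValue, toValue, count: 1 });
  }
  return patterns;
}
```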
3.2 Corrections Module
Created `src/lib/classification-corrections.js`:
- `recordCorrection()` - Save correction and update entity
- `getPendingCorrections()` - Get unprocessed corrections
- `getSuggestedRules()` - Analyze patterns for auto-rules
- `getCorrectionHistory()` - View correction history
- `getCorrectionStats()` - Dashboard statistics
- `processPendingCorrections()` - Batch process for Vectorize feedback
3.3 Admin Endpoints
Added to `src/endpoints/admin-classifier.js`:
- `POST /api/admin/classifier/corrections` - Submit correction
- `GET /api/admin/classifier/corrections/stats` - Dashboard stats
- `GET /api/admin/classifier/corrections/history/:type/:id` - Entity history
- `GET /api/admin/classifier/corrections/patterns` - Suggested rules
- `POST /api/admin/classifier/corrections/learn` - Process pending
Sprint 4: Admin Console (COMPLETE)
4.1 Enterprise Design System
Complete rewrite of console/css/style.css:
- CSS custom properties for theming
- Dark theme with professional color palette
- KPI cards with trends and colors
- Progress bars and badges
- Toast notifications
- Responsive grid layouts
4.2 Reusable Components
Created console/js/components.js:
- DataTable class with:
- Sorting (click headers)
- Pagination (configurable page sizes)
- Search (debounced)
- Custom cell renderers
- Loading/empty states
- Toast notification system
- Format helpers (number, percent, currency, date, relative)
- confidenceBadge() - Color-coded confidence display
- classificationBadge() - Classification value badges
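As an illustration, a `confidenceBadge()` helper of the kind listed above can be very small. The 80/50 band cut-offs and CSS class names here are assumptions, not necessarily what `components.js` ships:

```javascript
// Illustrative confidenceBadge(): map a confidence percentage to a
// color-coded badge. Band cut-offs (80/50) and class names are assumed.
function confidenceBadge(confidence) {
  const band = confidence >= 80 ? 'high'
             : confidence >= 50 ? 'medium'
             : 'low';
  return `<span class="badge badge-${band}">${confidence}%</span>`;
}
```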
4.3 Page Upgrades
Main Dashboard (console/index.html):
- New sidebar with logo and nav sections
- System KPIs with trend indicators
- Entity cards with progress bars
- Health table with status badges
- Distribution charts
URLs Page (console/pages/urls.html):
- Tabbed interface (Overview, Distributions, Recent, Needs Review)
- DataTable with full sorting/pagination
- Source filter dropdown
- Confidence distribution charts
Domains Page (console/pages/domains.html):
- Tabbed interface
- Confidence by dimension table
- Domain type distribution charts
- Classify unclassified button
Keywords Page (console/pages/keywords.html):
- 11 dimension charts
- Brand and location stats
- Journey/Intent confidence KPIs
- Random sample loader
Costs Page (console/pages/costs.html):
- Budget tracking with alerts
- Daily cost line chart
- Service breakdown pie charts
- Cost per request efficiency
- Export CSV button
Queues Page (console/pages/queues.html):
- Queue cards grid with status badges
- Processing rate chart
- Failed messages table
- Retry all / Retry individual buttons
- Real-time status indicators
Corrections Page (console/pages/corrections.html):
- NEW page for negative learning
- Correction stats KPIs
- Pattern analysis
- Add correction form
- Correction history table
Architecture Diagrams
Classification Pipeline Flow
```mermaid
flowchart TB
    subgraph Entry["ENTRY POINTS"]
        API["API Request<br/>/keywords, /urls, /backlinks"]
        CRON["Cron Jobs<br/>Daily subscriptions"]
        WEBHOOK["Webhooks<br/>DataForSEO callbacks"]
    end
    subgraph Gates["GATEKEEPERS"]
        ED["ensureDomain()"]
        EU["ensureUrl()"]
    end
    subgraph Queues["CLASSIFICATION QUEUES"]
        DQ[["DOMAIN_CLASSIFY_QUEUE"]]
        UQ[["URL_CLASSIFY_QUEUE"]]
        KQ[["KEYWORD_CLASSIFY_QUEUE"]]
        LQ[["LLM_VERIFY_QUEUE"]]
    end
    subgraph Pipelines["CLASSIFICATION PIPELINES"]
        DP["Domain Pipeline<br/>7 stages"]
        UP["URL Pipeline<br/>6 stages"]
        KP["Keyword Pipeline<br/>5 stages"]
    end
    subgraph Learning["SELF-LEARNING"]
        VD["Vectorize<br/>domain-classifier<br/>>= 80%"]
        VU["Vectorize<br/>backlink-classifier<br/>>= 65%"]
        VK["Vectorize<br/>keyword-classifier<br/>>= 70%"]
    end
    subgraph Storage["D1 STORAGE"]
        DOMAINS[("domains")]
        URLS[("urls")]
        KEYWORDS[("keywords")]
        CORRECTIONS[("classification_corrections")]
    end
    subgraph Admin["ADMIN CONSOLE"]
        DASH["Dashboard<br/>KPIs & Charts"]
        REVIEW["Review Queue<br/>Low confidence"]
        CORR["Corrections<br/>Negative learning"]
    end
    API --> ED
    CRON --> ED
    WEBHOOK --> EU
    ED -->|"not classified"| DQ
    EU -->|"domain first"| ED
    EU -->|"then classify"| UQ
    DQ --> DP
    UQ --> UP
    KQ --> KP
    DP -->|">= 80%"| VD
    UP -->|">= 65%"| VU
    KP -->|">= 70%"| VK
    DP --> DOMAINS
    UP --> URLS
    KP --> KEYWORDS
    UP -->|"< 50%"| LQ
    LQ --> UP
    DOMAINS --> DASH
    URLS --> DASH
    KEYWORDS --> DASH
    DASH --> REVIEW
    REVIEW --> CORR
    CORR --> CORRECTIONS
    CORRECTIONS -->|"patterns"| VD
    CORRECTIONS -->|"patterns"| VU
    CORRECTIONS -->|"patterns"| VK
    style DP fill:#e8f5e9
    style UP fill:#e8f5e9
    style KP fill:#e8f5e9
    style VD fill:#e0f7fa
    style VU fill:#e0f7fa
    style VK fill:#e0f7fa
    style CORRECTIONS fill:#fff3e0
```
Domain Classification Pipeline Detail
```mermaid
flowchart LR
    subgraph FREE["FREE STAGES"]
        S0["0. Cache<br/>D1 lookup"]
        S1["1. Rules<br/>7,100+ domains"]
        S15["1.5 Google Ads<br/>Category hints"]
        S2["2. Vectorize<br/>Semantic similarity"]
        S3["3. Low-Noise<br/>HEAD + 8KB GET"]
    end
    subgraph PAID["PAID STAGES"]
        S4["4. Instant Pages<br/>$0.000125"]
        S45["4.5 Patterns<br/>Domain regex"]
        S5["5. LLM<br/>~$0.0001"]
    end
    S6["6. Store & Learn<br/>Save + Vectorize"]
    S0 -->|"MISS"| S1
    S1 -->|"< 70%"| S15
    S15 -->|"< 70%"| S2
    S2 -->|"< 70%"| S3
    S3 -->|"< 70%"| S4
    S4 -->|"< 70%"| S45
    S45 -->|"< 70%"| S5
    S5 --> S6
    S0 -->|">= 60%"| DONE["Done"]
    S1 -->|">= 80%"| S6
    S2 -->|">= 80%"| S6
    S3 -->|">= 70%"| S6
    S4 -->|">= 70%"| S6
    S45 -->|">= 70%"| S6
    style S0 fill:#c8e6c9
    style S1 fill:#c8e6c9
    style S15 fill:#c8e6c9
    style S2 fill:#c8e6c9
    style S3 fill:#c8e6c9
    style S4 fill:#fff3e0
    style S45 fill:#c8e6c9
    style S5 fill:#ffcdd2
    style S6 fill:#e0f7fa
```
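The waterfall above (cheap stages first, escalating only while confidence stays below a stage's acceptance threshold) can be sketched as a generic driver. The shape is illustrative, not the actual code in `domain-classifier.js`; the real pipeline is asynchronous and per-stage thresholds come from the diagram:

```javascript
// Generic sketch of the stage waterfall: run stages in cost order and
// stop at the first result meeting that stage's acceptance threshold;
// otherwise fall through and keep the best result seen so far.
function classifyDomain(domain, stages) {
  let best = null;
  for (const stage of stages) {
    const result = stage.run(domain);
    if (!best || result.confidence > best.confidence) best = result;
    if (result.confidence >= stage.acceptAt) {
      return { ...result, stage: stage.name }; // accepted: store & learn
    }
  }
  return { ...best, stage: 'fallthrough' }; // best effort after last stage
}
```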
Admin Console Architecture
```mermaid
flowchart TB
    subgraph Console["ADMIN CONSOLE (Cloudflare Pages)"]
        INDEX["index.html<br/>Dashboard"]
        DOMAINS["domains.html<br/>Domain classification"]
        URLS["urls.html<br/>URL classification"]
        KEYWORDS["keywords.html<br/>Keyword classification"]
        COSTS["costs.html<br/>Cost tracking"]
        QUEUES["queues.html<br/>Queue status"]
        CORR["corrections.html<br/>Negative learning"]
    end
    subgraph Components["REUSABLE COMPONENTS"]
        DT["DataTable<br/>Sort/Page/Search"]
        TOAST["Toast<br/>Notifications"]
        FMT["Format<br/>Number/Date/Currency"]
        BADGE["Badges<br/>Confidence/Classification"]
    end
    subgraph API["WORKER API"]
        STATS["/api/admin/classifier/*"]
        QUEUE["/api/admin/queues/*"]
        COST["/api/admin/costs/*"]
        CORR_API["/api/admin/classifier/corrections/*"]
    end
    subgraph Storage["D1 DATABASE"]
        D[("domains")]
        U[("urls")]
        K[("keywords")]
        CC[("classification_corrections")]
        CP[("correction_patterns")]
    end
    INDEX --> DT
    DOMAINS --> DT
    URLS --> DT
    KEYWORDS --> DT
    QUEUES --> DT
    CORR --> DT
    INDEX --> STATS
    DOMAINS --> STATS
    URLS --> STATS
    KEYWORDS --> STATS
    COSTS --> COST
    QUEUES --> QUEUE
    CORR --> CORR_API
    STATS --> D
    STATS --> U
    STATS --> K
    CORR_API --> CC
    CORR_API --> CP
    style INDEX fill:#e3f2fd
    style DOMAINS fill:#e8f5e9
    style URLS fill:#fff3e0
    style KEYWORDS fill:#f3e5f5
    style COSTS fill:#fce4ec
    style QUEUES fill:#e0f7fa
    style CORR fill:#fff8e1
```
Success Metrics
| Metric | Target | Current |
|---|---|---|
| Domain Classification Coverage | 100% before URLs | COMPLETE |
| Learning Rate (Vectorize) | > 60% of high-conf | COMPLETE |
| Correction System | Implemented | COMPLETE |
| Admin Console | Enterprise-grade | COMPLETE |
| Review Queue | < 100 items | MONITORING |
Files Modified/Created
Phase 1 (Remove Bubble-Up)
- `src/queue/backlink-classify-consumer.js` - Removed bubble-up function
- `src/lib/url-management.js` - Domain-first enforcement
Phase 2 (Learning)
- `src/lib/domain-classifier.js` - Learning for all stages
- `src/lib/url-classifier.js` - Standardized learning thresholds
- `src/lib/classification-config.js` - Centralized thresholds
Phase 3 (Negative Learning)
- `migrations/0119_classification_corrections.sql` - New tables
- `src/lib/classification-corrections.js` - NEW module
- `src/endpoints/admin-classifier.js` - Correction endpoints
Phase 4 (Console)
- `console/css/style.css` - Enterprise design system
- `console/js/components.js` - DataTable, Toast, Format
- `console/js/api.js` - Correction API methods
- `console/index.html` - Complete redesign
- `console/pages/urls.html` - Enterprise upgrade
- `console/pages/domains.html` - Enterprise upgrade
- `console/pages/keywords.html` - Enterprise upgrade
- `console/pages/costs.html` - Enterprise upgrade
- `console/pages/queues.html` - Enterprise upgrade
- `console/pages/corrections.html` - NEW page
Future Considerations
Index Versioning (Deferred)
- Weekly snapshots of Vectorize to R2
- Version metadata for rollback
- Not urgent, can implement if quality issues arise
Cross-Pipeline Learning (Deferred)
- Keyword brand mentions → domain verification
- URL classification → domain reinforcement
- Always via LLM with human oversight
- Queue for review, never auto-update
LLM Model Selection (Deferred)
- Different models for different confidence levels
- Claude for low-confidence review
- Llama 3.3 70B for standard classification
Maintenance Notes
Adding New Classification Dimensions
- Update pipeline stage in appropriate classifier
- Add rules in `classifier-rules-engine.js`
- Update Vectorize metadata schema
- Add column to D1 table
- Update admin console charts/tables
Monitoring Classification Quality
- Check admin dashboard KPIs daily
- Review low-confidence queue weekly
- Analyze correction patterns monthly
- Update rules engine based on patterns
Deploying Changes
- Run migrations: `wrangler d1 execute RANKFABRIC_DB --file=migrations/XXXX.sql`
- Deploy worker: `wrangler deploy`
- Deploy console: `wrangler pages deploy console/`
- Verify via admin console