Classification Dimensions - Source of Truth
This document is the single source of truth for all classification dimensions used in the system.
Overview
There are 3 classification pipelines:
| Pipeline | What Gets Classified | Cost Model | Learning |
|---|---|---|---|
| Domain Classification | The business/website (e.g., spotify.com) | FREE stages first, then PAID | LLM >=80% → Vectorize |
| URL Classification | The specific page (e.g., spotify.com/premium) | Inherits domain, adds page-level | LLM >=65% → Vectorize |
| Keyword Classification | Search queries (e.g., "best project management software") | Rules → Vectorize → LLM | LLM >=80% → Vectorize |
Domain-Level Dimensions
These describe what the business/website IS - set once per domain, cached, reused for all URLs.
| Dimension | Column Name | Description | Valid Values |
|---|---|---|---|
| tier1_type | tier1_type | High-level business archetype (7 types, plus unknown) | platform, marketplace, commerce, service, information, community, institutional, unknown |
| domain_type | domain_type | Specific business type (~45 types, constrained by tier1) | See Domain Types by Tier1 |
| ownership_type | ownership_type | Who controls this content | owned, ugc_platform, third_party_publisher, partner_site, integration_partner, unknown |
| channel_bucket | channel_bucket | Marketing channel classification | pr_earned_media, institutional_authority, citation_local, marketplace_listing, affiliate_partner, influencer_campaign, sponsorship_event, email_newsletter, owned_content_marketing, owned_brand_site, ugc_community, programmatic_seo, risky_blackhat, unknown |
| media_type | media_type | PESO model classification | owned, earned, paid, shared, hybrid, unknown |
| quality_tier | quality_tier | Publisher authority level | tier_1 (DR 80+), tier_2 (60-79), tier_3 (40-59), tier_4 (20-39), tier_5 (0-19), unrated |
| industry | industry | Vertical (optional) | sports, gaming, finance, healthcare, technology, etc. |
| subcategory | subcategory | Specific niche (optional) | Free text, e.g., mobile_gaming, telehealth, saas |
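The quality_tier bands can be expressed as a small lookup. The following is a minimal sketch, not the system's actual code: the function name `qualityTierFromDr` is hypothetical, and it assumes "DR" is a 0-100 rating where missing or out-of-range values map to `unrated`.

```javascript
// Hypothetical helper: maps a DR score (assumed 0-100) to the quality_tier
// values listed in the table above.
function qualityTierFromDr(dr) {
  // Assumption: anything that is not a number in [0, 100] is "unrated".
  if (typeof dr !== "number" || Number.isNaN(dr) || dr < 0 || dr > 100) {
    return "unrated";
  }
  if (dr >= 80) return "tier_1";
  if (dr >= 60) return "tier_2";
  if (dr >= 40) return "tier_3";
  if (dr >= 20) return "tier_4";
  return "tier_5";
}
```

Fractional scores such as 79.5 fall into the lower band (`tier_2`) in this sketch; the table does not say how they should be handled.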
Confidence Columns (Per-Dimension)
Each dimension has its own confidence score (0-100):
| Column | Description |
|---|---|
| tier1_confidence | Confidence in tier1_type |
| domain_type_confidence | Confidence in domain_type |
| classification_confidence | Overall/legacy confidence |
Each domain record also stores classification metadata:
| Column | Description |
|---|---|
| classification_source | How it was classified: rule, v2 (domain database), vectorize, llm, merged |
| classification_context | JSON with signals, rules applied |
| last_classified_at | Unix timestamp |
Domain Types by Tier1
The domain_type is constrained by tier1_type. These are the ONLY valid combinations:
PLATFORM
| domain_type | Examples |
|---|---|
| saas_product | Slack, Notion, Asana, Salesforce |
| code_repository | GitHub, GitLab, Bitbucket |
| app_platform | App Store, Google Play |
| documentation_portal | GitBook, ReadTheDocs |
| messaging_platform | Slack, Discord, WhatsApp |
| social_network | LinkedIn, Facebook, Twitter |
| audio_platform | Spotify, Apple Podcasts |
| video_platform | YouTube, Vimeo, Twitch |
MARKETPLACE (connects buyers and sellers)
| domain_type | Examples |
|---|---|
| ecommerce_marketplace | Amazon, eBay, Etsy |
| ticket_marketplace | Ticketmaster, StubHub |
| real_estate_marketplace | Zillow, Realtor.com |
| job_marketplace | Indeed, LinkedIn Jobs |
| service_marketplace | Upwork, Fiverr, Thumbtack |
| app_marketplace | Chrome Web Store, Salesforce AppExchange |
| review_marketplace | G2, Capterra, Yelp |
| directory_citation | Yellow Pages, BBB |
COMMERCE (sells products directly)
| domain_type | Examples |
|---|---|
| ecommerce_store | Nike.com, Warby Parker |
| travel_booking | Delta, Marriott, Hertz |
| subscription_commerce | Netflix, Dollar Shave Club |
| product_manufacturer | Apple, Samsung, Ford |
SERVICE (sells services)
| domain_type | Examples |
|---|---|
| agency_provider | Ogilvy, McKinsey |
| pr_distribution | PR Newswire, Business Wire |
| professional_service | Deloitte, Baker McKenzie |
| healthcare_provider | Mayo Clinic, Teladoc |
| financial_service | Chase, PayPal, Stripe |
| legal_service | LegalZoom |
INFORMATION (content is the product)
| domain_type | Examples |
|---|---|
| news_publisher | NYT, TechCrunch, The Verge |
| magazine_publisher | Wired, Forbes, Inc. |
| blog_publisher | Medium, Substack |
| content_publisher | BuzzFeed, HuffPost |
| review_site | Wirecutter, Consumer Reports |
| affiliate_review_site | NerdWallet, The Points Guy |
| reference_wiki | Wikipedia, Investopedia |
COMMUNITY
| domain_type | Examples |
|---|---|
| forum_community | Reddit, Stack Overflow |
| gaming_community | IGN forums, Discord servers |
| sports_community | ESPN forums, Fantasy Pros |
| qna_platform | Quora, Stack Exchange |
| ugc_video | YouTube (as community), TikTok |
INSTITUTIONAL (authority-based orgs)
| domain_type | Examples |
|---|---|
| government_site | IRS.gov, CDC.gov |
| education_academic | Stanford.edu, MIT.edu |
| nonprofit_org | Red Cross, Wikimedia Foundation |
| healthcare_institution | Mayo Clinic, Cleveland Clinic |
| financial_institution | Federal Reserve, FDIC |
| legal_institution | Supreme Court, ABA |
| trade_association | IEEE, ACM, NAR |
UNKNOWN/RISK
| domain_type | Description |
|---|---|
| pbn_suspected | Private blog network |
| spam_low_quality | Spam/low quality |
| unknown_other | Fallback |
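Because domain_type is constrained by tier1_type, a classifier output can be checked against the tables above before it is stored. The sketch below is illustrative only: `DOMAIN_TYPES_BY_TIER1` and `isValidCombination` are hypothetical names, not the contents of classification-constants.js. It also assumes that the table starting with saas_product belongs to `platform`, the table starting with forum_community belongs to `community`, and the UNKNOWN/RISK types belong to tier1 `unknown`.

```javascript
// Allowed domain_type values per tier1_type, transcribed from the tables above.
// The platform, community, and unknown groupings are inferred (see lead-in).
const DOMAIN_TYPES_BY_TIER1 = {
  platform: ["saas_product", "code_repository", "app_platform", "documentation_portal",
    "messaging_platform", "social_network", "audio_platform", "video_platform"],
  marketplace: ["ecommerce_marketplace", "ticket_marketplace", "real_estate_marketplace",
    "job_marketplace", "service_marketplace", "app_marketplace", "review_marketplace",
    "directory_citation"],
  commerce: ["ecommerce_store", "travel_booking", "subscription_commerce",
    "product_manufacturer"],
  service: ["agency_provider", "pr_distribution", "professional_service",
    "healthcare_provider", "financial_service", "legal_service"],
  information: ["news_publisher", "magazine_publisher", "blog_publisher",
    "content_publisher", "review_site", "affiliate_review_site", "reference_wiki"],
  community: ["forum_community", "gaming_community", "sports_community",
    "qna_platform", "ugc_video"],
  institutional: ["government_site", "education_academic", "nonprofit_org",
    "healthcare_institution", "financial_institution", "legal_institution",
    "trade_association"],
  unknown: ["pbn_suspected", "spam_low_quality", "unknown_other"],
};

// Returns true only when the pair appears in the table above.
function isValidCombination(tier1Type, domainType) {
  const allowed = DOMAIN_TYPES_BY_TIER1[tier1Type];
  return Array.isArray(allowed) && allowed.includes(domainType);
}
```

A check like this could reject an LLM answer such as `commerce` + `saas_product` instead of writing an invalid pair to the database.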
URL-Level Dimensions
These describe what the specific page IS - classified per URL, inherits domain-level info.
| Dimension | Column Name | Description | Valid Values |
|---|---|---|---|
| page_type | page_type | Type of page (~40 types) | See Page Types |
| page_category | page_category | High-level page category | editorial, commercial, ugc, programmatic, utility, asset, risky |
| tactic_type | tactic_type | Marketing tactic detected (~50 types) | See Tactic Types |
| is_money_page | is_money_page | High-value conversion page flag | 0 or 1 |
| url_pattern | url_pattern | Pattern that matched | Free text |
| modifiers | Stored in context | Boolean signals | See Modifiers |
Page Types
| Category | Page Types |
|---|---|
| Editorial | news_article, opinion_article, feature_article, blog_post, press_release, research_article, resource_guide, howto_article |
| Commercial | product_page, category_page, brand_page, comparison_page, buying_guide, review_page, landing_page, sales_page, checkout_page, pricing_page |
| UGC | forum_thread, forum_post, subreddit_index, qna_page, ugc_article, profile_page, comment_thread, video_page, playlist_page, channel_page, social_post, repository_page |
| Programmatic | location_page, directory_listing, auto_generated_comparison, search_results_page, tag_archive_page, category_index_page |
| Utility | homepage, about_page, contact_page, careers_page, documentation_page, api_reference_page, login_page, signup_page, settings_page, legal_privacy_page, legal_terms_page, faq_page, support_article |
| Asset | pdf_document, image_asset, video_asset, download_page |
| Risk | comment_spam_page, pbn_article_page, malicious_page, parked_domain_page |
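Since every page_type belongs to exactly one row of the table above, page_category can be derived from page_type rather than classified separately. This is a sketch under assumptions: the names `PAGE_TYPES_BY_CATEGORY` and `pageCategoryFor` are hypothetical, and the "Risk" row is assumed to correspond to the `risky` value of page_category.

```javascript
// Page types grouped by page_category, transcribed from the table above.
// Assumption: the "Risk" row maps to the page_category value "risky".
const PAGE_TYPES_BY_CATEGORY = {
  editorial: ["news_article", "opinion_article", "feature_article", "blog_post",
    "press_release", "research_article", "resource_guide", "howto_article"],
  commercial: ["product_page", "category_page", "brand_page", "comparison_page",
    "buying_guide", "review_page", "landing_page", "sales_page", "checkout_page",
    "pricing_page"],
  ugc: ["forum_thread", "forum_post", "subreddit_index", "qna_page", "ugc_article",
    "profile_page", "comment_thread", "video_page", "playlist_page", "channel_page",
    "social_post", "repository_page"],
  programmatic: ["location_page", "directory_listing", "auto_generated_comparison",
    "search_results_page", "tag_archive_page", "category_index_page"],
  utility: ["homepage", "about_page", "contact_page", "careers_page",
    "documentation_page", "api_reference_page", "login_page", "signup_page",
    "settings_page", "legal_privacy_page", "legal_terms_page", "faq_page",
    "support_article"],
  asset: ["pdf_document", "image_asset", "video_asset", "download_page"],
  risky: ["comment_spam_page", "pbn_article_page", "malicious_page",
    "parked_domain_page"],
};

// Invert the grouping once so lookups by page_type are constant time.
const CATEGORY_BY_PAGE_TYPE = new Map();
for (const [category, pageTypes] of Object.entries(PAGE_TYPES_BY_CATEGORY)) {
  for (const pageType of pageTypes) {
    CATEGORY_BY_PAGE_TYPE.set(pageType, category);
  }
}

// Returns the page_category for a page_type, or null for an unlisted type.
function pageCategoryFor(pageType) {
  return CATEGORY_BY_PAGE_TYPE.get(pageType) ?? null;
}
```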
Tactic Types
| Category | Tactic Types |
|---|---|
| PR/Earned | pr_funding_announcement, pr_product_launch, pr_partnership_announcement, pr_wire_distribution, pr_award_announcement, pr_thought_leadership_feature, pr_generic_news_coverage, pr_crisis_coverage, pr_data_report_feature |
| HARO/Expert | haro_expert_quote, haro_expert_roundup, haro_data_contribution, haro_case_study_feature, expert_panel_feature |
| Link Building | guest_post_editorial, guest_post_paid, niche_edit_insertion, resource_page_outreach, broken_link_replacement, skyscraper_outreach, scholarship_link, infographic_embed |
| Affiliate | affiliate_top_listicle, affiliate_single_brand_review, affiliate_comparison_review, affiliate_coupon_deals_page, affiliate_buyer_guide |
| UGC | ugc_forum_organic_thread, ugc_forum_seeded_promo, ugc_qna_answer_seeded, ugc_qna_organic, ugc_reddit_organic_discussion, ugc_reddit_astroturf, ugc_video_description_link, ugc_profile_link, ugc_social_organic_post, ugc_social_astroturf_post |
| Owned | owned_blog_editorial, owned_case_study, owned_feature_landing, owned_resource_hub, owned_doc_or_guide, owned_release_on_own_site, owned_gated_leadgen_asset, owned_webinar_page, owned_podcast_episode_page, owned_changelog_release_notes |
| Programmatic | programmatic_location_page, programmatic_directory_entry, programmatic_comparison_template, programmatic_longtail_template |
| Influencer | influencer_review_post, influencer_social_post, sponsorship_podcast_mention_page, sponsorship_event_page, sponsorship_webinar_hosted_by_partner |
| Marketplace | marketplace_app_listing, marketplace_vendor_profile, marketplace_review_section |
| Blackhat | blackhat_pbn_link, blackhat_comment_spam, blackhat_hacked_site_injection, blackhat_link_farm_page |
Modifiers (Boolean Signals)
| Category | Modifiers |
|---|---|
| Content | has_video, has_audio, has_gallery, has_infographic, has_interactive, user_reviews_present, ugc_present, high_engagement, paywalled |
| Commercial | affiliate_links_present, sponsored_disclosure_present, nofollow_link, dofollow_link, ugc_rel_tag, sponsored_rel_tag |
| Quality/Risk | thin_content, auto_generated, low_quality_template_match, keyword_stuffed, excessive_ads, popup_heavy, broken_layout |
| Foreign/Spam | foreign_language_primary, foreign_language_mixed, non_latin_script, foreign_tld_english_content, machine_translated, gibberish_detected |
| Technical | http_only, redirect_chain, parked_domain, expired_domain |
| Expert/HARO | expert_byline, multiple_expert_quotes, data_citation, case_study_format |
Keyword-Level Dimensions
These describe search queries - classified per keyword.
v2 Dimensions (Recommended)
| Dimension | Column Name | Description | Valid Values |
|---|---|---|---|
| journey_moment | classification_journey_moment | Point in buyer journey (12 values) | problem_unaware, problem_aware, solution_curious, category_exploring, feature_comparing, option_evaluating, provider_selecting, purchase_ready, purchase_completing, onboarding, optimizing, advocating |
| journey_direction | classification_journey_direction | Movement through journey | entering, progressing, stalled, regressing, exiting |
| expertise_level | classification_expertise_level | Knowledge level | novice, intermediate, advanced, expert |
| buyer_behavior | classification_buyer_behavior | Purchase approach | impulsive, methodical, price_sensitive, quality_focused, brand_loyal, research_heavy |
| role_context | classification_role_context | Who is searching | individual_consumer, small_business_owner, enterprise_buyer, developer, marketer, student, professional, hobbyist |
| demand_pattern | classification_demand_pattern | When demand occurs | evergreen, seasonal_recurring, event_driven, trending_spike, declining |
| content_decay | classification_content_decay | How quickly content goes stale | timeless, slow_decay, moderate_decay, fast_decay, real_time |
| query_specificity | classification_query_specificity | Query length/specificity | head, torso, long_tail, ultra_long_tail |
| has_brand_mention | classification_has_brand_mention | Contains brand name | 0 or 1 |
| brand_mentioned | classification_brand_mentioned | The brand name if detected | Free text |
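The two brand dimensions are filled together: a flag and, when set, the matched name. The sketch below shows one possible whole-word lookup. It is an assumption-heavy illustration: `detectBrand`, `normalizeText`, and the `knownBrands` list are hypothetical, the real brand source is not described in this document, and the normalization only handles ASCII letters and digits.

```javascript
// Lowercases and collapses anything that is not a-z or 0-9 into single spaces.
function normalizeText(text) {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// knownBrands is a hypothetical array of brand names.
// Returns the two brand columns; a null brand when nothing matches is an assumption.
function detectBrand(keyword, knownBrands) {
  // Pad with spaces so that only whole-word (or whole-phrase) matches count.
  const haystack = ` ${normalizeText(keyword)} `;
  for (const brand of knownBrands) {
    const needle = normalizeText(brand);
    if (needle && haystack.includes(` ${needle} `)) {
      return { classification_has_brand_mention: 1, classification_brand_mentioned: brand };
    }
  }
  return { classification_has_brand_mention: 0, classification_brand_mentioned: null };
}
```

Matching on padded whole words keeps a query like "best slacks for men" from matching the brand "Slack".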
v1 Dimensions (Backwards Compatible)
| Dimension | Column Name | Valid Values |
|---|---|---|
| funnel_stage | classification_funnel_stage | awareness, consideration, decision, retention, advocacy |
| intent_type | classification_intent_type | informational, navigational, commercial_investigation, transactional, local, support |
| keyword_pattern | classification_keyword_pattern | question, comparison, superlative, problem, use_case, etc. |
| ranking_difficulty | classification_ranking_difficulty | very_easy, easy, medium, hard, very_hard |
| trend_velocity | classification_trend_velocity | rising_fast, rising_slow, stable, declining, volatile |
| seasonality | classification_seasonality | evergreen, seasonal, event_driven |
| competitive_density | classification_competitive_density | low, medium, high, dominated |
| commercial_value_index | classification_commercial_value_index | Numeric (volume x CPC) |
| sentiment_intent | classification_sentiment_intent | positive, neutral, negative, urgent |
| content_format_intent | classification_content_format_intent | listicle, comparison_review, how_to_guide, video_preferred, etc. |
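The commercial_value_index in the table above is described as volume x CPC. A minimal sketch of that arithmetic follows; the function name is hypothetical, and returning null for missing or negative inputs is an assumption rather than documented behavior.

```javascript
// volume: search volume; cpc: cost per click. Units are not specified in this document.
function commercialValueIndex(volume, cpc) {
  // Assumption: invalid inputs produce null instead of a misleading 0.
  if (!Number.isFinite(volume) || !Number.isFinite(cpc) || volume < 0 || cpc < 0) {
    return null;
  }
  return volume * cpc;
}
```

For example, a keyword with a volume of 1000 and a CPC of 2.5 has an index of 2500.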
Classification Pipeline Stages
Domain Classification (8 Stages)
| Stage | Name | Cost | What It Does |
|---|---|---|---|
| 0 | Cache | FREE | Check D1 for existing classification |
| 1 | Rules | FREE | Known domains, TLDs (.gov/.edu), platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data → tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to labeled domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), CMS/og:type detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |
Early exit: If confidence >= 70% at stages 3 or 4, skip remaining stages.
Self-learning: LLM classifications with >= 80% confidence are upserted to Vectorize.
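The staged flow with early exit can be sketched as a loop over stage functions. This is not the code in domain-classifier.js: `classifyDomain` and the stage objects are hypothetical, the stages are synchronous for brevity (real stages would involve network calls), and keeping the highest-confidence result seen so far is an assumption. Only the rule "confidence >= 70 at stage 3 or 4 skips the remaining stages" comes from the text above.

```javascript
const EARLY_EXIT_CONFIDENCE = 70;
const EARLY_EXIT_STAGES = new Set([3, 4]);

// Each stage is { id, run }, where run(domain) returns null or an object
// with at least a numeric `confidence`. The run functions are hypothetical stubs.
function classifyDomain(domain, stages) {
  let best = null;
  for (const stage of stages) {
    const result = stage.run(domain);
    // Assumption: keep whichever result has the highest confidence so far.
    if (result && (best === null || result.confidence > best.confidence)) {
      best = { ...result, stage: stage.id };
    }
    // Documented rule: a confident result at stage 3 or 4 ends the pipeline.
    if (
      EARLY_EXIT_STAGES.has(stage.id) &&
      best !== null &&
      best.confidence >= EARLY_EXIT_CONFIDENCE
    ) {
      break;
    }
  }
  return best;
}
```

Ordering the free stages first means the paid stages (4 and 5) only run when the cheaper ones have not produced a confident answer.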
URL Classification (5 Stages)
| Stage | Name | Cost | What It Does |
|---|---|---|---|
| 0 | Domain Cache | FREE | Use cached domain classification |
| 1 | Rules | FREE | URL patterns, known sites |
| 2 | Vectorize | FREE | Similarity to labeled URLs |
| 3 | Content Parse | $0.000125 | DataForSEO Instant Pages |
| 4 | LLM | ~$0.0001 | Workers AI fallback |
Self-learning: LLM classifications with >= 65% confidence are upserted to Vectorize.
Keyword Classification (4 Stages)
| Stage | Name | Cost | Coverage |
|---|---|---|---|
| 1 | Rules | FREE | ~60% of dimensions |
| 2 | Vectorize | ~$0.00001 | ~25% of dimensions |
| 3 | Brand Lookup | FREE | 1 dimension |
| 4 | LLM | ~$0.0001 | ~15% (fallback) |
Self-learning: LLM classifications with >= 80% confidence are upserted to Vectorize.
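The three self-learning thresholds (80% for domains, 65% for URLs, 80% for keywords) can be kept in one place. The sketch below is illustrative: `LEARNING_THRESHOLDS` and `shouldUpsertToVectorize` are hypothetical names, and the function only decides whether to learn from a result. It does not call Vectorize.

```javascript
// Minimum LLM confidence (0-100) before a result is fed back into Vectorize.
const LEARNING_THRESHOLDS = new Map([
  ["domain", 80],
  ["url", 65],
  ["keyword", 80],
]);

// pipeline: "domain" | "url" | "keyword"; source: a classification_source value.
function shouldUpsertToVectorize(pipeline, source, confidence) {
  const threshold = LEARNING_THRESHOLDS.get(pipeline);
  if (threshold === undefined) {
    throw new Error(`Unknown pipeline: ${pipeline}`);
  }
  // Only LLM classifications are learned from, per the rules above.
  return source === "llm" && confidence >= threshold;
}
```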
Source Files
| File | Purpose |
|---|---|
| src/lib/classification-constants.js | All enum definitions, mappings, helpers |
| src/data/domain-database.js | Generated from master CSV (4,400+ domains) |
| classification-data/domains/_master.csv | Curated domain classifications |
| src/lib/classifier-rules-engine.js | Stage 1: Rules engine |
| src/lib/classifier-vectorize.js | Stage 2: Vectorize similarity |
| src/lib/low-noise-crawler.js | Stage 3: FREE homepage metadata extraction |
| src/lib/classifier-content-parser.js | Stage 4: DataForSEO Instant Pages |
| src/lib/classifier-llm.js | Stage 5: Workers AI LLM |
| src/lib/domain-classifier.js | Domain classification orchestrator |
| src/lib/backlink-classifier.js | URL classification orchestrator |
| src/lib/target-url-classifier.js | Target URL classification (rules only, FREE) |
| src/lib/keyword-classifier.js | Keyword classification orchestrator |
Database Tables
domains
domain TEXT PRIMARY KEY,
tier1_type TEXT,
domain_type TEXT,
ownership_type TEXT,
channel_bucket TEXT,
media_type TEXT,
quality_tier TEXT,
industry TEXT,
subcategory TEXT,
tier1_confidence REAL,
domain_type_confidence REAL,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
last_classified_at INTEGER
urls
url TEXT PRIMARY KEY,
domain TEXT,
domain_id INTEGER,
page_type TEXT,
page_category TEXT,
tactic_type TEXT,
is_money_page INTEGER DEFAULT 0,
url_pattern TEXT,
classification_source TEXT,
classification_confidence INTEGER,
last_classified_at INTEGER
keywords
keyword TEXT,
classification_funnel_stage TEXT,
classification_intent_type TEXT,
classification_journey_moment TEXT,
classification_expertise_level TEXT,
classification_buyer_behavior TEXT,
classification_query_specificity TEXT,
classification_has_brand_mention INTEGER,
classification_brand_mentioned TEXT,
classification_source TEXT,
classification_version INTEGER DEFAULT 2,
classification_completed_at INTEGER
Current State Issues
Database vs Master CSV Mismatch
The database contains ~50 domain_type values from LLM classifications using an older taxonomy. The master CSV contains ~25 domain_type values using the V3 taxonomy.
Database domain_types (sample of ~8,000 domains):
- saas_product (4441) - Valid V3
- service_business (852) - NOT V3, should be professional_service or agency_provider
- forum_community_board (413) - NOT V3, should be forum_community
- blog_content_site (278) - NOT V3, should be blog_publisher or content_publisher
- entertainment_streaming (235) - NOT V3, should be video_platform or audio_platform
- telehealth_provider (216) - NOT V3, should be healthcare_provider
- personal_site_portfolio (175) - NOT V3, no direct equivalent
- etc.
Master CSV domain_types (sample of ~4,400 domains):
- saas_product (1639)
- ecommerce_store (536)
- government_site (257)
- travel_booking (256)
- financial_institution (245)
- etc.
Fix Required
- Create mapping from old LLM types → V3 types
- Update database with corrected types
- Re-seed Vectorize with V3 types
- Update LLM prompts to output V3 types only
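The first fix item, a mapping from old LLM types to V3 types, could start from the sample listed above. The sketch below covers only those sampled legacy types; `LEGACY_TO_V3` and `migrateDomainType` are hypothetical names, and leaving ambiguous or unmapped rows unresolved (`domain_type: null`, flagged for review) is an assumption about how the migration might be staged.

```javascript
// Legacy domain_type -> candidate V3 types, copied from the database sample above.
const LEGACY_TO_V3 = new Map([
  ["service_business", ["professional_service", "agency_provider"]],
  ["forum_community_board", ["forum_community"]],
  ["blog_content_site", ["blog_publisher", "content_publisher"]],
  ["entertainment_streaming", ["video_platform", "audio_platform"]],
  ["telehealth_provider", ["healthcare_provider"]],
  ["personal_site_portfolio", []], // no direct V3 equivalent
]);

// validV3Types: a Set of V3 domain_type names (hypothetical input).
function migrateDomainType(legacyType, validV3Types) {
  // Already a V3 type: nothing to change.
  if (validV3Types.has(legacyType)) {
    return { domain_type: legacyType, candidates: [], needsReview: false };
  }
  const candidates = LEGACY_TO_V3.get(legacyType) ?? [];
  // Exactly one candidate: the rewrite is unambiguous.
  if (candidates.length === 1) {
    return { domain_type: candidates[0], candidates, needsReview: false };
  }
  // Zero or several candidates: leave unresolved and flag for review.
  return { domain_type: null, candidates, needsReview: true };
}
```

Rows flagged with `needsReview` (for example service_business, which has two candidates) would need a rule, a reclassification pass, or a manual decision before the database update.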