Classification Dimensions - Source of Truth

This document is the single source of truth for all classification dimensions used in the system.


Overview

There are 3 classification pipelines:

| Pipeline | What Gets Classified | Cost Model | Learning |
| --- | --- | --- | --- |
| Domain Classification | The business/website (e.g., spotify.com) | FREE stages first, then PAID | LLM >=80% → Vectorize |
| URL Classification | The specific page (e.g., spotify.com/premium) | Inherits domain, adds page-level | LLM >=65% → Vectorize |
| Keyword Classification | Search queries (e.g., "best project management software") | Rules → Vectorize → LLM | LLM >=80% → Vectorize |

Domain-Level Dimensions

These describe what the business/website IS - set once per domain, cached, reused for all URLs.

| Dimension | Column Name | Description | Valid Values |
| --- | --- | --- | --- |
| tier1_type | tier1_type | High-level business archetype (7 types, plus unknown) | platform, marketplace, commerce, service, information, community, institutional, unknown |
| domain_type | domain_type | Specific business type (~45 types, constrained by tier1) | See Domain Types by Tier1 |
| ownership_type | ownership_type | Who controls this content | owned, ugc_platform, third_party_publisher, partner_site, integration_partner, unknown |
| channel_bucket | channel_bucket | Marketing channel classification | pr_earned_media, institutional_authority, citation_local, marketplace_listing, affiliate_partner, influencer_campaign, sponsorship_event, email_newsletter, owned_content_marketing, owned_brand_site, ugc_community, programmatic_seo, risky_blackhat, unknown |
| media_type | media_type | PESO model classification | owned, earned, paid, shared, hybrid, unknown |
| quality_tier | quality_tier | Publisher authority level | tier_1 (DR 80+), tier_2 (60-79), tier_3 (40-59), tier_4 (20-39), tier_5 (0-19), unrated |
| industry | industry | Vertical (optional) | sports, gaming, finance, healthcare, technology, etc. |
| subcategory | subcategory | Specific niche (optional) | Free text, e.g., mobile_gaming, telehealth, saas |
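
The quality_tier bands above can be expressed as a small lookup. The following is a minimal sketch, not the system's actual helper: the function name qualityTierFromDomainRating is hypothetical, and it assumes DR is a number on a 0-100 scale, with anything missing or out of range treated as unrated.

```javascript
// Hypothetical helper (name and error handling are assumptions, not from the codebase).
// Maps a Domain Rating (DR) to the quality_tier bands listed in the table above.
function qualityTierFromDomainRating(dr) {
  // Assumption: missing, non-numeric, or out-of-range ratings are "unrated".
  if (typeof dr !== "number" || Number.isNaN(dr) || dr < 0 || dr > 100) {
    return "unrated";
  }
  if (dr >= 80) return "tier_1"; // DR 80+
  if (dr >= 60) return "tier_2"; // DR 60-79
  if (dr >= 40) return "tier_3"; // DR 40-59
  if (dr >= 20) return "tier_4"; // DR 20-39
  return "tier_5";               // DR 0-19
}
```

Checking the bands from the top down keeps each boundary in exactly one place, so a change to a threshold only touches one line.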

Confidence Columns (Per-Dimension)

Each dimension has its own confidence score (0-100):

| Column | Description |
| --- | --- |
| tier1_confidence | Confidence in tier1_type |
| domain_type_confidence | Confidence in domain_type |
| classification_confidence | Overall/legacy confidence |

Metadata Columns

| Column | Description |
| --- | --- |
| classification_source | How it was classified: rule, v2 (domain database), vectorize, llm, merged |
| classification_context | JSON with signals, rules applied |
| last_classified_at | Unix timestamp |

Domain Types by Tier1

The domain_type is constrained by tier1_type. These are the ONLY valid combinations:

PLATFORM (tools you log into)

| domain_type | Examples |
| --- | --- |
| saas_product | Slack, Notion, Asana, Salesforce |
| code_repository | GitHub, GitLab, Bitbucket |
| app_platform | App Store, Google Play |
| documentation_portal | GitBook, ReadTheDocs |
| messaging_platform | Slack, Discord, WhatsApp |
| social_network | LinkedIn, Facebook, Twitter |
| audio_platform | Spotify, Apple Podcasts |
| video_platform | YouTube, Vimeo, Twitch |

MARKETPLACE (connects buyers and sellers)

| domain_type | Examples |
| --- | --- |
| ecommerce_marketplace | Amazon, eBay, Etsy |
| ticket_marketplace | Ticketmaster, StubHub |
| real_estate_marketplace | Zillow, Realtor.com |
| job_marketplace | Indeed, LinkedIn Jobs |
| service_marketplace | Upwork, Fiverr, Thumbtack |
| app_marketplace | Chrome Web Store, Salesforce AppExchange |
| review_marketplace | G2, Capterra, Yelp |
| directory_citation | Yellow Pages, BBB |

COMMERCE (sells products directly)

| domain_type | Examples |
| --- | --- |
| ecommerce_store | Nike.com, Warby Parker |
| travel_booking | Delta, Marriott, Hertz |
| subscription_commerce | Netflix, Dollar Shave Club |
| product_manufacturer | Apple, Samsung, Ford |

SERVICE (sells services)

| domain_type | Examples |
| --- | --- |
| agency_provider | Ogilvy, McKinsey |
| pr_distribution | PR Newswire, Business Wire |
| professional_service | Deloitte, Baker McKenzie |
| healthcare_provider | Mayo Clinic, Teladoc |
| financial_service | Chase, PayPal, Stripe |
| legal_service | LegalZoom |

INFORMATION (content is the product)

| domain_type | Examples |
| --- | --- |
| news_publisher | NYT, TechCrunch, The Verge |
| magazine_publisher | Wired, Forbes, Inc. |
| blog_publisher | Medium, Substack |
| content_publisher | BuzzFeed, HuffPost |
| review_site | Wirecutter, Consumer Reports |
| affiliate_review_site | NerdWallet, The Points Guy |
| reference_wiki | Wikipedia, Investopedia |

COMMUNITY (UGC-dominated)

| domain_type | Examples |
| --- | --- |
| forum_community | Reddit, Stack Overflow |
| gaming_community | IGN forums, Discord servers |
| sports_community | ESPN forums, Fantasy Pros |
| qna_platform | Quora, Stack Exchange |
| ugc_video | YouTube (as community), TikTok |

INSTITUTIONAL (authority-based orgs)

| domain_type | Examples |
| --- | --- |
| government_site | IRS.gov, CDC.gov |
| education_academic | Stanford.edu, MIT.edu |
| nonprofit_org | Red Cross, Wikimedia Foundation |
| healthcare_institution | Mayo Clinic, Cleveland Clinic |
| financial_institution | Federal Reserve, FDIC |
| legal_institution | Supreme Court, ABA |
| trade_association | IEEE, ACM, NAR |

UNKNOWN/RISK

| domain_type | Description |
| --- | --- |
| pbn_suspected | Private blog network |
| spam_low_quality | Spam/low quality |
| unknown_other | Fallback |
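
Because domain_type is constrained by tier1_type, a classifier result can be checked against the combinations listed in this section before it is stored. The sketch below is illustrative only: the names DOMAIN_TYPES_BY_TIER1 and isValidDomainType are hypothetical (the real definitions live in src/lib/classification-constants.js and may be shaped differently), and it assumes the UNKNOWN/RISK group corresponds to tier1_type unknown.

```javascript
// Hypothetical constant: allowed domain_type values per tier1_type,
// copied from the tables in this section.
const DOMAIN_TYPES_BY_TIER1 = {
  platform: [
    "saas_product", "code_repository", "app_platform", "documentation_portal",
    "messaging_platform", "social_network", "audio_platform", "video_platform",
  ],
  marketplace: [
    "ecommerce_marketplace", "ticket_marketplace", "real_estate_marketplace",
    "job_marketplace", "service_marketplace", "app_marketplace",
    "review_marketplace", "directory_citation",
  ],
  commerce: [
    "ecommerce_store", "travel_booking", "subscription_commerce", "product_manufacturer",
  ],
  service: [
    "agency_provider", "pr_distribution", "professional_service",
    "healthcare_provider", "financial_service", "legal_service",
  ],
  information: [
    "news_publisher", "magazine_publisher", "blog_publisher", "content_publisher",
    "review_site", "affiliate_review_site", "reference_wiki",
  ],
  community: [
    "forum_community", "gaming_community", "sports_community", "qna_platform", "ugc_video",
  ],
  institutional: [
    "government_site", "education_academic", "nonprofit_org", "healthcare_institution",
    "financial_institution", "legal_institution", "trade_association",
  ],
  // Assumption: the UNKNOWN/RISK group maps to tier1_type "unknown".
  unknown: ["pbn_suspected", "spam_low_quality", "unknown_other"],
};

// Returns true only when the pair is one of the listed combinations.
function isValidDomainType(tier1Type, domainType) {
  const allowed = DOMAIN_TYPES_BY_TIER1[tier1Type];
  // Array.isArray also rejects inherited keys such as "constructor".
  return Array.isArray(allowed) && allowed.includes(domainType);
}
```

A check like this could run on LLM output before it is written back, so that a type from an older taxonomy is rejected rather than stored.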

URL-Level Dimensions

These describe what the specific page IS - classified per URL, inherits domain-level info.

| Dimension | Column Name | Description | Valid Values |
| --- | --- | --- | --- |
| page_type | page_type | Type of page (57 types) | See Page Types |
| page_category | page_category | High-level page category | editorial, commercial, ugc, programmatic, utility, asset, risky |
| tactic_type | tactic_type | Marketing tactic detected (63 types) | See Tactic Types |
| is_money_page | is_money_page | High-value conversion page flag | 0 or 1 |
| url_pattern | url_pattern | Pattern that matched | Free text |
| modifiers | Stored in context | Boolean signals | See Modifiers |

Page Types

| Category | Page Types |
| --- | --- |
| Editorial | news_article, opinion_article, feature_article, blog_post, press_release, research_article, resource_guide, howto_article |
| Commercial | product_page, category_page, brand_page, comparison_page, buying_guide, review_page, landing_page, sales_page, checkout_page, pricing_page |
| UGC | forum_thread, forum_post, subreddit_index, qna_page, ugc_article, profile_page, comment_thread, video_page, playlist_page, channel_page, social_post, repository_page |
| Programmatic | location_page, directory_listing, auto_generated_comparison, search_results_page, tag_archive_page, category_index_page |
| Utility | homepage, about_page, contact_page, careers_page, documentation_page, api_reference_page, login_page, signup_page, settings_page, legal_privacy_page, legal_terms_page, faq_page, support_article |
| Asset | pdf_document, image_asset, video_asset, download_page |
| Risk | comment_spam_page, pbn_article_page, malicious_page, parked_domain_page |
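
Since each page_type belongs to one category, page_category can be derived from page_type rather than classified separately. The following sketch inverts the table above into a lookup. The names PAGE_TYPES_BY_CATEGORY and pageCategoryFor are hypothetical, and mapping the "Risk" row to the page_category value risky is an assumption based on the valid values listed for page_category.

```javascript
// Hypothetical constant built from the Page Types table.
// Assumption: the "Risk" row corresponds to page_category "risky".
const PAGE_TYPES_BY_CATEGORY = {
  editorial: [
    "news_article", "opinion_article", "feature_article", "blog_post",
    "press_release", "research_article", "resource_guide", "howto_article",
  ],
  commercial: [
    "product_page", "category_page", "brand_page", "comparison_page", "buying_guide",
    "review_page", "landing_page", "sales_page", "checkout_page", "pricing_page",
  ],
  ugc: [
    "forum_thread", "forum_post", "subreddit_index", "qna_page", "ugc_article",
    "profile_page", "comment_thread", "video_page", "playlist_page", "channel_page",
    "social_post", "repository_page",
  ],
  programmatic: [
    "location_page", "directory_listing", "auto_generated_comparison",
    "search_results_page", "tag_archive_page", "category_index_page",
  ],
  utility: [
    "homepage", "about_page", "contact_page", "careers_page", "documentation_page",
    "api_reference_page", "login_page", "signup_page", "settings_page",
    "legal_privacy_page", "legal_terms_page", "faq_page", "support_article",
  ],
  asset: ["pdf_document", "image_asset", "video_asset", "download_page"],
  risky: ["comment_spam_page", "pbn_article_page", "malicious_page", "parked_domain_page"],
};

// Invert the table once so each lookup is a single Map access.
const PAGE_CATEGORY_BY_TYPE = new Map();
for (const [category, pageTypes] of Object.entries(PAGE_TYPES_BY_CATEGORY)) {
  for (const pageType of pageTypes) {
    PAGE_CATEGORY_BY_TYPE.set(pageType, category);
  }
}

// Returns the page_category for a page_type, or null if the type is not listed.
function pageCategoryFor(pageType) {
  return PAGE_CATEGORY_BY_TYPE.get(pageType) ?? null;
}
```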

Tactic Types

| Category | Tactic Types |
| --- | --- |
| PR/Earned | pr_funding_announcement, pr_product_launch, pr_partnership_announcement, pr_wire_distribution, pr_award_announcement, pr_thought_leadership_feature, pr_generic_news_coverage, pr_crisis_coverage, pr_data_report_feature |
| HARO/Expert | haro_expert_quote, haro_expert_roundup, haro_data_contribution, haro_case_study_feature, expert_panel_feature |
| Link Building | guest_post_editorial, guest_post_paid, niche_edit_insertion, resource_page_outreach, broken_link_replacement, skyscraper_outreach, scholarship_link, infographic_embed |
| Affiliate | affiliate_top_listicle, affiliate_single_brand_review, affiliate_comparison_review, affiliate_coupon_deals_page, affiliate_buyer_guide |
| UGC | ugc_forum_organic_thread, ugc_forum_seeded_promo, ugc_qna_answer_seeded, ugc_qna_organic, ugc_reddit_organic_discussion, ugc_reddit_astroturf, ugc_video_description_link, ugc_profile_link, ugc_social_organic_post, ugc_social_astroturf_post |
| Owned | owned_blog_editorial, owned_case_study, owned_feature_landing, owned_resource_hub, owned_doc_or_guide, owned_release_on_own_site, owned_gated_leadgen_asset, owned_webinar_page, owned_podcast_episode_page, owned_changelog_release_notes |
| Programmatic | programmatic_location_page, programmatic_directory_entry, programmatic_comparison_template, programmatic_longtail_template |
| Influencer | influencer_review_post, influencer_social_post, sponsorship_podcast_mention_page, sponsorship_event_page, sponsorship_webinar_hosted_by_partner |
| Marketplace | marketplace_app_listing, marketplace_vendor_profile, marketplace_review_section |
| Blackhat | blackhat_pbn_link, blackhat_comment_spam, blackhat_hacked_site_injection, blackhat_link_farm_page |

Modifiers (Boolean Signals)

| Category | Modifiers |
| --- | --- |
| Content | has_video, has_audio, has_gallery, has_infographic, has_interactive, user_reviews_present, ugc_present, high_engagement, paywalled |
| Commercial | affiliate_links_present, sponsored_disclosure_present, nofollow_link, dofollow_link, ugc_rel_tag, sponsored_rel_tag |
| Quality/Risk | thin_content, auto_generated, low_quality_template_match, keyword_stuffed, excessive_ads, popup_heavy, broken_layout |
| Foreign/Spam | foreign_language_primary, foreign_language_mixed, non_latin_script, foreign_tld_english_content, machine_translated, gibberish_detected |
| Technical | http_only, redirect_chain, parked_domain, expired_domain |
| Expert/HARO | expert_byline, multiple_expert_quotes, data_citation, case_study_format |

Keyword-Level Dimensions

These describe search queries - classified per keyword.

| Dimension | Column Name | Description | Valid Values |
| --- | --- | --- | --- |
| journey_moment | classification_journey_moment | Point in buyer journey (12 values) | problem_unaware, problem_aware, solution_curious, category_exploring, feature_comparing, option_evaluating, provider_selecting, purchase_ready, purchase_completing, onboarding, optimizing, advocating |
| journey_direction | classification_journey_direction | Movement through journey | entering, progressing, stalled, regressing, exiting |
| expertise_level | classification_expertise_level | Knowledge level | novice, intermediate, advanced, expert |
| buyer_behavior | classification_buyer_behavior | Purchase approach | impulsive, methodical, price_sensitive, quality_focused, brand_loyal, research_heavy |
| role_context | classification_role_context | Who is searching | individual_consumer, small_business_owner, enterprise_buyer, developer, marketer, student, professional, hobbyist |
| demand_pattern | classification_demand_pattern | When demand occurs | evergreen, seasonal_recurring, event_driven, trending_spike, declining |
| content_decay | classification_content_decay | How quickly content goes stale | timeless, slow_decay, moderate_decay, fast_decay, real_time |
| query_specificity | classification_query_specificity | Query length/specificity | head, torso, long_tail, ultra_long_tail |
| has_brand_mention | classification_has_brand_mention | Contains brand name | 0 or 1 |
| brand_mentioned | classification_brand_mentioned | The brand name if detected | Free text |

v1 Dimensions (Backwards Compatible)

| Dimension | Column Name | Valid Values |
| --- | --- | --- |
| funnel_stage | classification_funnel_stage | awareness, consideration, decision, retention, advocacy |
| intent_type | classification_intent_type | informational, navigational, commercial_investigation, transactional, local, support |
| keyword_pattern | classification_keyword_pattern | question, comparison, superlative, problem, use_case, etc. |
| ranking_difficulty | classification_ranking_difficulty | very_easy, easy, medium, hard, very_hard |
| trend_velocity | classification_trend_velocity | rising_fast, rising_slow, stable, declining, volatile |
| seasonality | classification_seasonality | evergreen, seasonal, event_driven |
| competitive_density | classification_competitive_density | low, medium, high, dominated |
| commercial_value_index | classification_commercial_value_index | Numeric (volume x CPC) |
| sentiment_intent | classification_sentiment_intent | positive, neutral, negative, urgent |
| content_format_intent | classification_content_format_intent | listicle, comparison_review, how_to_guide, video_preferred, etc. |

Classification Pipeline Stages

Domain Classification (8 Stages)

| Stage | Name | Cost | What It Does |
| --- | --- | --- | --- |
| 0 | Cache | FREE | Check D1 for existing classification |
| 1 | Rules | FREE | Known domains, TLDs (.gov/.edu), platform patterns |
| 1.5 | Google Ads Categories | FREE | Use cached DFS category data → tier1_type hint |
| 2 | Vectorize | FREE | Semantic similarity to labeled domains |
| 3 | Low-Noise Crawl | FREE | HEAD + partial GET (8KB), CMS/og:type detection |
| 4 | Instant Pages | $0.000125 | DataForSEO full page fetch |
| 4.5 | Domain Patterns | FREE | Fallback rules for placeholder pages |
| 5 | LLM | ~$0.0001 | Workers AI for ambiguous cases |

Early exit: If confidence >= 70% at stages 3 or 4, skip remaining stages.

Self-learning: LLM classifications with >= 80% confidence are upserted to Vectorize.
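
The early-exit rule above can be sketched as a loop over ordered stages. This is a simplified illustration, not the orchestrator in src/lib/domain-classifier.js: the function classifyDomain, the stage object shape ({ id, run }), and the choice to keep the highest-confidence result are all assumptions. Real stages would be asynchronous; the sketch is synchronous to stay short, and it does not model how stages 0 to 2 terminate because the rule above only covers stages 3 and 4.

```javascript
// Threshold and stage ids come from the early-exit rule above.
const EARLY_EXIT_CONFIDENCE = 70;
const EARLY_EXIT_STAGES = new Set(["3", "4"]);

// Hypothetical orchestrator sketch.
// `stages` is an ordered array of { id: string, run: (domain) => result | null },
// where a result carries at least a numeric `confidence` (0-100).
function classifyDomain(domain, stages) {
  let best = null;
  for (const stage of stages) {
    const result = stage.run(domain);
    if (!result) continue; // this stage had nothing to say

    // Assumption: keep whichever result has the highest confidence so far.
    if (best === null || result.confidence > best.confidence) {
      best = { ...result, stage: stage.id };
    }

    // Early exit: a confident result at stage 3 or 4 skips the remaining stages.
    if (EARLY_EXIT_STAGES.has(stage.id) && result.confidence >= EARLY_EXIT_CONFIDENCE) {
      break;
    }
  }
  return best;
}
```

Stage ids are strings rather than numbers so that the half stages (1.5 and 4.5) can be named without floating-point comparisons.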

URL Classification (5 Stages)

| Stage | Name | Cost | What It Does |
| --- | --- | --- | --- |
| 0 | Domain Cache | FREE | Use cached domain classification |
| 1 | Rules | FREE | URL patterns, known sites |
| 2 | Vectorize | FREE | Similarity to labeled URLs |
| 3 | Content Parse | $0.000125 | DataForSEO Instant Pages |
| 4 | LLM | ~$0.0001 | Workers AI fallback |

Self-learning: LLM classifications with >= 65% confidence are upserted to Vectorize.

Keyword Classification (4 Stages)

| Stage | Name | Cost | Coverage |
| --- | --- | --- | --- |
| 1 | Rules | FREE | ~60% of dimensions |
| 2 | Vectorize | ~$0.00001 | ~25% of dimensions |
| 3 | Brand Lookup | FREE | 1 dimension |
| 4 | LLM | ~$0.0001 | ~15% (fallback) |

Self-learning: LLM classifications with >= 80% confidence are upserted to Vectorize.
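
The three self-learning thresholds (80% for domains, 65% for URLs, 80% for keywords) can be kept in one place. The sketch below is a hedged illustration: LEARNING_THRESHOLDS and shouldUpsertToVectorize are hypothetical names, and the result shape ({ source, confidence }) is assumed from the classification_source and confidence columns described earlier. The actual Vectorize upsert call is deliberately left out.

```javascript
// Self-learning thresholds per pipeline, taken from the notes in this section.
const LEARNING_THRESHOLDS = new Map([
  ["domain", 80],
  ["url", 65],
  ["keyword", 80],
]);

// Hypothetical gate: decides whether a classification should be written
// back to Vectorize. Only LLM results at or above the threshold qualify.
function shouldUpsertToVectorize(pipeline, result) {
  const threshold = LEARNING_THRESHOLDS.get(pipeline);
  if (threshold === undefined) {
    throw new Error(`Unknown pipeline: ${pipeline}`);
  }
  return result.source === "llm" && result.confidence >= threshold;
}
```

For example, a URL result with source llm and confidence 70 would qualify, while the same result in the domain pipeline would not.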


Source Files

| File | Purpose |
| --- | --- |
| src/lib/classification-constants.js | All enum definitions, mappings, helpers |
| src/data/domain-database.js | Generated from master CSV (4,400+ domains) |
| classification-data/domains/_master.csv | Curated domain classifications |
| src/lib/classifier-rules-engine.js | Stage 1: Rules engine |
| src/lib/classifier-vectorize.js | Stage 2: Vectorize similarity |
| src/lib/low-noise-crawler.js | Stage 3: FREE homepage metadata extraction |
| src/lib/classifier-content-parser.js | Stage 4: DataForSEO Instant Pages |
| src/lib/classifier-llm.js | Stage 5: Workers AI LLM |
| src/lib/domain-classifier.js | Domain classification orchestrator |
| src/lib/backlink-classifier.js | URL classification orchestrator |
| src/lib/target-url-classifier.js | Target URL classification (rules only, FREE) |
| src/lib/keyword-classifier.js | Keyword classification orchestrator |

Database Tables

domains

domain TEXT PRIMARY KEY,
tier1_type TEXT,
domain_type TEXT,
ownership_type TEXT,
channel_bucket TEXT,
media_type TEXT,
quality_tier TEXT,
industry TEXT,
subcategory TEXT,
tier1_confidence REAL,
domain_type_confidence REAL,
classification_source TEXT,
classification_confidence INTEGER,
classification_context TEXT,
last_classified_at INTEGER

urls

url TEXT PRIMARY KEY,
domain TEXT,
domain_id INTEGER,
page_type TEXT,
page_category TEXT,
tactic_type TEXT,
is_money_page INTEGER DEFAULT 0,
url_pattern TEXT,
classification_source TEXT,
classification_confidence INTEGER,
last_classified_at INTEGER

keywords

keyword TEXT,
classification_funnel_stage TEXT,
classification_intent_type TEXT,
classification_journey_moment TEXT,
classification_expertise_level TEXT,
classification_buyer_behavior TEXT,
classification_query_specificity TEXT,
classification_has_brand_mention INTEGER,
classification_brand_mentioned TEXT,
classification_source TEXT,
classification_version INTEGER DEFAULT 2,
classification_completed_at INTEGER

Current State Issues

Database vs Master CSV Mismatch

The database contains ~50 domain_type values from LLM classifications using an older taxonomy. The master CSV contains ~25 domain_type values using the V3 taxonomy.

Database domain_types (sample of ~8,000 domains):

  • saas_product (4441) - Valid V3
  • service_business (852) - NOT V3, should be professional_service or agency_provider
  • forum_community_board (413) - NOT V3, should be forum_community
  • blog_content_site (278) - NOT V3, should be blog_publisher or content_publisher
  • entertainment_streaming (235) - NOT V3, should be video_platform or audio_platform
  • telehealth_provider (216) - NOT V3, should be healthcare_provider
  • personal_site_portfolio (175) - NOT V3, no direct equivalent
  • etc.

Master CSV domain_types (sample of ~4,400 domains):

  • saas_product (1639)
  • ecommerce_store (536)
  • government_site (257)
  • travel_booking (256)
  • financial_institution (245)
  • etc.

Fix Required

  1. Create mapping from old LLM types → V3 types
  2. Update database with corrected types
  3. Re-seed Vectorize with V3 types
  4. Update LLM prompts to output V3 types only
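
Step 1 of the fix could start from the legacy types listed under "Database vs Master CSV Mismatch". The sketch below is one possible shape for that mapping, not an existing module: LEGACY_TO_V3 and migrateDomainType are hypothetical names, and only the legacy types named in this document are included. Types with two candidate V3 values are flagged for review instead of being guessed.

```javascript
// Hypothetical mapping from legacy LLM domain types to V3 candidates,
// limited to the examples listed in this document.
const LEGACY_TO_V3 = new Map([
  ["service_business", ["professional_service", "agency_provider"]],
  ["forum_community_board", ["forum_community"]],
  ["blog_content_site", ["blog_publisher", "content_publisher"]],
  ["entertainment_streaming", ["video_platform", "audio_platform"]],
  ["telehealth_provider", ["healthcare_provider"]],
  ["personal_site_portfolio", []], // no direct V3 equivalent
]);

// Resolves a legacy type to a migration decision.
function migrateDomainType(legacyType) {
  const candidates = LEGACY_TO_V3.get(legacyType);
  if (candidates === undefined) {
    // Not a known legacy type; it may already be a valid V3 value.
    return { status: "not_in_mapping" };
  }
  if (candidates.length === 0) {
    return { status: "no_equivalent" };
  }
  if (candidates.length === 1) {
    return { status: "mapped", domainType: candidates[0] };
  }
  // Ambiguous: needs a human or a re-classification to pick one.
  return { status: "needs_review", candidates };
}
```

Rows that come back as needs_review or no_equivalent could be re-queued through the domain pipeline once the LLM prompts emit V3 types only.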