Skip to main content

App Store Crawler System

Comprehensive Apple App Store and Google Play catalog crawling, rankings tracking, and app recommendations.


Overview

Automated crawler for app store data. Tracks rankings, collects app metadata, similar apps (recommendations), and maintains a searchable app catalog.

Key Features:

  • Apple App Store: Groupings, Rooms, Stories, Charts (573 categories)
  • Google Play: Category charts via DataForSEO (373 categories)
  • Rankings tracking: Historical position data per category
  • Recommendations: Similar apps network for ranking apps
  • App metadata: Ratings, reviews, developers, descriptions, release dates
  • Weekly cron: Automated Monday crawls at 2 AM (Apple) and 3 AM (Google Play) UTC

Architecture

Crawl Flow

┌─────────────────────────────────────────────────────────────────┐
│ CRON TRIGGER │
│ Monday 2 AM UTC (Apple) / 3 AM UTC (Google) │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ taxonomy-crawl.js │
│ Queues all categories to rankfabric-tasks-v2 │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ app-crawl-consumer.js │
│ Processes crawl_category messages, fetches rankings │
│ Queues apps to app-info-fetch with crawl_depth: 1 │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ app-details-consumer.js │
│ Fetches full app details (iTunes API + HTML scrape) │
│ IF crawl_depth > 0: saves recommendations, queues │
│ similar apps with crawl_depth: 0 │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ shelf-crawl-consumer.js │
│ Deep crawls room/story URLs for complete app lists │
│ Queues apps to app-info-fetch with crawl_depth: 1 │
└─────────────────────────────────────────────────────────────────┘

crawl_depth Parameter

Controls whether an app's similar apps are processed:

ValueMeaningRecommendation Processing
1Ranking app (from charts/shelves)YES - save to app_recommendations, queue similar apps
0Similar app (queued from recommendations)NO - just fetch details, don't chain

This prevents infinite crawl chains.


Database Tables

Core Tables

TablePurpose
appsApp metadata (title, developer, rating, description, etc.)
app_store_categoriesCategory taxonomy (groupings, rooms, charts, stories)
app_category_rankingsApp rankings per category per week
app_recommendationsSimilar apps relationships (source_app → recommended_app)
crawl_job_runsCrawl run tracking and progress
crawl_category_statusPer-category crawl status

Key Fields in app_category_rankings

  • category_id - Internal category ID
  • app_id - App ID (Apple numeric or Google Play package)
  • app_rank - Position in category
  • year_week - YYYYWW format for weekly snapshots
  • parent_category_id - Parent grouping for filtering
  • category_type - grouping, room, chart, story, hero
  • category_section - apps, games, arcade

API Endpoints

Client-Facing Endpoints

Get Top Apps

GET /api/top-apps?platform=apple&limit=100
GET /api/top-apps?platform=apple&category_id=143&limit=50
GET /api/top-apps?platform=google_play&section=games

Query Parameters:

  • platform - apple or google_play (required)
  • limit - Number of apps (default 100)
  • category_id - Filter by category internal ID
  • section - apps, games, or arcade
  • content_type - Alias for section

Get Categories

GET /api/app-store/categories?platform=apple&content_type=apps
GET /api/app-store/categories?platform=google_play&content_type=games

Query Parameters:

  • platform - apple or google_play (required)
  • content_type - apps, games, or arcade
  • active_only - true to only show categories with rankings

Get Child Categories (Subcategories)

GET /api/app-store/child-categories?parent_id=143&platform=apple

App Lookup

GET /api/app-store/lookup?app_id=333903271&platform=apple

Returns full app details including recommendations.


Admin/Crawl Endpoints

Trigger Full Taxonomy Crawl

POST /api/admin/taxonomy-crawl
Body: {"platform": "apple"}

Queues ALL categories for the platform. Used by cron.

Apple Grouping Scraper (Direct)

POST /api/app-store/grouping
Body: {
"grouping_id": "25188",
"save_rankings": true,
"device": "iphone"
}

Scrapes a single Apple grouping page. Use for testing.

Queue Status

GET /api/admin/queue-status

Shows recent crawl activity, category counts, rankings by platform.

Crawl Runs

GET /api/admin/crawl-runs?limit=20

Shows crawl run history with progress.


Cron Configuration

Defined in wrangler.toml:

[triggers]
crons = [
"0 2 * * 1", # Apple crawl: Monday 2 AM UTC
"0 3 * * 1", # Google Play crawl: Monday 3 AM UTC
"0 2 * * *", # Daily subscriptions at 2 AM UTC
"0 10 * * *" # SERP tracking at 10 AM UTC (2 AM PST)
]

Cron Handler (src/index.js)

async scheduled(event, env, ctx) {
const hour = now.getUTCHours();
const dayOfWeek = now.getUTCDay(); // 0=Sun, 1=Mon

if (hour === 2 && dayOfWeek === 1) {
// Monday 2 AM UTC: Apple crawl
await handleTaxonomyCrawl(env, "apple");
}

if (hour === 3 && dayOfWeek === 1) {
// Monday 3 AM UTC: Google Play crawl
await handleTaxonomyCrawl(env, "google_play");
}
}

Queue Configuration

Queues (wrangler.toml)

QueuePurposeConcurrency
rankfabric-tasks-v2Category crawl tasks20
app-info-fetchApp detail fetching10
shelf-deep-crawlRoom/story deep crawls10

DRAIN_MODE

Emergency stop for runaway crawls:

// In src/index.js queue handler
const DRAIN_MODE = false; // Set to true to purge queue

When true, all queue messages are ACKed without processing.


Category Types

Apple

TypeCountDescription
grouping~50Main navigation categories (Health & Fitness, Productivity)
room~300Curated collections ("Best Running Apps")
chart~80Top Free, Top Paid per genre
story~60Editorial feature articles
hero~50Featured carousel slots

Google Play

TypeCountDescription
grouping~50Main categories
chart~320Charts per category (topselling_free, topselling_paid, topgrossing)

Data Sources

Apple

  1. Direct HTML scrape - Grouping pages (free)
  2. ZenRows - Room pages when blocked ($0.001/request)
  3. RSS Feed - Chart rankings (free, up to 200 apps)
  4. iTunes API - App metadata (free)

Google Play

  1. DataForSEO - App lists and details ($0.02-0.05/request)

Recommendations System

How It Works

  1. Ranking apps (from charts/shelves) are queued with crawl_depth: 1
  2. app-details-consumer.js fetches app details including similar_apps
  3. If crawl_depth > 0:
    • Save each similar app to app_recommendations table
    • Queue similar apps with crawl_depth: 0 for their details
  4. Similar apps get fetched but DON'T process their own similar apps

app_recommendations Table

CREATE TABLE app_recommendations (
id TEXT PRIMARY KEY,
source_app_id TEXT, -- The ranking app
recommended_app_id TEXT, -- The similar app
platform TEXT,
section TEXT, -- 'similar'
position INTEGER,
first_seen INTEGER,
last_seen INTEGER,
times_seen INTEGER,
is_active INTEGER
);

Troubleshooting

Crawl Not Running

  1. Check DRAIN_MODE is false in src/index.js
  2. Check cron schedule in wrangler.toml
  3. Check GET /api/admin/crawl-runs for recent runs

Missing Rankings

  1. Check category exists: GET /api/app-store/categories?platform=apple
  2. Check crawl status: GET /api/admin/queue-status
  3. Verify app exists in apps table

Recommendations Empty

  1. Verify crawl_depth: 1 is being passed when queuing ranking apps
  2. Check app-details-consumer.js logs for "Saved X recommendations"
  3. Query: SELECT COUNT(*) FROM app_recommendations