App Store Crawler System
Comprehensive Apple App Store and Google Play catalog crawling, rankings tracking, and app recommendations.
Overview
Automated crawler for app store data. It tracks rankings, collects app metadata and similar apps (recommendations), and maintains a searchable app catalog.
Key Features:
- Apple App Store: Groupings, Rooms, Stories, Charts (573 categories)
- Google Play: Category charts via DataForSEO (373 categories)
- Rankings tracking: Historical position data per category
- Recommendations: Similar-apps network for apps that appear in rankings
- App metadata: Ratings, reviews, developers, descriptions, release dates
- Weekly cron: Automated Monday crawls at 2 AM (Apple) and 3 AM (Google Play) UTC
Architecture
Crawl Flow
┌─────────────────────────────────────────────────────────────────┐
│ CRON TRIGGER │
│ Monday 2 AM UTC (Apple) / 3 AM UTC (Google) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ taxonomy-crawl.js │
│ Queues all categories to rankfabric-tasks-v2 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ app-crawl-consumer.js │
│ Processes crawl_category messages, fetches rankings │
│ Queues apps to app-info-fetch with crawl_depth: 1 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ app-details-consumer.js │
│ Fetches full app details (iTunes API + HTML scrape) │
│ IF crawl_depth > 0: saves recommendations, queues │
│ similar apps with crawl_depth: 0 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ shelf-crawl-consumer.js │
│ Deep crawls room/story URLs for complete app lists │
│ Queues apps to app-info-fetch with crawl_depth: 1 │
└─────────────────────────────────────────────────────────────────┘
crawl_depth Parameter
Controls whether an app's similar apps are processed:
| Value | Meaning | Recommendation Processing |
|---|---|---|
| 1 | Ranking app (from charts/shelves) | YES - save to app_recommendations, queue similar apps |
| 0 | Similar app (queued from recommendations) | NO - just fetch details, don't chain |
This prevents infinite crawl chains: similar apps are queued with crawl_depth: 0, so crawling stops one hop away from a ranking app.
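The gating rule in the table can be sketched as a small pure function. This is a minimal illustration, not the code in app-details-consumer.js: the function name `planRecommendationWork`, the message fields `app_id` and `platform`, and the returned shape are assumptions. Only `crawl_depth` and its two values come from the description above.

```javascript
// Hypothetical helper: decides what to do with an app's similar apps.
// Only `crawl_depth` (1 = ranking app, 0 = similar app) comes from the document.
function planRecommendationWork(message, similarAppIds) {
  // Assumption: a missing crawl_depth is treated as 0 ("do not chain").
  const depth = message.crawl_depth ?? 0;

  if (depth <= 0) {
    // Similar app: fetch details only, never chain further.
    return { saveRecommendations: false, toQueue: [] };
  }

  // Ranking app: save recommendations and queue each similar app at depth 0.
  const toQueue = similarAppIds.map((appId) => ({
    app_id: appId,
    platform: message.platform,
    crawl_depth: 0,
  }));
  return { saveRecommendations: true, toQueue };
}
```

Keeping the decision in one pure function makes the stop condition easy to unit-test separately from queue and database calls.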
Database Tables
Core Tables
| Table | Purpose |
|---|---|
| apps | App metadata (title, developer, rating, description, etc.) |
| app_store_categories | Category taxonomy (groupings, rooms, charts, stories) |
| app_category_rankings | App rankings per category per week |
| app_recommendations | Similar apps relationships (source_app → recommended_app) |
| crawl_job_runs | Crawl run tracking and progress |
| crawl_category_status | Per-category crawl status |
Key Fields in app_category_rankings
- `category_id` - Internal category ID
- `app_id` - App ID (Apple numeric or Google Play package)
- `app_rank` - Position in category
- `year_week` - YYYYWW format for weekly snapshots
- `parent_category_id` - Parent grouping for filtering
- `category_type` - grouping, room, chart, story, hero
- `category_section` - apps, games, arcade
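A `year_week` value in YYYYWW form could be derived from a crawl timestamp as sketched below. The document does not say which week-numbering scheme is used, so this example assumes ISO 8601 weeks (weeks start on Monday, and week 1 contains the first Thursday of the year) computed in UTC. The function name `toYearWeek` and the string return type are also assumptions.

```javascript
// Hypothetical helper: formats a Date as YYYYWW.
// Assumption: ISO 8601 week numbering, evaluated in UTC.
function toYearWeek(date) {
  // Copy the date at UTC midnight so the input is not mutated.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // ISO weekday: Monday = 1 ... Sunday = 7.
  const isoDay = d.getUTCDay() || 7;
  // Move to the Thursday of the same ISO week; its year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - isoDay);
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
  return `${d.getUTCFullYear()}${String(week).padStart(2, "0")}`;
}
```

Under ISO numbering, days near New Year can belong to the neighbouring year's week. For example, Sunday 2023-01-01 falls in week 52 of 2022, so a weekly snapshot never splits across a year boundary.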
API Endpoints
Client-Facing Endpoints
Get Top Apps
GET /api/top-apps?platform=apple&limit=100
GET /api/top-apps?platform=apple&category_id=143&limit=50
GET /api/top-apps?platform=google_play&section=games
Query Parameters:
- `platform` - `apple` or `google_play` (required)
- `limit` - Number of apps (default 100)
- `category_id` - Filter by category internal ID
- `section` - `apps`, `games`, or `arcade`
- `content_type` - Alias for `section`
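As an illustration of how a client might assemble these requests, the sketch below builds a top-apps URL from the documented query parameters. The helper name `buildTopAppsUrl` and the base URL `https://example.com` are placeholders, not part of the API. The platform check mirrors the "required" note above, but how the server handles a missing platform is not documented here.

```javascript
// Hypothetical client helper for GET /api/top-apps.
// Parameter names come from the list above; everything else is illustrative.
function buildTopAppsUrl(baseUrl, { platform, limit, category_id, section } = {}) {
  if (platform !== "apple" && platform !== "google_play") {
    throw new Error("platform is required: 'apple' or 'google_play'");
  }
  const url = new URL("/api/top-apps", baseUrl);
  url.searchParams.set("platform", platform);
  // Optional filters are only added when the caller supplies them.
  if (limit !== undefined) url.searchParams.set("limit", String(limit));
  if (category_id !== undefined) url.searchParams.set("category_id", String(category_id));
  if (section !== undefined) url.searchParams.set("section", section);
  return url.toString();
}

// Usage (base URL is a placeholder):
// const res = await fetch(buildTopAppsUrl("https://example.com", { platform: "apple", limit: 50 }));
```

Building the query with `URLSearchParams` avoids hand-concatenating `&`-separated pairs, which is where encoding mistakes usually creep in.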
Get Categories
GET /api/app-store/categories?platform=apple&content_type=apps
GET /api/app-store/categories?platform=google_play&content_type=games
Query Parameters:
- `platform` - `apple` or `google_play` (required)
- `content_type` - `apps`, `games`, or `arcade`
- `active_only` - `true` to only show categories with rankings
Get Child Categories (Subcategories)
GET /api/app-store/child-categories?parent_id=143&platform=apple
App Lookup
GET /api/app-store/lookup?app_id=333903271&platform=apple
Returns full app details including recommendations.
Admin/Crawl Endpoints
Trigger Full Taxonomy Crawl
POST /api/admin/taxonomy-crawl
Body: {"platform": "apple"}
Queues ALL categories for the platform. Used by cron.
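A manual trigger could be issued from a script along the following lines. This is a sketch under assumptions: the helper name `taxonomyCrawlRequest` and the base URL are placeholders, and any authentication the admin endpoint requires is not described in this document, so none is shown.

```javascript
// Hypothetical helper: builds the arguments for fetch() to trigger a full crawl.
// Endpoint path and body shape come from the document; auth is intentionally omitted.
function taxonomyCrawlRequest(baseUrl, platform) {
  if (platform !== "apple" && platform !== "google_play") {
    throw new Error("platform must be 'apple' or 'google_play'");
  }
  return {
    url: new URL("/api/admin/taxonomy-crawl", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ platform }),
    },
  };
}

// Usage (base URL is a placeholder):
// const { url, init } = taxonomyCrawlRequest("https://example.com", "apple");
// const res = await fetch(url, init);
```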
Apple Grouping Scraper (Direct)
POST /api/app-store/grouping
Body: {
"grouping_id": "25188",
"save_rankings": true,
"device": "iphone"
}
Scrapes a single Apple grouping page. Use for testing.
Queue Status
GET /api/admin/queue-status
Shows recent crawl activity, category counts, rankings by platform.
Crawl Runs
GET /api/admin/crawl-runs?limit=20
Shows crawl run history with progress.
Cron Configuration
Defined in wrangler.toml:
[triggers]
crons = [
"0 2 * * 1", # Apple crawl: Monday 2 AM UTC
"0 3 * * 1", # Google Play crawl: Monday 3 AM UTC
"0 2 * * *", # Daily subscriptions at 2 AM UTC
"0 10 * * *" # SERP tracking at 10 AM UTC (2 AM PST)
]
Cron Handler (src/index.js)
async scheduled(event, env, ctx) {
  // Time the cron trigger was scheduled to fire (milliseconds since epoch)
  const now = new Date(event.scheduledTime);
  const hour = now.getUTCHours();
  const dayOfWeek = now.getUTCDay(); // 0=Sun, 1=Mon
  if (hour === 2 && dayOfWeek === 1) {
    // Monday 2 AM UTC: Apple crawl
    await handleTaxonomyCrawl(env, "apple");
  }
  if (hour === 3 && dayOfWeek === 1) {
    // Monday 3 AM UTC: Google Play crawl
    await handleTaxonomyCrawl(env, "google_play");
  }
}
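One thing to watch with time-based checks: the weekly `0 2 * * 1` trigger and the daily `0 2 * * *` trigger both fire at 02:00 UTC on Mondays, so a handler that inspects only the hour and weekday cannot tell them apart. In Cloudflare Workers the scheduled event also exposes the matching expression as `event.cron`. The sketch below dispatches on that value instead. The job names and the `jobsForCron` helper are hypothetical, and it is worth confirming against the platform's cron documentation which weekday the number `1` denotes.

```javascript
// Maps each cron expression from wrangler.toml to the jobs it should start.
// Job names are illustrative labels, not identifiers from the codebase.
const CRON_JOBS = {
  "0 2 * * 1": ["apple_crawl"],
  "0 3 * * 1": ["google_play_crawl"],
  "0 2 * * *": ["daily_subscriptions"],
  "0 10 * * *": ["serp_tracking"],
};

// Returns the jobs for the expression that fired, or an empty list if unknown.
function jobsForCron(cron) {
  return CRON_JOBS[cron] ?? [];
}

// Usage inside the scheduled handler:
// for (const job of jobsForCron(event.cron)) { /* start the job */ }
```

Dispatching on the expression keeps each trigger bound to exactly one job, even when two schedules coincide.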
Queue Configuration
Queues (wrangler.toml)
| Queue | Purpose | Concurrency |
|---|---|---|
| rankfabric-tasks-v2 | Category crawl tasks | 20 |
| app-info-fetch | App detail fetching | 10 |
| shelf-deep-crawl | Room/story deep crawls | 10 |
DRAIN_MODE
Emergency stop for runaway crawls:
// In src/index.js queue handler
const DRAIN_MODE = false; // Set to true to purge queue
When true, all queue messages are ACKed without processing.
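The drain behaviour might look roughly like the consumer below. It is a sketch, not the handler in src/index.js. `handleBatch` and the `processMessage` callback are hypothetical names, and the retry-on-error branch is an assumption about how failures are handled. `batch.ackAll()`, `message.ack()`, and `message.retry()` are the acknowledgement methods of the Cloudflare Queues consumer API.

```javascript
// Sketch of a queue consumer with an emergency drain switch.
// `processMessage` stands in for the real per-message consumer logic.
async function handleBatch(batch, drainMode, processMessage) {
  if (drainMode) {
    // Acknowledge everything so nothing is processed or redelivered.
    batch.ackAll();
    return;
  }
  for (const message of batch.messages) {
    try {
      await processMessage(message.body);
      message.ack();
    } catch (err) {
      // Assumption: failed messages are handed back for redelivery.
      message.retry();
    }
  }
}
```

Because the drain check happens before any message is read, flipping the flag and redeploying empties the queue without triggering further crawl work.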
Category Types
Apple
| Type | Count | Description |
|---|---|---|
| grouping | ~50 | Main navigation categories (Health & Fitness, Productivity) |
| room | ~300 | Curated collections ("Best Running Apps") |
| chart | ~80 | Top Free, Top Paid per genre |
| story | ~60 | Editorial feature articles |
| hero | ~50 | Featured carousel slots |
Google Play
| Type | Count | Description |
|---|---|---|
| grouping | ~50 | Main categories |
| chart | ~320 | Charts per category (topselling_free, topselling_paid, topgrossing) |
Data Sources
Apple
- Direct HTML scrape - Grouping pages (free)
- ZenRows - Room pages when blocked ($0.001/request)
- RSS Feed - Chart rankings (free, up to 200 apps)
- iTunes API - App metadata (free)
Google Play
- DataForSEO - App lists and details ($0.02-0.05/request)
Recommendations System
How It Works
- Ranking apps (from charts/shelves) are queued with `crawl_depth: 1`
- `app-details-consumer.js` fetches app details including `similar_apps`
- If `crawl_depth > 0`:
  - Save each similar app to the `app_recommendations` table
  - Queue similar apps with `crawl_depth: 0` for their details
- Similar apps get fetched but DON'T process their own similar apps
app_recommendations Table
CREATE TABLE app_recommendations (
id TEXT PRIMARY KEY,
source_app_id TEXT, -- The ranking app
recommended_app_id TEXT, -- The similar app
platform TEXT,
section TEXT, -- 'similar'
position INTEGER,
first_seen INTEGER,
last_seen INTEGER,
times_seen INTEGER,
is_active INTEGER
);
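The `first_seen`, `last_seen`, and `times_seen` columns suggest an upsert on each crawl. One way to express that is sketched below, using SQLite's `ON CONFLICT ... DO UPDATE` clause. Several details are assumptions rather than facts from this document: the `id` format (platform, source, and recommended IDs joined by colons), timestamps stored as epoch milliseconds, and the helper name `recommendationUpsert`.

```javascript
// Hypothetical helper: builds an upsert statement for one recommendation row.
// Assumptions: composite string id, epoch-millisecond timestamps.
function recommendationUpsert(rec, nowMs = Date.now()) {
  const id = `${rec.platform}:${rec.source_app_id}:${rec.recommended_app_id}`;
  const sql = `
    INSERT INTO app_recommendations
      (id, source_app_id, recommended_app_id, platform, section, position,
       first_seen, last_seen, times_seen, is_active)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1, 1)
    ON CONFLICT(id) DO UPDATE SET
      position   = excluded.position,
      last_seen  = excluded.last_seen,
      times_seen = app_recommendations.times_seen + 1,
      is_active  = 1`;
  // first_seen is only written on insert; updates leave it untouched.
  const params = [
    id, rec.source_app_id, rec.recommended_app_id, rec.platform,
    "similar", rec.position, nowMs, nowMs,
  ];
  return { sql, params };
}
```

If the database is Cloudflare D1 bound as `env.DB` (an assumption), the statement could be run with `env.DB.prepare(sql).bind(...params).run()`.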
Troubleshooting
Crawl Not Running
- Check DRAIN_MODE is `false` in src/index.js
- Check cron schedule in wrangler.toml
- Check `GET /api/admin/crawl-runs` for recent runs
Missing Rankings
- Check category exists: `GET /api/app-store/categories?platform=apple`
- Check crawl status: `GET /api/admin/queue-status`
- Verify app exists in `apps` table
Recommendations Empty
- Verify `crawl_depth: 1` is being passed when queuing ranking apps
- Check `app-details-consumer.js` logs for "Saved X recommendations"
- Query: `SELECT COUNT(*) FROM app_recommendations`