Crawl Management & Job Scheduling
Complete guide to managing app store crawls, scheduled jobs, and queue operations.
Overview
The crawl management system handles:
- Scheduled jobs - Recurring cron-based crawls
- Manual crawls - On-demand category/platform crawls
- Queue management - Monitor and control crawl queue
- Job runs - Track history and status of crawls
- Rate limiting - Platform-specific delays and concurrency
Job Types
1. Admin Catalog Refresh
Recurring job that crawls all categories for a platform to maintain fresh catalog data.
Purpose: Keep the app catalog up to date for search and recommendations
Default Config:
{
"id": "admin_google_play_catalog",
"name": "Google Play Catalog Refresh",
"type": "admin_catalog_refresh",
"cron_schedule": "0 0 * * 0", // Weekly, Sundays midnight UTC
"is_enabled": true,
"config": {
"platform": "google_play",
"all_categories": true,
"related_app_depth": 0,
"chart_types": ["topselling_free"],
"device": "phone"
}
}
Configurable Parameters:
- cron_schedule - When to run (cron syntax)
- is_enabled - Enable/disable without deleting
- related_app_depth - 0 (category only), 1 (+ related), 2 (recursive)
- chart_types - Which charts to crawl
- device - Device type (phone/tablet/iphone/ipad)
- limit - Optional: limit the number of categories (for testing)
2. Customer App Tracking
Daily job that tracks specific apps for paying customers.
Purpose: Monitor app rankings and metadata for customer subscriptions
Default Config:
{
"id": "customer_google_play_tracking",
"name": "Customer: Google Play App Tracking",
"type": "customer_app_tracking",
"cron_schedule": "0 2 * * *", // Daily at 2 AM UTC
"is_enabled": true,
"config": {
"platform": "google_play"
}
}
How it works:
- Queries the tracked_apps table for active customer apps
- Crawls each app's category rankings
- Updates app_category_rankings and app_metadata_snapshots
- Billing: per app per day
Managing Jobs
List All Jobs
Endpoint: GET /api/jobs
curl https://your-worker.workers.dev/api/jobs
Response:
{
"jobs": [
{
"id": "admin_google_play_catalog",
"name": "Google Play Catalog Refresh",
"type": "admin_catalog_refresh",
"cron_schedule": "0 0 * * 0",
"is_enabled": true,
"config": {...},
"last_run": "2024-01-14T00:00:00Z",
"next_run": "2024-01-21T00:00:00Z",
"last_run_status": "completed"
}
]
}
Update Job Configuration
Endpoint: PUT /api/jobs/{jobId}
curl -X PUT https://your-worker.workers.dev/api/jobs/admin_google_play_catalog \
-H "Content-Type: application/json" \
-d '{
"cron_schedule": "0 2 * * *",
"is_enabled": true,
"config": {
"related_app_depth": 1
}
}'
Partial updates supported - Only include fields you want to change.
Response:
{
"success": true,
"job": {
"id": "admin_google_play_catalog",
"cron_schedule": "0 2 * * *",
"config": {
"platform": "google_play",
"all_categories": true,
"related_app_depth": 1,
"chart_types": ["topselling_free"],
"device": "phone"
},
"updated_at": 1705276800000
}
}
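Under the hood, a partial update only needs a shallow merge at the top level plus a key-by-key merge of config. A minimal sketch of that merge, assuming plain-object job records (applyJobPatch is a hypothetical helper, not the actual handler):
// Hypothetical helper: apply a partial update to an existing job record.
// Top-level fields in the patch replace existing values; the nested config
// object is merged key-by-key, so omitted config keys keep their values.
function applyJobPatch(existingJob, patch) {
  return {
    ...existingJob,
    ...patch,
    config: { ...existingJob.config, ...(patch.config ?? {}) },
    updated_at: Date.now(),
  };
}
This matches the response above: only related_app_depth changed inside config, while platform and the other config keys were preserved.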
Trigger Job Manually
Run a job immediately without waiting for its cron schedule.
Endpoint: POST /api/jobs/{jobId}/run
curl -X POST https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/run
Response:
{
"success": true,
"run_id": "run_abc123",
"queued_categories": 48,
"estimated_duration": "30-45 minutes"
}
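The same trigger-and-wait flow can be scripted against the run endpoint above and the runs endpoint documented next. A sketch (the base URL is a placeholder, and any run status other than "completed" is an assumption):
// Trigger a job run, then poll its run history until it reaches a final state.
const BASE = "https://your-worker.workers.dev"; // placeholder

async function runJobAndWait(jobId) {
  const res = await fetch(`${BASE}/api/jobs/${jobId}/run`, { method: "POST" });
  const { run_id } = await res.json();
  for (;;) {
    const { runs } = await fetch(`${BASE}/api/jobs/${jobId}/runs`).then(r => r.json());
    const run = runs.find(r => r.id === run_id);
    if (run && (run.status === "completed" || run.status === "failed")) return run;
    await new Promise(resolve => setTimeout(resolve, 60_000)); // re-check every minute
  }
}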
View Job History
Endpoint: GET /api/jobs/{jobId}/runs
curl https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/runs
Response:
{
"job_id": "admin_google_play_catalog",
"runs": [
{
"id": "run_abc123",
"job_id": "admin_google_play_catalog",
"status": "completed",
"started_at": "2024-01-14T00:00:00Z",
"completed_at": "2024-01-14T00:42:15Z",
"metadata": {
"categories_processed": 48,
"apps_discovered": 1440,
"errors": 0
}
}
]
}
View Job Queue
See what's currently queued for a job.
Endpoint: GET /api/jobs/{jobId}/queue
curl https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/queue
Response:
{
"job_id": "admin_google_play_catalog",
"queue_items": [
{
"id": "queue_item_123",
"category_id": "SOCIAL",
"status": "processing",
"created_at": "2024-01-14T00:00:00Z",
"started_at": "2024-01-14T00:01:00Z"
},
{
"id": "queue_item_124",
"category_id": "PRODUCTIVITY",
"status": "pending",
"created_at": "2024-01-14T00:00:01Z"
}
]
}
Manual Crawls
Start One-Off Crawl
Trigger an immediate crawl without creating a scheduled job.
Endpoint: POST /api/admin/start-crawl
curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
-H "Content-Type: application/json" \
-d '{
"platform": "google_play",
"chart_types": ["topselling_free"],
"device": "phone",
"limit": 5,
"related_app_depth": 0
}'
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| platform | string | Yes | apple or google_play |
| chart_types | array | No | Google Play only: ["topselling_free", "topselling_paid", "topgrossing"] |
| device | string | No | phone, tablet, iphone, ipad |
| limit | number | No | Limit number of categories (for testing) |
| category_ids | array | No | Specific categories to crawl |
| related_app_depth | number | No | 0 (default), 1, or 2 |
Response:
{
"success": true,
"queued": 5,
"categories": [
{"id": "SOCIAL", "name": "Social"},
{"id": "PRODUCTIVITY", "name": "Productivity"},
{"id": "COMMUNICATION", "name": "Communication"},
{"id": "ENTERTAINMENT", "name": "Entertainment"},
{"id": "TOOLS", "name": "Tools"}
],
"estimated_apps": 150,
"estimated_duration": "5-10 minutes"
}
Crawl Specific Categories
curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
-H "Content-Type: application/json" \
-d '{
"platform": "google_play",
"category_ids": ["SOCIAL", "PRODUCTIVITY"],
"chart_types": ["topselling_free"],
"related_app_depth": 1
}'
Test Crawl (Single Category)
Quick test with minimal apps.
curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
-H "Content-Type: application/json" \
-d '{
"platform": "apple",
"device": "iphone",
"limit": 1,
"related_app_depth": 0
}'
Queue Status & Monitoring
Overall Queue Status
Endpoint: GET /api/admin/queue-status
curl https://your-worker.workers.dev/api/admin/queue-status
Response:
{
"last_24h": {
"total_jobs": 148,
"by_platform": {
"google_play": 48,
"apple": 100
},
"by_category": {
"SOCIAL": 1,
"apple_grouping_25188": 1
},
"apps_discovered": 3360,
"apps_updated": 3360,
"errors": 0
},
"queue": {
"pending": 0,
"processing": 3,
"completed": 145,
"failed": 0
},
"rate_limiting": {
"apple_delay_seconds": 4,
"google_play_delay_seconds": 8,
"current_concurrency": 3
}
}
Real-Time Queue Metrics
View current queue depth and processing rate.
Endpoint: GET /api/admin/queue-metrics
curl https://your-worker.workers.dev/api/admin/queue-metrics
Response:
{
"main_queue": {
"name": "rankfabric-tasks",
"pending": 12,
"processing": 3,
"completed_last_hour": 48,
"failed_last_hour": 0,
"avg_processing_time_ms": 8500
},
"app_details_queue": {
"name": "app-details-fetch",
"pending": 124,
"processing": 3,
"completed_last_hour": 67,
"failed_last_hour": 1
},
"clickhouse_queue": {
"name": "clickhouse-ingestion",
"pending": 0,
"processing": 0,
"completed_last_hour": 145
}
}
Related App Depth Explained
Controls how deep the crawler goes beyond category pages.
Depth 0: Category Pages Only (SAFE)
What it does:
- Crawls only the category listing page
- Gets ~30 top apps per category
Volume:
- Google Play: 48 categories × 30 apps = 1,440 apps
- Apple: 64 categories × 30 apps = 1,920 apps
- Total: ~3,360 apps
Duration: 30-45 minutes
Use case: Regular catalog refresh, low resource usage
Depth 1: Category + Related Apps
What it does:
- Crawls category listing page
- For each app, crawls its detail page
- Discovers apps in "Similar Apps" section
Volume:
- Google Play: 48 categories × ~1,000 apps = 48,000 apps
- Apple: 64 categories × ~300 apps = 19,200 apps
- Total: ~67,000 apps
Duration: 6-8 hours
Use case: Initial catalog build, competitor discovery
Depth 2: Recursive Discovery (DANGEROUS)
What it does:
- Crawls category → apps → related apps → related apps...
- Continues until no new apps found
Volume:
- 1,000,000+ apps (entire app store)
Duration: Days
Use case: Full catalog extraction (not recommended without distributed crawling)
⚠️ Warning: Can exhaust Worker CPU limits, trigger rate limiting, and cost significant money.
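As a mental model, related_app_depth bounds a breadth-first walk over the "Similar Apps" graph. A simplified sketch, with the two fetch helpers passed in as stand-ins for the real crawler functions:
// Depth-bounded discovery: depth 0 stops at the category page, depth 1
// expands each app once, depth 2 keeps expanding until no new apps appear.
async function discoverApps(categoryId, maxDepth, { fetchCategoryApps, fetchRelatedApps }) {
  const seen = new Set();
  let frontier = await fetchCategoryApps(categoryId); // ~30 top apps
  frontier.forEach(id => seen.add(id));

  const unbounded = maxDepth >= 2; // depth 2: recurse to exhaustion
  for (let depth = 1; (unbounded || depth <= maxDepth) && frontier.length > 0; depth++) {
    const next = [];
    for (const appId of frontier) {
      for (const related of await fetchRelatedApps(appId)) {
        if (!seen.has(related)) {
          seen.add(related);
          next.push(related); // only unseen apps are expanded at the next depth
        }
      }
    }
    frontier = next;
  }
  return [...seen];
}
The per-app fan-out is what drives the volume jump between the depths above.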
Rate Limiting & Performance
Platform-Specific Delays
Apple App Store:
- Delay: 4 seconds between requests
- Max concurrency: 3
- Reasoning: Aggressive scraping detection
Google Play:
- Delay: 8 seconds between requests
- Max concurrency: 3
- Reasoning: Very aggressive anti-bot measures
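A sketch of how these per-platform delays can be enforced in the crawler. The limiter itself is illustrative; in Workers, durable pacing typically lives in the queue consumer or a Durable Object rather than isolate-global state:
// Per-platform pacing: wait out the platform's delay before each request.
const PLATFORM_DELAY_MS = { apple: 4_000, google_play: 8_000 };
const lastRequestAt = new Map();

async function politeFetch(platform, url) {
  const earliest = (lastRequestAt.get(platform) ?? 0) + PLATFORM_DELAY_MS[platform];
  const waitMs = earliest - Date.now();
  if (waitMs > 0) await new Promise(resolve => setTimeout(resolve, waitMs));
  lastRequestAt.set(platform, Date.now());
  return fetch(url);
}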
Queue Configuration
# Main task queue
[[queues.consumers]]
queue = "rankfabric-tasks"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 20 # For SERP tracking; app crawls self-limit
# App details queue (separate, slower)
[[queues.consumers]]
queue = "app-details-fetch"
max_batch_size = 5
max_batch_timeout = 30
max_retries = 2
max_concurrency = 3 # Slow to avoid bans
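These consumers map onto the standard Cloudflare Queues handler. A minimal sketch of the rankfabric-tasks consumer, with handleTask standing in for the real dispatch logic and an assumed message shape:
// Stand-in for the real task dispatch (crawl a category, track an app, ...).
async function handleTask(task, env) {
  // task is assumed to look like { type, platform, category_id, ... }
}

export default {
  async queue(batch, env, ctx) {
    // With max_batch_size = 1, each invocation sees a single message.
    for (const msg of batch.messages) {
      try {
        await handleTask(msg.body, env);
        msg.ack();
      } catch (err) {
        console.error("task failed", err);
        msg.retry(); // redelivered up to max_retries, then routed to the DLQ
      }
    }
  },
};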
Performance Tuning
To speed up crawls:
- Increase max_concurrency (but watch for bans)
- Decrease rate-limit delays (risky)
- Use a proxy service (Oxylabs)
To reduce resource usage:
- Decrease related_app_depth
- Use the limit parameter to crawl fewer categories
- Reduce cron frequency
Proxy Configuration (Optional)
Add an Oxylabs proxy to reduce ban risk and speed up crawls.
Setup
wrangler secret put OXYLABS_USERNAME
wrangler secret put OXYLABS_PASSWORD
Automatic Usage
The Worker automatically uses the proxy if credentials are present:
// In crawler code
const useProxy = env.OXYLABS_USERNAME && env.OXYLABS_PASSWORD;
if (useProxy) {
// Route through Oxylabs residential proxies
// Reduces ban risk, allows higher concurrency
}
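Note that fetch in Workers has no built-in proxy support, so "routing through" a provider usually means calling its HTTP API instead of the store URL directly. A hedged sketch assuming Oxylabs' real-time scraper endpoint; verify the URL and payload against their current docs:
// Sketch: fetch a page via Oxylabs when credentials exist, else go direct.
// The endpoint and payload shape below are assumptions, not confirmed API.
async function fetchPage(url, env) {
  if (!(env.OXYLABS_USERNAME && env.OXYLABS_PASSWORD)) {
    return fetch(url); // direct fetch: subject to the stricter rate limits above
  }
  const auth = btoa(`${env.OXYLABS_USERNAME}:${env.OXYLABS_PASSWORD}`);
  return fetch("https://realtime.oxylabs.io/v1/queries", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Basic ${auth}`,
    },
    body: JSON.stringify({ source: "universal", url }),
  });
}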
Benefits
- Lower ban risk (rotating IPs)
- Higher concurrency (up to 10)
- Better reliability
Costs
- Oxylabs pricing: ~$500-2000/month depending on volume
Cron Schedule Examples
Run Daily
{
"cron_schedule": "0 2 * * *" // 2 AM UTC every day
}
Run Weekly
{
"cron_schedule": "0 0 * * 0" // Midnight UTC every Sunday
}
Run Every 6 Hours
{
"cron_schedule": "0 */6 * * *" // Every 6 hours
}
Run Monthly
{
"cron_schedule": "0 0 1 * *" // Midnight UTC on 1st of month
}
Disable Cron (Manual Only)
{
"cron_schedule": null,
"is_enabled": false
}
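These expressions are evaluated when the Worker's scheduled handler fires. One simple pattern, assuming each job's cron_schedule is also registered as a trigger in wrangler.toml so event.cron can be matched directly (the jobs query and runJob are illustrative):
// Stand-in for the real job runner (enqueues the crawl work for a job).
async function runJob(job, env) {}

export default {
  async scheduled(event, env, ctx) {
    // event.cron is the expression (from wrangler.toml triggers) that fired.
    const { results } = await env.DB
      .prepare("SELECT * FROM jobs WHERE is_enabled = 1 AND cron_schedule = ?1")
      .bind(event.cron)
      .all();
    for (const job of results) {
      ctx.waitUntil(runJob(job, env));
    }
  },
};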
Customer App Tracking
Track specific apps for paying customers.
Setup Customer Subscription
INSERT INTO customer_subscriptions (
id, customer_id, project_id, subscription_type, is_active, started_at, created_at
) VALUES (
'sub_123',
'cust_456',
'proj_789',
'app_tracking',
1,
unixepoch(),
unixepoch()
);
Add Apps to Track
INSERT INTO tracked_apps (
id, project_id, app_id, platform, is_active, created_at
) VALUES (
'track_001',
'proj_789',
'com.facebook.katana',
'google_play',
1,
unixepoch()
);
How Daily Tracking Works
- Cron triggers at 2 AM UTC (customer timezone configurable)
- Query active subscriptions:
SELECT DISTINCT ta.* FROM tracked_apps ta
JOIN customer_subscriptions cs ON ta.project_id = cs.project_id
WHERE cs.subscription_type = 'app_tracking'
  AND cs.is_active = 1
  AND ta.is_active = 1
- For each app:
  - Crawl the app's category rankings
  - Update app_category_rankings
  - Snapshot metadata to app_metadata_snapshots
- Billing: count 1 unit per app per day
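A minimal sketch of that daily pass, assuming a D1 binding named DB and a queue producer binding named TASK_QUEUE (both names are illustrative, as is the message shape):
// Sketch: load active tracked apps and enqueue one crawl task per app.
async function runCustomerTracking(env) {
  const { results } = await env.DB.prepare(`
    SELECT DISTINCT ta.* FROM tracked_apps ta
    JOIN customer_subscriptions cs ON ta.project_id = cs.project_id
    WHERE cs.subscription_type = 'app_tracking'
      AND cs.is_active = 1 AND ta.is_active = 1
  `).all();

  for (const app of results) {
    // Each message becomes one ranking crawl + metadata snapshot (1 billing unit).
    await env.TASK_QUEUE.send({
      type: "track_app",
      platform: app.platform,
      app_id: app.app_id,
      project_id: app.project_id,
    });
  }
  return results.length; // number of apps queued (and billed) today
}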
Troubleshooting
Job Not Running
Check:
- Job is enabled: GET /api/jobs/{jobId}
- Cron schedule is valid
- Worker cron trigger is configured in wrangler.toml
Verify cron execution:
wrangler tail --format pretty | grep "cron"
Crawl Stuck in Queue
Symptoms:
- Queue items show "processing" for hours
- queue_status shows a high pending count
Causes:
- Rate limiting delays (expected for large crawls)
- App store blocking requests
- Worker timeout (30s limit)
Check queue consumer logs:
wrangler tail --format pretty | grep "crawl_category"
Solutions:
- Wait (rate limiting is working as intended)
- Reduce concurrency if getting blocked
- Add proxy credentials to avoid bans
Apps Not Appearing in Catalog
Check database:
wrangler d1 execute rankfabric_db --command \
"SELECT COUNT(*) FROM apps WHERE platform = 'google_play'"
Check recent rankings:
wrangler d1 execute rankfabric_db --command \
"SELECT category_id, COUNT(*) as apps FROM app_category_rankings
WHERE scraped_at > unixepoch() - 86400
GROUP BY category_id"
Solutions:
- Re-run the crawl with related_app_depth: 1
- Check that category IDs are correct
- Verify app store URLs are accessible
High Error Rate
Check dead letter queue:
# View DLQ in Cloudflare dashboard
# Or query via wrangler
wrangler queues consumer rankfabric-tasks --dlq
Common errors:
- 429 Rate Limit (reduce concurrency)
- Timeout (decrease batch size)
- Parse errors (app store HTML changed)
Fix:
- Check error logs for specific failure
- Adjust rate limits/concurrency
- Update parser if HTML structure changed
- Add proxy if getting blocked
Best Practices
Regular Catalog Refresh
Recommended:
- Weekly crawl with depth: 0 for all categories
- Monthly crawl with depth: 1 for deep discovery
Schedule:
{
"weekly_refresh": {
"cron_schedule": "0 0 * * 0",
"config": {"related_app_depth": 0}
},
"monthly_deep_crawl": {
"cron_schedule": "0 0 1 * *",
"config": {"related_app_depth": 1}
}
}
Testing New Categories
Before adding a category to a recurring job:
# Test single category first
curl -X POST /api/admin/start-crawl \
-d '{"platform": "google_play", "category_ids": ["NEW_CATEGORY"]}'
# Verify results
curl "/api/app-store/top-apps?category_id=NEW_CATEGORY&platform=google_play"
# Add to job config once validated
curl -X PUT /api/jobs/admin_google_play_catalog \
-d '{"config": {"category_ids": [...include NEW_CATEGORY]}}'
Monitoring Health
Daily checks:
- Queue status: GET /api/admin/queue-status
- Failed jobs: check DLQ count
- App count: verify growth in the apps table
Weekly checks:
- Job history: Review completion times
- Error patterns: Identify recurring failures
- Coverage: Ensure all categories crawled
Setup alerts:
- DLQ count > 10
- Queue pending > 100 for >1 hour
- Job completion time > 2x average
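Those alert rules can be scripted against queue-status; a sketch using the thresholds above (the base URL and notify callback are placeholders, and the "for >1 hour" condition would need persisted state that this sketch omits):
// Poll queue-status and report the alert conditions listed above.
const BASE = "https://your-worker.workers.dev"; // placeholder

async function checkQueueHealth(notify) {
  const status = await fetch(`${BASE}/api/admin/queue-status`).then(r => r.json());
  const alerts = [];
  if (status.queue.failed > 10) alerts.push(`failed/DLQ count ${status.queue.failed} > 10`);
  if (status.queue.pending > 100) alerts.push(`queue pending ${status.queue.pending} > 100`);
  for (const message of alerts) await notify(message);
  return alerts;
}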
Advanced: Custom Job Types
Create custom job types for specific use cases.
Example: Trending Apps Discovery
INSERT INTO jobs (id, name, type, cron_schedule, is_enabled, config, created_at, updated_at)
VALUES (
'trending_apps_daily',
'Trending Apps Discovery',
'custom_trending_discovery',
'0 12 * * *',
1,
json('{"lookback_days": 7, "min_rank_change": 10}'),
unixepoch(),
unixepoch()
);
Implement the handler in src/lib/cron-jobs.js:
case 'custom_trending_discovery':
// Find apps that jumped 10+ positions in last 7 days
// Enqueue deep crawl for those apps
break;
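A fuller sketch of what that handler body might do, assuming a rank column on app_category_rankings and a TASK_QUEUE producer binding (both assumptions):
// Sketch: find apps whose rank moved by min_rank_change+ positions within
// lookback_days, then enqueue a deep crawl for each. MAX - MIN measures
// movement, not direction; a real handler would compare first vs. last rank.
async function discoverTrendingApps(env, { lookback_days, min_rank_change }) {
  const since = Math.floor(Date.now() / 1000) - lookback_days * 86400;
  const { results } = await env.DB.prepare(`
    SELECT app_id, MAX(rank) - MIN(rank) AS rank_delta
    FROM app_category_rankings
    WHERE scraped_at > ?1
    GROUP BY app_id
    HAVING rank_delta >= ?2
  `).bind(since, min_rank_change).all();

  for (const { app_id } of results) {
    await env.TASK_QUEUE.send({ type: "deep_crawl_app", app_id });
  }
}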
Summary
For regular catalog maintenance:
- Use scheduled jobs with depth: 0
- Run weekly or daily depending on freshness needs
For discovery and backfill:
- Use manual crawls with depth: 1
- Limit to specific categories or use the limit parameter
For customer tracking:
- Set up tracked_apps entries
- Enable the customer app tracking job
- Bill per app per day
Monitor regularly:
- Check queue status
- Review job history
- Watch for errors in DLQ
Remember:
- Start with depth: 0 (safe, fast)
- Add a proxy for high-volume crawling
- Rate limits exist for a reason (avoid bans)