Crawl Management & Job Scheduling

Complete guide to managing app store crawls, scheduled jobs, and queue operations.


Overview

The crawl management system handles:

  • Scheduled jobs - Recurring cron-based crawls
  • Manual crawls - On-demand category/platform crawls
  • Queue management - Monitor and control crawl queue
  • Job runs - Track history and status of crawls
  • Rate limiting - Platform-specific delays and concurrency

Job Types

1. Admin Catalog Refresh

Recurring job that crawls all categories for a platform to maintain fresh catalog data.

Purpose: Keep app catalog up-to-date for search and recommendations

Default Config:

{
  "id": "admin_google_play_catalog",
  "name": "Google Play Catalog Refresh",
  "type": "admin_catalog_refresh",
  "cron_schedule": "0 0 * * 0", // Weekly, Sundays midnight UTC
  "is_enabled": true,
  "config": {
    "platform": "google_play",
    "all_categories": true,
    "related_app_depth": 0,
    "chart_types": ["topselling_free"],
    "device": "phone"
  }
}

Configurable Parameters:

  • cron_schedule - When to run (cron syntax)
  • is_enabled - Enable/disable without deleting
  • related_app_depth - 0 (category only), 1 (+ related), 2 (recursive)
  • chart_types - Which charts to crawl
  • device - Device type (phone/tablet/iphone/ipad)
  • limit - Optional: limit number of categories (for testing)

2. Customer App Tracking

Daily job that tracks specific apps for paying customers.

Purpose: Monitor app rankings and metadata for customer subscriptions

Default Config:

{
  "id": "customer_google_play_tracking",
  "name": "Customer: Google Play App Tracking",
  "type": "customer_app_tracking",
  "cron_schedule": "0 2 * * *", // Daily at 2 AM UTC
  "is_enabled": true,
  "config": {
    "platform": "google_play"
  }
}

How it works:

  1. Queries tracked_apps table for active customer apps
  2. Crawls each app's category rankings
  3. Updates app_category_rankings and app_metadata_snapshots
  4. Billing: Per app per day

Managing Jobs

List All Jobs

Endpoint: GET /api/jobs

curl https://your-worker.workers.dev/api/jobs

Response:

{
  "jobs": [
    {
      "id": "admin_google_play_catalog",
      "name": "Google Play Catalog Refresh",
      "type": "admin_catalog_refresh",
      "cron_schedule": "0 0 * * 0",
      "is_enabled": true,
      "config": {...},
      "last_run": "2024-01-14T00:00:00Z",
      "next_run": "2024-01-21T00:00:00Z",
      "last_run_status": "completed"
    }
  ]
}

Update Job Configuration

Endpoint: PUT /api/jobs/{jobId}

curl -X PUT https://your-worker.workers.dev/api/jobs/admin_google_play_catalog \
  -H "Content-Type: application/json" \
  -d '{
    "cron_schedule": "0 2 * * *",
    "is_enabled": true,
    "config": {
      "related_app_depth": 1
    }
  }'

Partial updates supported - Only include fields you want to change.

Response:

{
  "success": true,
  "job": {
    "id": "admin_google_play_catalog",
    "cron_schedule": "0 2 * * *",
    "config": {
      "platform": "google_play",
      "all_categories": true,
      "related_app_depth": 1,
      "chart_types": ["topselling_free"],
      "device": "phone"
    },
    "updated_at": 1705276800000
  }
}
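The partial-update behavior above can be sketched as a small merge helper. This is a hypothetical illustration — the name `mergeJobUpdate` and the one-level-deep merge of `config` are assumptions, not the Worker's actual code:

```javascript
// Hypothetical sketch of partial job-update merging. Assumes top-level
// fields are replaced and `config` is merged one level deep, so omitted
// config keys are preserved.
function mergeJobUpdate(existing, patch) {
  const merged = { ...existing, ...patch };
  if (patch.config) {
    merged.config = { ...existing.config, ...patch.config };
  }
  merged.updated_at = Date.now();
  return merged;
}

const job = {
  id: "admin_google_play_catalog",
  cron_schedule: "0 0 * * 0",
  config: { platform: "google_play", related_app_depth: 0, device: "phone" },
};
const updated = mergeJobUpdate(job, {
  cron_schedule: "0 2 * * *",
  config: { related_app_depth: 1 },
});
// updated.config keeps platform and device; only related_app_depth changes
```

This is why the PUT example above only sends `related_app_depth` inside `config` yet the response still contains the full config object.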

Trigger Job Manually

Run a job immediately without waiting for cron schedule.

Endpoint: POST /api/jobs/{jobId}/run

curl -X POST https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/run

Response:

{
  "success": true,
  "run_id": "run_abc123",
  "queued_categories": 48,
  "estimated_duration": "30-45 minutes"
}

View Job History

Endpoint: GET /api/jobs/{jobId}/runs

curl https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/runs

Response:

{
  "job_id": "admin_google_play_catalog",
  "runs": [
    {
      "id": "run_abc123",
      "job_id": "admin_google_play_catalog",
      "status": "completed",
      "started_at": "2024-01-14T00:00:00Z",
      "completed_at": "2024-01-14T00:42:15Z",
      "metadata": {
        "categories_processed": 48,
        "apps_discovered": 1440,
        "errors": 0
      }
    }
  ]
}

View Job Queue

See what's currently queued for a job.

Endpoint: GET /api/jobs/{jobId}/queue

curl https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/queue

Response:

{
  "job_id": "admin_google_play_catalog",
  "queue_items": [
    {
      "id": "queue_item_123",
      "category_id": "SOCIAL",
      "status": "processing",
      "created_at": "2024-01-14T00:00:00Z",
      "started_at": "2024-01-14T00:01:00Z"
    },
    {
      "id": "queue_item_124",
      "category_id": "PRODUCTIVITY",
      "status": "pending",
      "created_at": "2024-01-14T00:00:01Z"
    }
  ]
}

Manual Crawls

Start One-Off Crawl

Trigger an immediate crawl without creating a scheduled job.

Endpoint: POST /api/admin/start-crawl

curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "google_play",
    "chart_types": ["topselling_free"],
    "device": "phone",
    "limit": 5,
    "related_app_depth": 0
  }'

Parameters:

  • platform (string, required) - apple or google_play
  • chart_types (array, optional) - Google Play only: ["topselling_free", "topselling_paid", "topgrossing"]
  • device (string, optional) - phone, tablet, iphone, ipad
  • limit (number, optional) - Limit number of categories (for testing)
  • category_ids (array, optional) - Specific categories to crawl
  • related_app_depth (number, optional) - 0 (default), 1, or 2

Response:

{
  "success": true,
  "queued": 5,
  "categories": [
    {"id": "SOCIAL", "name": "Social"},
    {"id": "PRODUCTIVITY", "name": "Productivity"},
    {"id": "COMMUNICATION", "name": "Communication"},
    {"id": "ENTERTAINMENT", "name": "Entertainment"},
    {"id": "TOOLS", "name": "Tools"}
  ],
  "estimated_apps": 150,
  "estimated_duration": "5-10 minutes"
}
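The required/optional rules in the parameter list can be expressed as a small validator. This is a sketch only — `validateStartCrawl` is a hypothetical name, and the Worker's real validation may differ:

```javascript
// Hypothetical request validator mirroring the start-crawl parameters
// documented above (illustration only, not the Worker's actual code).
const PLATFORMS = ["apple", "google_play"];
const DEVICES = ["phone", "tablet", "iphone", "ipad"];

function validateStartCrawl(body) {
  const errors = [];
  if (!PLATFORMS.includes(body.platform)) {
    errors.push("platform must be apple or google_play");
  }
  if (body.device !== undefined && !DEVICES.includes(body.device)) {
    errors.push("device must be phone, tablet, iphone, or ipad");
  }
  if (body.chart_types !== undefined && body.platform !== "google_play") {
    errors.push("chart_types is Google Play only");
  }
  if (body.related_app_depth !== undefined &&
      ![0, 1, 2].includes(body.related_app_depth)) {
    errors.push("related_app_depth must be 0, 1, or 2");
  }
  return errors;
}
```

For example, `validateStartCrawl({ platform: "apple", chart_types: ["topselling_free"] })` flags `chart_types` because that parameter only applies to Google Play.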

Crawl Specific Categories

curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "google_play",
    "category_ids": ["SOCIAL", "PRODUCTIVITY"],
    "chart_types": ["topselling_free"],
    "related_app_depth": 1
  }'

Test Crawl (Single Category)

Quick test with minimal apps.

curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "apple",
    "device": "iphone",
    "limit": 1,
    "related_app_depth": 0
  }'

Queue Status & Monitoring

Overall Queue Status

Endpoint: GET /api/admin/queue-status

curl https://your-worker.workers.dev/api/admin/queue-status

Response:

{
  "last_24h": {
    "total_jobs": 148,
    "by_platform": {
      "google_play": 48,
      "apple": 100
    },
    "by_category": {
      "SOCIAL": 1,
      "apple_grouping_25188": 1
    },
    "apps_discovered": 3360,
    "apps_updated": 3360,
    "errors": 0
  },
  "queue": {
    "pending": 0,
    "processing": 3,
    "completed": 145,
    "failed": 0
  },
  "rate_limiting": {
    "apple_delay_seconds": 4,
    "google_play_delay_seconds": 8,
    "current_concurrency": 3
  }
}

Real-Time Queue Metrics

View current queue depth and processing rate.

Endpoint: GET /api/admin/queue-metrics

curl https://your-worker.workers.dev/api/admin/queue-metrics

Response:

{
  "main_queue": {
    "name": "rankfabric-tasks",
    "pending": 12,
    "processing": 3,
    "completed_last_hour": 48,
    "failed_last_hour": 0,
    "avg_processing_time_ms": 8500
  },
  "app_details_queue": {
    "name": "app-details-fetch",
    "pending": 124,
    "processing": 3,
    "completed_last_hour": 67,
    "failed_last_hour": 1
  },
  "clickhouse_queue": {
    "name": "clickhouse-ingestion",
    "pending": 0,
    "processing": 0,
    "completed_last_hour": 145
  }
}
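A rough way to read these numbers: backlog drain time ≈ pending × average processing time ÷ concurrency. A sketch of that arithmetic, using the main queue's avg_processing_time_ms as a stand-in where a queue does not report its own:

```javascript
// Back-of-envelope backlog estimate from queue metrics (illustrative).
// drainMinutes = pending * avgProcessingMs / concurrency / 60000
function estimateDrainMinutes(pending, avgProcessingMs, concurrency) {
  if (pending === 0) return 0;
  return (pending * avgProcessingMs) / concurrency / 60000;
}

// e.g. 124 pending app-detail fetches at ~8.5s each across 3 consumers:
// 124 * 8500 / 3 ≈ 351,333 ms ≈ 5.9 minutes
```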

Related App Depth

Controls how deep the crawler goes beyond category pages (the related_app_depth parameter).

Depth 0: Category Pages Only (SAFE)

What it does:

  • Crawls only the category listing page
  • Gets ~30 top apps per category

Volume:

  • Google Play: 48 categories × 30 apps = 1,440 apps
  • Apple: 64 categories × 30 apps = 1,920 apps
  • Total: ~3,360 apps

Duration: 30-45 minutes

Use case: Regular catalog refresh, low resource usage


Depth 1: Related App Discovery (MODERATE)

What it does:

  • Crawls category listing page
  • For each app, crawls its detail page
  • Discovers apps in "Similar Apps" section

Volume:

  • Google Play: 48 categories × ~1,000 apps = 48,000 apps
  • Apple: 64 categories × ~300 apps = 19,200 apps
  • Total: ~67,000 apps

Duration: 6-8 hours

Use case: Initial catalog build, competitor discovery


Depth 2: Recursive Discovery (DANGEROUS)

What it does:

  • Crawls category → apps → related apps → related apps...
  • Continues until no new apps found

Volume:

  • 1,000,000+ apps (entire app store)

Duration: Days

Use case: Full catalog extraction (not recommended without distributed crawling)

⚠️ Warning: Can exhaust Worker CPU limits, trigger rate limiting, and cost significant money.
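The volume figures above reduce to simple arithmetic. A sketch using the approximate per-category counts quoted in this section (illustrative estimates, not measured values):

```javascript
// Back-of-envelope app-volume estimate per depth, using the approximate
// per-category counts quoted above (illustrative only).
const CATEGORY_COUNTS = { google_play: 48, apple: 64 };
const APPS_PER_CATEGORY = {
  0: { google_play: 30, apple: 30 },     // category page only
  1: { google_play: 1000, apple: 300 },  // + related-app discovery
};

function estimateApps(depth) {
  if (depth >= 2) return Infinity; // recursive discovery: effectively unbounded
  const per = APPS_PER_CATEGORY[depth];
  return CATEGORY_COUNTS.google_play * per.google_play +
         CATEGORY_COUNTS.apple * per.apple;
}

// estimateApps(0) → 3360; estimateApps(1) → 67200
```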


Rate Limiting & Performance

Platform-Specific Delays

Apple App Store:

  • Delay: 4 seconds between requests
  • Max concurrency: 3
  • Reasoning: Aggressive scraping detection

Google Play:

  • Delay: 8 seconds between requests
  • Max concurrency: 3
  • Reasoning: Very aggressive anti-bot measures
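A minimal per-platform throttle that enforces these delays might look like the following. This is a sketch, not the Worker's actual rate limiter:

```javascript
// Sketch of a per-platform throttle matching the delays above
// (hypothetical helper; the real limiter may differ).
const DELAY_MS = { apple: 4000, google_play: 8000 };

function makeThrottle() {
  const nextAllowed = { apple: 0, google_play: 0 };
  // Returns how long a request issued at `now` (ms) must wait,
  // and books the slot for the platform's next request.
  return function reserve(platform, now) {
    const wait = Math.max(0, nextAllowed[platform] - now);
    nextAllowed[platform] = now + wait + DELAY_MS[platform];
    return wait;
  };
}

const reserve = makeThrottle();
reserve("apple", 0);    // → 0 (first request goes immediately)
reserve("apple", 1000); // → 3000 (must wait out the 4 s gap)
```

Each platform is throttled independently, so Apple and Google Play crawls do not slow each other down.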

Queue Configuration

# Main task queue
[[queues.consumers]]
queue = "rankfabric-tasks"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 20 # For SERP tracking; app crawls self-limit

# App details queue (separate, slower)
[[queues.consumers]]
queue = "app-details-fetch"
max_batch_size = 5
max_batch_timeout = 30
max_retries = 2
max_concurrency = 3 # Slow to avoid bans

Performance Tuning

To speed up crawls:

  1. Increase max_concurrency (but watch for bans)
  2. Decrease rate limit delays (risky)
  3. Use proxy service (Oxylabs)

To reduce resource usage:

  1. Decrease related_app_depth
  2. Use limit parameter to crawl fewer categories
  3. Reduce cron frequency

Proxy Configuration (Optional)

Add Oxylabs proxy to reduce ban risk and speed up crawls.

Setup

wrangler secret put OXYLABS_USERNAME
wrangler secret put OXYLABS_PASSWORD

Automatic Usage

The Worker automatically uses the proxy when credentials are present:

// In crawler code
const useProxy = env.OXYLABS_USERNAME && env.OXYLABS_PASSWORD;

if (useProxy) {
  // Route through Oxylabs residential proxies
  // Reduces ban risk, allows higher concurrency
}

Benefits

  • Lower ban risk (rotating IPs)
  • Higher concurrency (up to 10)
  • Better reliability

Costs

  • Oxylabs pricing: ~$500-2000/month depending on volume

Cron Schedule Examples

Run Daily

{
  "cron_schedule": "0 2 * * *" // 2 AM UTC every day
}

Run Weekly

{
  "cron_schedule": "0 0 * * 0" // Midnight UTC every Sunday
}

Run Every 6 Hours

{
  "cron_schedule": "0 */6 * * *" // Every 6 hours
}

Run Monthly

{
  "cron_schedule": "0 0 1 * *" // Midnight UTC on 1st of month
}

Disable Cron (Manual Only)

{
  "cron_schedule": null,
  "is_enabled": false
}
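A quick sanity check for these five-field schedules can catch malformed strings before saving a job. This is a lightweight format check, not a full cron parser (`looksLikeCron` is a hypothetical helper):

```javascript
// Lightweight sanity check for 5-field cron strings.
// Only verifies field count and a character whitelist;
// it does not validate ranges or compute next-run times.
function looksLikeCron(expr) {
  if (expr === null) return true; // null disables the schedule
  const fields = expr.trim().split(/\s+/);
  return fields.length === 5 &&
         fields.every((f) => /^[\d*,/-]+$/.test(f));
}

// looksLikeCron("0 */6 * * *") → true
// looksLikeCron("0 0 * *")     → false (only 4 fields)
```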

Customer App Tracking

Track specific apps for paying customers.

Setup Customer Subscription

INSERT INTO customer_subscriptions (
  id, customer_id, project_id, subscription_type, is_active, started_at, created_at
) VALUES (
  'sub_123',
  'cust_456',
  'proj_789',
  'app_tracking',
  1,
  unixepoch(),
  unixepoch()
);

Add Apps to Track

INSERT INTO tracked_apps (
  id, project_id, app_id, platform, is_active, created_at
) VALUES (
  'track_001',
  'proj_789',
  'com.facebook.katana',
  'google_play',
  1,
  unixepoch()
);

How Daily Tracking Works

  1. Cron triggers at 2 AM UTC (customer timezone configurable)
  2. Query active subscriptions:
    SELECT DISTINCT ta.* FROM tracked_apps ta
    JOIN customer_subscriptions cs ON ta.project_id = cs.project_id
    WHERE cs.subscription_type = 'app_tracking'
    AND cs.is_active = 1
    AND ta.is_active = 1
  3. For each app:
    • Crawl app's category rankings
    • Update app_category_rankings
    • Snapshot metadata to app_metadata_snapshots
  4. Billing: Count as 1 unit per app per day
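The billing rule in step 4 is simple to state in code. A sketch, assuming billing counts only active tracked apps (`dailyBillingUnits` is a hypothetical helper, not the actual billing code):

```javascript
// Billing sketch for the daily tracking run: 1 unit per tracked app
// per day. Hypothetical helper, not the real billing implementation.
function dailyBillingUnits(trackedApps) {
  // Only active apps are crawled, so only they are billed.
  return trackedApps.filter((a) => a.is_active === 1).length;
}

const apps = [
  { app_id: "com.facebook.katana", is_active: 1 },
  { app_id: "com.example.paused", is_active: 0 },
];
// dailyBillingUnits(apps) → 1
```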

Troubleshooting

Job Not Running

Check:

  1. Job is enabled: GET /api/jobs/{jobId}
  2. Cron schedule is valid
  3. Worker cron trigger is configured in wrangler.toml

Verify cron execution:

wrangler tail --format pretty | grep "cron"

Crawl Stuck in Queue

Symptoms:

  • Queue items show "processing" for hours
  • queue_status shows high pending count

Causes:

  • Rate limiting delays (expected for large crawls)
  • App store blocking requests
  • Worker timeout (30s limit)

Check queue consumer logs:

wrangler tail --format pretty | grep "crawl_category"

Solutions:

  • Wait (rate limiting is working as intended)
  • Reduce concurrency if getting blocked
  • Add proxy credentials to avoid bans

Apps Not Appearing in Catalog

Check database:

wrangler d1 execute rankfabric_db --command \
"SELECT COUNT(*) FROM apps WHERE platform = 'google_play'"

Check recent rankings:

wrangler d1 execute rankfabric_db --command \
"SELECT category_id, COUNT(*) as apps FROM app_category_rankings
WHERE scraped_at > unixepoch() - 86400
GROUP BY category_id"

Solutions:

  • Re-run crawl with related_app_depth: 1
  • Check if category IDs are correct
  • Verify app store URLs are accessible

High Error Rate

Check dead letter queue:

# View DLQ in Cloudflare dashboard
# Or query via wrangler
wrangler queues consumer rankfabric-tasks --dlq

Common errors:

  • 429 Rate Limit (reduce concurrency)
  • Timeout (decrease batch size)
  • Parse errors (app store HTML changed)

Fix:

  1. Check error logs for specific failure
  2. Adjust rate limits/concurrency
  3. Update parser if HTML structure changed
  4. Add proxy if getting blocked

Best Practices

Regular Catalog Refresh

Recommended:

  • Weekly crawl with depth: 0 for all categories
  • Monthly crawl with depth: 1 for deep discovery

Schedule:

{
  "weekly_refresh": {
    "cron_schedule": "0 0 * * 0",
    "config": {"related_app_depth": 0}
  },
  "monthly_deep_crawl": {
    "cron_schedule": "0 0 1 * *",
    "config": {"related_app_depth": 1}
  }
}

Testing New Categories

Before adding to recurring job:

# Test single category first
curl -X POST /api/admin/start-crawl \
-d '{"platform": "google_play", "category_ids": ["NEW_CATEGORY"]}'

# Verify results
curl "/api/app-store/top-apps?category_id=NEW_CATEGORY&platform=google_play"

# Add to job config once validated
curl -X PUT /api/jobs/admin_google_play_catalog \
-d '{"config": {"category_ids": [...include NEW_CATEGORY]}}'

Monitoring Health

Daily checks:

  1. Queue status: GET /api/admin/queue-status
  2. Failed jobs: Check DLQ count
  3. App count: Verify growth in apps table

Weekly checks:

  1. Job history: Review completion times
  2. Error patterns: Identify recurring failures
  3. Coverage: Ensure all categories crawled

Set up alerts:

  • DLQ count > 10
  • Queue pending > 100 for >1 hour
  • Job completion time > 2x average
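These thresholds can be encoded in a small health-check helper. A sketch under the assumptions above (`checkAlerts` and its field names are hypothetical, not an existing API):

```javascript
// Hypothetical health-check helper encoding the alert thresholds above:
// DLQ > 10, queue pending > 100 for over 1 hour, run time > 2x average.
function checkAlerts({ dlqCount, queuePending, pendingHighSinceMs, runMs, avgRunMs, now }) {
  const alerts = [];
  if (dlqCount > 10) alerts.push("dlq_count_high");
  if (queuePending > 100 && now - pendingHighSinceMs > 60 * 60 * 1000) {
    alerts.push("queue_backlog_stuck");
  }
  if (runMs > 2 * avgRunMs) alerts.push("job_slow");
  return alerts;
}
```

Such a helper could run on its own cron trigger and push results to whatever notification channel you use.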

Advanced: Custom Job Types

Create custom job types for specific use cases.

INSERT INTO jobs (id, name, type, cron_schedule, is_enabled, config, created_at, updated_at)
VALUES (
  'trending_apps_daily',
  'Trending Apps Discovery',
  'custom_trending_discovery',
  '0 12 * * *',
  1,
  json('{"lookback_days": 7, "min_rank_change": 10}'),
  unixepoch(),
  unixepoch()
);

Implement handler in src/lib/cron-jobs.js:

case 'custom_trending_discovery':
  // Find apps that jumped 10+ positions in last 7 days
  // Enqueue deep crawl for those apps
  break;

Summary

For regular catalog maintenance:

  • Use scheduled jobs with depth: 0
  • Run weekly or daily depending on freshness needs

For discovery and backfill:

  • Use manual crawls with depth: 1
  • Limit to specific categories or use limit parameter

For customer tracking:

  • Set up tracked_apps entries
  • Enable customer app tracking job
  • Bill per app per day

Monitor regularly:

  • Check queue status
  • Review job history
  • Watch for errors in DLQ

Remember:

  • Start with depth: 0 (safe, fast)
  • Add proxy for high-volume crawling
  • Rate limits exist for a reason (avoid bans)