Crawl Management & Job Scheduling

Complete guide to managing app store crawls, scheduled jobs, and queue operations.


Overview

The crawl management system handles:

  • Scheduled jobs - Recurring cron-based crawls
  • Manual crawls - On-demand category/platform crawls
  • Queue management - Monitor and control crawl queue
  • Job runs - Track history and status of crawls
  • Rate limiting - Platform-specific delays and concurrency

Job Types

1. Admin Catalog Refresh

Recurring job that crawls all categories for a platform to maintain fresh catalog data.

Purpose: Keep app catalog up-to-date for search and recommendations

Default Config:

{
  "id": "admin_google_play_catalog",
  "name": "Google Play Catalog Refresh",
  "type": "admin_catalog_refresh",
  "cron_schedule": "0 0 * * 0", // Weekly, Sundays midnight UTC
  "is_enabled": true,
  "config": {
    "platform": "google_play",
    "all_categories": true,
    "related_app_depth": 0,
    "chart_types": ["topselling_free"],
    "device": "phone"
  }
}

Configurable Parameters:

  • cron_schedule - When to run (cron syntax)
  • is_enabled - Enable/disable without deleting
  • related_app_depth - 0 (category only), 1 (+ related), 2 (recursive)
  • chart_types - Which charts to crawl
  • device - Device type (phone/tablet/iphone/ipad)
  • limit - Optional: limit number of categories (for testing)

2. Customer App Tracking

Daily job that tracks specific apps for paying customers.

Purpose: Monitor app rankings and metadata for customer subscriptions

Default Config:

{
  "id": "customer_google_play_tracking",
  "name": "Customer: Google Play App Tracking",
  "type": "customer_app_tracking",
  "cron_schedule": "0 2 * * *", // Daily at 2 AM UTC
  "is_enabled": true,
  "config": {
    "platform": "google_play"
  }
}

How it works:

  1. Queries tracked_apps table for active customer apps
  2. Crawls each app's category rankings
  3. Updates app_category_rankings and app_metadata_snapshots
  4. Billing: Per app per day

Managing Jobs

List All Jobs

Endpoint: GET /api/jobs

curl https://your-worker.workers.dev/api/jobs

Response:

{
  "jobs": [
    {
      "id": "admin_google_play_catalog",
      "name": "Google Play Catalog Refresh",
      "type": "admin_catalog_refresh",
      "cron_schedule": "0 0 * * 0",
      "is_enabled": true,
      "config": {...},
      "last_run": "2024-01-14T00:00:00Z",
      "next_run": "2024-01-21T00:00:00Z",
      "last_run_status": "completed"
    }
  ]
}

Update Job Configuration

Endpoint: PUT /api/jobs/{jobId}

curl -X PUT https://your-worker.workers.dev/api/jobs/admin_google_play_catalog \
  -H "Content-Type: application/json" \
  -d '{
    "cron_schedule": "0 2 * * *",
    "is_enabled": true,
    "config": {
      "related_app_depth": 1
    }
  }'

Partial updates supported - Only include fields you want to change.

Response:

{
  "success": true,
  "job": {
    "id": "admin_google_play_catalog",
    "cron_schedule": "0 2 * * *",
    "config": {
      "platform": "google_play",
      "all_categories": true,
      "related_app_depth": 1,
      "chart_types": ["topselling_free"],
      "device": "phone"
    },
    "updated_at": 1705276800000
  }
}
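The partial-update behavior above can be sketched as a small merge helper. This is a hypothetical illustration — the name `mergeJobUpdate` and the one-level-deep merge of `config` are assumptions, not the Worker's actual code:

```javascript
// Hypothetical sketch of partial job-update merging. Assumes top-level
// fields are replaced and `config` is merged one level deep, so omitted
// config keys are preserved.
function mergeJobUpdate(existing, patch) {
  const merged = { ...existing, ...patch };
  if (patch.config) {
    merged.config = { ...existing.config, ...patch.config };
  }
  merged.updated_at = Date.now();
  return merged;
}

const job = {
  id: "admin_google_play_catalog",
  cron_schedule: "0 0 * * 0",
  config: { platform: "google_play", related_app_depth: 0, device: "phone" },
};
const updated = mergeJobUpdate(job, {
  cron_schedule: "0 2 * * *",
  config: { related_app_depth: 1 },
});
// updated.config keeps platform and device; only related_app_depth changes
```

This is why the PUT example above only sends `related_app_depth` inside `config` yet the response still contains the full config object.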

Trigger Job Manually

Run a job immediately without waiting for cron schedule.

Endpoint: POST /api/jobs/{jobId}/run

curl -X POST https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/run

Response:

{
  "success": true,
  "run_id": "run_abc123",
  "queued_categories": 48,
  "estimated_duration": "30-45 minutes"
}

View Job History

Endpoint: GET /api/jobs/{jobId}/runs

curl https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/runs

Response:

{
  "job_id": "admin_google_play_catalog",
  "runs": [
    {
      "id": "run_abc123",
      "job_id": "admin_google_play_catalog",
      "status": "completed",
      "started_at": "2024-01-14T00:00:00Z",
      "completed_at": "2024-01-14T00:42:15Z",
      "metadata": {
        "categories_processed": 48,
        "apps_discovered": 1440,
        "errors": 0
      }
    }
  ]
}

View Job Queue

See what's currently queued for a job.

Endpoint: GET /api/jobs/{jobId}/queue

curl https://your-worker.workers.dev/api/jobs/admin_google_play_catalog/queue

Response:

{
  "job_id": "admin_google_play_catalog",
  "queue_items": [
    {
      "id": "queue_item_123",
      "category_id": "SOCIAL",
      "status": "processing",
      "created_at": "2024-01-14T00:00:00Z",
      "started_at": "2024-01-14T00:01:00Z"
    },
    {
      "id": "queue_item_124",
      "category_id": "PRODUCTIVITY",
      "status": "pending",
      "created_at": "2024-01-14T00:00:01Z"
    }
  ]
}

Manual Crawls

Start One-Off Crawl

Trigger an immediate crawl without creating a scheduled job.

Endpoint: POST /api/admin/start-crawl

curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "google_play",
    "chart_types": ["topselling_free"],
    "device": "phone",
    "limit": 5,
    "related_app_depth": 0
  }'

Parameters:

  • platform (string, required) - apple or google_play
  • chart_types (array, optional) - Google Play only: ["topselling_free", "topselling_paid", "topgrossing"]
  • device (string, optional) - phone, tablet, iphone, ipad
  • limit (number, optional) - Limit number of categories (for testing)
  • category_ids (array, optional) - Specific categories to crawl
  • related_app_depth (number, optional) - 0 (default), 1, or 2

Response:

{
  "success": true,
  "queued": 5,
  "categories": [
    {"id": "SOCIAL", "name": "Social"},
    {"id": "PRODUCTIVITY", "name": "Productivity"},
    {"id": "COMMUNICATION", "name": "Communication"},
    {"id": "ENTERTAINMENT", "name": "Entertainment"},
    {"id": "TOOLS", "name": "Tools"}
  ],
  "estimated_apps": 150,
  "estimated_duration": "5-10 minutes"
}
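The required/optional rules in the parameter list can be expressed as a small validator. This is a sketch only — `validateStartCrawl` is a hypothetical name, and the Worker's real validation may differ:

```javascript
// Hypothetical request validator mirroring the start-crawl parameters
// documented above (illustration only, not the Worker's actual code).
const PLATFORMS = ["apple", "google_play"];
const DEVICES = ["phone", "tablet", "iphone", "ipad"];

function validateStartCrawl(body) {
  const errors = [];
  if (!PLATFORMS.includes(body.platform)) {
    errors.push("platform must be apple or google_play");
  }
  if (body.device !== undefined && !DEVICES.includes(body.device)) {
    errors.push("device must be phone, tablet, iphone, or ipad");
  }
  if (body.chart_types !== undefined && body.platform !== "google_play") {
    errors.push("chart_types is Google Play only");
  }
  if (body.related_app_depth !== undefined &&
      ![0, 1, 2].includes(body.related_app_depth)) {
    errors.push("related_app_depth must be 0, 1, or 2");
  }
  return errors;
}
```

For example, `validateStartCrawl({ platform: "apple", chart_types: ["topselling_free"] })` flags `chart_types` because that parameter only applies to Google Play.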

Crawl Specific Categories

curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "google_play",
    "category_ids": ["SOCIAL", "PRODUCTIVITY"],
    "chart_types": ["topselling_free"],
    "related_app_depth": 1
  }'

Test Crawl (Single Category)

Quick test with minimal apps.

curl -X POST https://your-worker.workers.dev/api/admin/start-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "apple",
    "device": "iphone",
    "limit": 1,
    "related_app_depth": 0
  }'

Queue Status & Monitoring

Overall Queue Status

Endpoint: GET /api/admin/queue-status

curl https://your-worker.workers.dev/api/admin/queue-status

Response:

{
  "last_24h": {
    "total_jobs": 148,
    "by_platform": {
      "google_play": 48,
      "apple": 100
    },
    "by_category": {
      "SOCIAL": 1,
      "apple_grouping_25188": 1
    },
    "apps_discovered": 3360,
    "apps_updated": 3360,
    "errors": 0
  },
  "queue": {
    "pending": 0,
    "processing": 3,
    "completed": 145,
    "failed": 0
  },
  "rate_limiting": {
    "apple_delay_seconds": 4,
    "google_play_delay_seconds": 8,
    "current_concurrency": 3
  }
}

Real-Time Queue Metrics

View current queue depth and processing rate.

Endpoint: GET /api/admin/queue-metrics

curl https://your-worker.workers.dev/api/admin/queue-metrics

Response:

{
  "main_queue": {
    "name": "rankfabric-tasks",
    "pending": 12,
    "processing": 3,
    "completed_last_hour": 48,
    "failed_last_hour": 0,
    "avg_processing_time_ms": 8500
  },
  "app_details_queue": {
    "name": "app-details-fetch",
    "pending": 124,
    "processing": 3,
    "completed_last_hour": 67,
    "failed_last_hour": 1
  },
  "clickhouse_queue": {
    "name": "clickhouse-ingestion",
    "pending": 0,
    "processing": 0,
    "completed_last_hour": 145
  }
}
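A rough way to read these numbers: backlog drain time ≈ pending × average processing time ÷ concurrency. A sketch of that arithmetic, using the main queue's avg_processing_time_ms as a stand-in where a queue does not report its own:

```javascript
// Back-of-envelope backlog estimate from queue metrics (illustrative).
// drainMinutes = pending * avgProcessingMs / concurrency / 60000
function estimateDrainMinutes(pending, avgProcessingMs, concurrency) {
  if (pending === 0) return 0;
  return (pending * avgProcessingMs) / concurrency / 60000;
}

// e.g. 124 pending app-detail fetches at ~8.5s each across 3 consumers:
// 124 * 8500 / 3 ≈ 351,333 ms ≈ 5.9 minutes
```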

Related App Depth

Controls how deep the crawler goes beyond category pages (the related_app_depth parameter).

Depth 0: Category Pages Only (SAFE)

What it does:

  • Crawls only the category listing page
  • Gets ~30 top apps per category

Volume:

  • Google Play: 48 categories × 30 apps = 1,440 apps
  • Apple: 64 categories × 30 apps = 1,920 apps
  • Total: ~3,360 apps

Duration: 30-45 minutes

Use case: Regular catalog refresh, low resource usage


Depth 1: Related App Discovery (MODERATE)

What it does:

  • Crawls category listing page
  • For each app, crawls its detail page
  • Discovers apps in "Similar Apps" section

Volume:

  • Google Play: 48 categories × ~1,000 apps = 48,000 apps
  • Apple: 64 categories × ~300 apps = 19,200 apps
  • Total: ~67,000 apps

Duration: 6-8 hours

Use case: Initial catalog build, competitor discovery


Depth 2: Recursive Discovery (DANGEROUS)

What it does:

  • Crawls category → apps → related apps → related apps...
  • Continues until no new apps found

Volume:

  • 1,000,000+ apps (entire app store)

Duration: Days

Use case: Full catalog extraction (not recommended without distributed crawling)

⚠️ Warning: Can exhaust Worker CPU limits, trigger rate limiting, and cost significant money.
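The volume figures above reduce to simple arithmetic. A sketch using the approximate per-category counts quoted in this section (illustrative estimates, not measured values):

```javascript
// Back-of-envelope app-volume estimate per depth, using the approximate
// per-category counts quoted above (illustrative only).
const CATEGORY_COUNTS = { google_play: 48, apple: 64 };
const APPS_PER_CATEGORY = {
  0: { google_play: 30, apple: 30 },     // category page only
  1: { google_play: 1000, apple: 300 },  // + related-app discovery
};

function estimateApps(depth) {
  if (depth >= 2) return Infinity; // recursive discovery: effectively unbounded
  const per = APPS_PER_CATEGORY[depth];
  return CATEGORY_COUNTS.google_play * per.google_play +
         CATEGORY_COUNTS.apple * per.apple;
}

// estimateApps(0) → 3360; estimateApps(1) → 67200
```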


Rate Limiting & Performance

Platform-Specific Delays

Apple App Store:

  • Delay: 4 seconds between requests
  • Max concurrency: 3
  • Reasoning: Aggressive scraping detection

Google Play:

  • Delay: 8 seconds between requests
  • Max concurrency: 3
  • Reasoning: Very aggressive anti-bot measures
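A minimal per-platform throttle that enforces these delays might look like the following. This is a sketch, not the Worker's actual rate limiter:

```javascript
// Sketch of a per-platform throttle matching the delays above
// (hypothetical helper; the real limiter may differ).
const DELAY_MS = { apple: 4000, google_play: 8000 };

function makeThrottle() {
  const nextAllowed = { apple: 0, google_play: 0 };
  // Returns how long a request issued at `now` (ms) must wait,
  // and books the slot for the platform's next request.
  return function reserve(platform, now) {
    const wait = Math.max(0, nextAllowed[platform] - now);
    nextAllowed[platform] = now + wait + DELAY_MS[platform];
    return wait;
  };
}

const reserve = makeThrottle();
reserve("apple", 0);    // → 0 (first request goes immediately)
reserve("apple", 1000); // → 3000 (must wait out the 4 s gap)
```

Each platform is throttled independently, so Apple and Google Play crawls do not slow each other down.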

Queue Configuration

# Main task queue
[[queues.consumers]]
queue = "rankfabric-tasks"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 20 # For SERP tracking; app crawls self-limit

# App details queue (separate, slower)
[[queues.consumers]]
queue = "app-details-fetch"
max_batch_size = 5
max_batch_timeout = 30
max_retries = 2
max_concurrency = 3 # Slow to avoid bans

Performance Tuning

To speed up crawls:

  1. Increase max_concurrency (but watch for bans)
  2. Decrease rate limit delays (risky)
  3. Use proxy service (Oxylabs)

To reduce resource usage:

  1. Decrease related_app_depth
  2. Use limit parameter to crawl fewer categories
  3. Reduce cron frequency

Proxy Configuration (Optional)

Add Oxylabs proxy to reduce ban risk and speed up crawls.

Setup

wrangler secret put OXYLABS_USERNAME
wrangler secret put OXYLABS_PASSWORD

Automatic Usage

The Worker automatically uses the proxy when credentials are present:

// In crawler code
const useProxy = env.OXYLABS_USERNAME && env.OXYLABS_PASSWORD;

if (useProxy) {
  // Route through Oxylabs residential proxies
  // Reduces ban risk, allows higher concurrency
}

Benefits

  • Lower ban risk (rotating IPs)
  • Higher concurrency (up to 10)
  • Better reliability

Costs

  • Oxylabs pricing: ~$500-2000/month depending on volume

Cron Schedule Examples

Run Daily

{
  "cron_schedule": "0 2 * * *" // 2 AM UTC every day
}

Run Weekly

{
  "cron_schedule": "0 0 * * 0" // Midnight UTC every Sunday
}

Run Every 6 Hours

{
  "cron_schedule": "0 */6 * * *" // Every 6 hours
}

Run Monthly

{
  "cron_schedule": "0 0 1 * *" // Midnight UTC on 1st of month
}

Disable Cron (Manual Only)

{
  "cron_schedule": null,
  "is_enabled": false
}
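A quick sanity check for these five-field schedules can catch malformed strings before saving a job. This is a lightweight format check, not a full cron parser (`looksLikeCron` is a hypothetical helper):

```javascript
// Lightweight sanity check for 5-field cron strings.
// Only verifies field count and a character whitelist;
// it does not validate ranges or compute next-run times.
function looksLikeCron(expr) {
  if (expr === null) return true; // null disables the schedule
  const fields = expr.trim().split(/\s+/);
  return fields.length === 5 &&
         fields.every((f) => /^[\d*,/-]+$/.test(f));
}

// looksLikeCron("0 */6 * * *") → true
// looksLikeCron("0 0 * *")     → false (only 4 fields)
```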

Customer App Tracking

Track specific apps for paying customers.

Setup Customer Subscription

INSERT INTO customer_subscriptions (
  id, customer_id, project_id, subscription_type, is_active, started_at, created_at
) VALUES (
  'sub_123',
  'cust_456',
  'proj_789',
  'app_tracking',
  1,
  unixepoch(),
  unixepoch()
);

Add Apps to Track

INSERT INTO tracked_apps (
  id, project_id, app_id, platform, is_active, created_at
) VALUES (
  'track_001',
  'proj_789',
  'com.facebook.katana',
  'google_play',
  1,
  unixepoch()
);

How Daily Tracking Works

  1. Cron triggers at 2 AM UTC (customer timezone configurable)
  2. Query active subscriptions:
    SELECT DISTINCT ta.* FROM tracked_apps ta
    JOIN customer_subscriptions cs ON ta.project_id = cs.project_id
    WHERE cs.subscription_type = 'app_tracking'
    AND cs.is_active = 1
    AND ta.is_active = 1
  3. For each app:
    • Crawl app's category rankings
    • Update app_category_rankings
    • Snapshot metadata to app_metadata_snapshots
  4. Billing: Count as 1 unit per app per day
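The billing rule in step 4 is simple to state in code. A sketch, assuming billing counts only active tracked apps (`dailyBillingUnits` is a hypothetical helper, not the actual billing code):

```javascript
// Billing sketch for the daily tracking run: 1 unit per tracked app
// per day. Hypothetical helper, not the real billing implementation.
function dailyBillingUnits(trackedApps) {
  // Only active apps are crawled, so only they are billed.
  return trackedApps.filter((a) => a.is_active === 1).length;
}

const apps = [
  { app_id: "com.facebook.katana", is_active: 1 },
  { app_id: "com.example.paused", is_active: 0 },
];
// dailyBillingUnits(apps) → 1
```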

Troubleshooting

Job Not Running

Check:

  1. Job is enabled: GET /api/jobs/{jobId}
  2. Cron schedule is valid
  3. Worker cron trigger is configured in wrangler.toml

Verify cron execution:

wrangler tail --format pretty | grep "cron"

Crawl Stuck in Queue

Symptoms:

  • Queue items show "processing" for hours
  • queue_status shows high pending count

Causes:

  • Rate limiting delays (expected for large crawls)
  • App store blocking requests
  • Worker timeout (30s limit)

Check queue consumer logs:

wrangler tail --format pretty | grep "crawl_category"

Solutions:

  • Wait (rate limiting is working as intended)
  • Reduce concurrency if getting blocked
  • Add proxy credentials to avoid bans

Apps Not Appearing in Catalog

Check database:

wrangler d1 execute rankfabric_db --command \
"SELECT COUNT(*) FROM apps WHERE platform = 'google_play'"

Check recent rankings:

wrangler d1 execute rankfabric_db --command \
"SELECT category_id, COUNT(*) as apps FROM app_category_rankings
WHERE scraped_at > unixepoch() - 86400
GROUP BY category_id"

Solutions:

  • Re-run crawl with related_app_depth: 1
  • Check if category IDs are correct
  • Verify app store URLs are accessible

High Error Rate

Check dead letter queue:

# View DLQ in Cloudflare dashboard
# Or query via wrangler
wrangler queues consumer rankfabric-tasks --dlq

Common errors:

  • 429 Rate Limit (reduce concurrency)
  • Timeout (decrease batch size)
  • Parse errors (app store HTML changed)

Fix:

  1. Check error logs for specific failure
  2. Adjust rate limits/concurrency
  3. Update parser if HTML structure changed
  4. Add proxy if getting blocked

Best Practices

Regular Catalog Refresh

Recommended:

  • Weekly crawl with depth: 0 for all categories
  • Monthly crawl with depth: 1 for deep discovery

Schedule:

{
  "weekly_refresh": {
    "cron_schedule": "0 0 * * 0",
    "config": {"related_app_depth": 0}
  },
  "monthly_deep_crawl": {
    "cron_schedule": "0 0 1 * *",
    "config": {"related_app_depth": 1}
  }
}

Testing New Categories

Before adding to recurring job:

# Test single category first
curl -X POST /api/admin/start-crawl \
-d '{"platform": "google_play", "category_ids": ["NEW_CATEGORY"]}'

# Verify results
curl "/api/app-store/top-apps?category_id=NEW_CATEGORY&platform=google_play"

# Add to job config once validated
curl -X PUT /api/jobs/admin_google_play_catalog \
-d '{"config": {"category_ids": [...include NEW_CATEGORY]}}'

Monitoring Health

Daily checks:

  1. Queue status: GET /api/admin/queue-status
  2. Failed jobs: Check DLQ count
  3. App count: Verify growth in apps table

Weekly checks:

  1. Job history: Review completion times
  2. Error patterns: Identify recurring failures
  3. Coverage: Ensure all categories crawled

Set up alerts:

  • DLQ count > 10
  • Queue pending > 100 for >1 hour
  • Job completion time > 2x average
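These thresholds can be encoded in a small health-check helper. A sketch under the assumptions above (`checkAlerts` and its field names are hypothetical, not an existing API):

```javascript
// Hypothetical health-check helper encoding the alert thresholds above:
// DLQ > 10, queue pending > 100 for over 1 hour, run time > 2x average.
function checkAlerts({ dlqCount, queuePending, pendingHighSinceMs, runMs, avgRunMs, now }) {
  const alerts = [];
  if (dlqCount > 10) alerts.push("dlq_count_high");
  if (queuePending > 100 && now - pendingHighSinceMs > 60 * 60 * 1000) {
    alerts.push("queue_backlog_stuck");
  }
  if (runMs > 2 * avgRunMs) alerts.push("job_slow");
  return alerts;
}
```

Such a helper could run on its own cron trigger and push results to whatever notification channel you use.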

Advanced: Custom Job Types

Create custom job types for specific use cases.

INSERT INTO jobs (id, name, type, cron_schedule, is_enabled, config, created_at, updated_at)
VALUES (
  'trending_apps_daily',
  'Trending Apps Discovery',
  'custom_trending_discovery',
  '0 12 * * *',
  1,
  json('{"lookback_days": 7, "min_rank_change": 10}'),
  unixepoch(),
  unixepoch()
);

Implement handler in src/lib/cron-jobs.js:

case 'custom_trending_discovery':
  // Find apps that jumped 10+ positions in last 7 days
  // Enqueue deep crawl for those apps
  break;

Summary

For regular catalog maintenance:

  • Use scheduled jobs with depth: 0
  • Run weekly or daily depending on freshness needs

For discovery and backfill:

  • Use manual crawls with depth: 1
  • Limit to specific categories or use limit parameter

For customer tracking:

  • Set up tracked_apps entries
  • Enable customer app tracking job
  • Bill per app per day

Monitor regularly:

  • Check queue status
  • Review job history
  • Watch for errors in DLQ

Remember:

  • Start with depth: 0 (safe, fast)
  • Add proxy for high-volume crawling
  • Rate limits exist for a reason (avoid bans)