# Batch Failure and Recovery Runbook

> Diagnose stalled jobs, retry safely, and escalate with complete diagnostics.

## Audience

* Support agents handling batch outbound incidents
* Workspace admins/operators running outbound campaigns

## Prerequisites

* Access to batch campaign/job views
* Access to batch logs or status details
* Campaign ID and job ID for the incident

## User-visible symptoms to triage

* Job stuck in the running state with no completed calls
* Campaign delivers far fewer calls than expected
* Repeated call failures for a large contact segment
* Duplicate-call concerns after retries

## Operational checks in order

### 1) Confirm job health and stage

1. Open the affected job and record current status.
2. Confirm whether progress counters are moving.
3. Compare expected contacts vs processed contacts.
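The three checks above can be sketched as a comparison of two status snapshots taken some minutes apart. This is a minimal illustration, not a product API: the `JobSnapshot` fields, the `"running"` status string, and the 10-minute gap are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JobSnapshot:
    """Hypothetical record of what an operator notes down in step 1."""
    taken_at: datetime
    status: str
    expected_contacts: int
    processed_contacts: int


def is_stalled(earlier: JobSnapshot, later: JobSnapshot,
               min_gap: timedelta = timedelta(minutes=10)) -> bool:
    """True if the job is still running but its counters have not moved.

    min_gap is an assumed observation window; snapshots taken closer
    together than that are treated as inconclusive (returns False).
    """
    if later.taken_at - earlier.taken_at < min_gap:
        return False
    return (
        later.status == "running"
        and later.processed_contacts <= earlier.processed_contacts
        and later.processed_contacts < later.expected_contacts
    )


def remaining_contacts(snapshot: JobSnapshot) -> int:
    """Expected contacts minus processed contacts, never negative."""
    return max(snapshot.expected_contacts - snapshot.processed_contacts, 0)
```

Recording both snapshots with timestamps also gives you the time window needed later if the incident is escalated.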

### 2) Validate permit and dispatch behavior

1. Check whether dispatch is blocked by permit/concurrency limits.
2. Confirm retries are occurring for transient failures.
3. Verify that non-retryable failures are not being retried indefinitely.
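As a rough mental model of what these checks look for, the sketch below separates "dispatch is blocked by a concurrency limit" from "a failure should be retried". The reason codes, the attempt cap of 3, and the function names are hypothetical and only illustrate the expected behavior: transient failures are retried a bounded number of times, and non-retryable failures are never retried.

```python
# Assumed reason codes for illustration; real logs may use different labels.
RETRYABLE_REASONS = frozenset({"provider_timeout", "temporarily_unavailable"})
MAX_ATTEMPTS = 3  # assumed cap, not a documented platform limit


def dispatch_blocked(active_calls: int, concurrency_limit: int) -> bool:
    """True when no permit is free, so new calls cannot be dispatched."""
    return active_calls >= concurrency_limit


def should_retry(reason: str, attempts_so_far: int,
                 max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Retry only transient failures, and only until the cap is reached."""
    return reason in RETRYABLE_REASONS and attempts_so_far < max_attempts
```

If the logs show a call with a non-retryable reason being attempted more times than the cap, that matches the "retried indefinitely" pattern in step 3.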

### 3) Check duplicate prevention behavior

1. Confirm there is no duplicate job run started unintentionally.
2. Verify that repeated deliveries are deduplicated at the dispatch layer.
3. Ensure recipients are not duplicated in the upload source.
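For step 3, duplicates in an upload are often hidden by formatting differences. The sketch below is one way to surface them before relaunching. It assumes that stripping every non-digit character is an acceptable normalization for your contact list, which may not hold for all number formats.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def normalize_number(raw: str) -> str:
    """Keep digits only, so '+1 (555) 010-0000' and '15550100000' match."""
    return re.sub(r"\D", "", raw)


def find_duplicates(numbers: Iterable[str]) -> Dict[str, int]:
    """Map each normalized number that appears more than once to its count."""
    counts = Counter(normalize_number(n) for n in numbers)
    return {number: count for number, count in counts.items() if count > 1}


if __name__ == "__main__":
    upload = ["+1 (555) 010-0000", "15550100000", "+1 555 010 0001"]
    print(find_duplicates(upload))  # {'15550100000': 2}
```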

### 4) Determine retry strategy

* Retry only when failures are transient (provider timeout, temporary unavailability).
* Do not mass-retry invalid numbers or permanently failed payloads.
* Prefer targeted retries for the affected subset when possible.
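The strategy above amounts to splitting failed calls into a retry cohort and a skip list. The following sketch assumes each failed call is available as a dictionary with `contact_id` and `reason` keys, for example from an exported log. Both the record shape and the reason codes are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed labels for transient failures; adjust to what your logs show.
TRANSIENT_REASONS = {"provider_timeout", "temporarily_unavailable"}


def split_failures(
    failed_calls: Iterable[Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Return (retry_ids, skip_ids) based on each call's failure reason."""
    retry_ids: List[str] = []
    skip_ids: List[str] = []
    for call in failed_calls:
        # Transient failures go to the retry cohort; everything else is skipped.
        target = retry_ids if call["reason"] in TRANSIENT_REASONS else skip_ids
        target.append(call["contact_id"])
    return retry_ids, skip_ids
```

Only the first list would feed a targeted retry. The second list is better handled by correcting the contact data.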

## Escalation tree

### Level 1: Operator self-service

* Recheck job configuration (agent, number, campaign, upload).
* Validate contact file quality for malformed/duplicate numbers.
* Retry a small sample cohort before full relaunch.

### Level 2: Support-assisted recovery

* Confirm permit/concurrency symptoms and dispatch progression.
* Collect the structured diagnostics listed under "Required diagnostics for support escalation" from the customer.
* Recommend a corrected retry path and monitor the first 10 to 20 dispatches.

### Level 3: Engineering escalation

Escalate when any of these occur:

* Job remains stalled after a validated retry path
* Duplicate processing persists despite dedupe checks
* Provider or dispatch failures exceed expected transient threshold
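The three triggers can be expressed as a single yes/no check. This is a sketch only: the runbook does not state a numeric value for the expected transient threshold, so the 5% default below is a placeholder assumption that should be replaced with whatever your team treats as normal.

```python
def should_escalate(stalled_after_retry: bool,
                    duplicates_persist: bool,
                    failed: int,
                    total: int,
                    transient_threshold: float = 0.05) -> bool:
    """True if any Level 3 trigger applies.

    transient_threshold is an assumed placeholder (5%), not a documented value.
    """
    # Guard against division by zero when nothing has been dispatched yet.
    failure_rate = failed / total if total else 0.0
    return (
        stalled_after_retry
        or duplicates_persist
        or failure_rate > transient_threshold
    )
```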

## Required diagnostics for support escalation

* Workspace ID
* Campaign ID
* Batch Job ID
* Upload ID (if applicable)
* Time window and timezone
* Symptom summary with screenshot
* Approximate failed vs successful counts
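To avoid back-and-forth during escalation, the checklist above can be captured as a structured record and checked for gaps before it is sent. The class and field names below are illustrative and do not correspond to a defined ticket schema. `upload_id` is treated as optional because the list marks it "if applicable".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EscalationDiagnostics:
    """Hypothetical container mirroring the required diagnostics list."""
    workspace_id: str
    campaign_id: str
    batch_job_id: str
    time_window: str
    timezone: str
    symptom_summary: str
    screenshot_ref: str
    approx_failed: Optional[int] = None
    approx_successful: Optional[int] = None
    upload_id: Optional[str] = None  # only required when an upload is involved


OPTIONAL_FIELDS = {"upload_id"}


def missing_fields(diag: EscalationDiagnostics) -> List[str]:
    """List required fields that are unset or blank, in declaration order."""
    missing = []
    for f in fields(diag):
        if f.name in OPTIONAL_FIELDS:
            continue
        value = getattr(diag, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing
```

An empty result from `missing_fields` means the record covers every required item. A count of zero is accepted as a valid value.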

## Expected recovery outcomes

* Job progresses and completes with traceable success/failure counts
* Retries are controlled and non-duplicative
* Customer receives clear next action and ETA

## Related docs

* [Run Batch Outbound Calls](batch-outbound-calls-overview)
* [Campaigns](campaigns-batch-outbound-calls)
* [Batch Jobs](jobs-batch-outbound-calls)
* [Troubleshooting Hub](../troubleshooting/index)
