Data Flow
This page traces a simulation job from the API request to the final report. The
flow is owned by a Temporal workflow (SimulationWorkflow in
saas/workflows/sim_workflow.py); the API only kicks it off, and the report is
generated off-pod by Celery.
End-to-end sequence
Step by step
1. API request
create_job in saas/jobs/api.py:
- Rejects job creation when
DEMO_MODEis on (read-only deploys). - Enforces
MAX_SEED_CHARSand de-duplicates identical in-flight submissions within a 60-second window (_DUP_JOB_WINDOW_SECONDS). - Validates that a
ModelRoutingrow exists for the requested tier (no row → 500). Routing supplies themodel_id,gpu_type,max_rounds,vllm_args, andtarget_agents. - Inserts a
SimulationJobrow at statusPENDING(not yet committed) and flushes to getjob.id. - Presigns MinIO upload URLs for the sim-data artifacts via
SimDataStorage. - Starts the Temporal workflow with
id=f"sim-{job.id}"on theSIM_TASK_QUEUE(sim-queue), then storesworkflow_idandworkflow_run_idon the row and commits.
If the Temporal dispatch fails, the session is rolled back and a 500 is returned, so no orphaned job row is left behind.
2. Phase 1 — pre-GPU (Temporal activities)
Before any GPU is provisioned, the workflow runs two fail-soft activities, each with a 180-second start-to-close timeout and two attempts:
fishcloud.enrich_seed— xAI Grok web/X search enrichment (only whenenrich_webis set).fishcloud.derive_markets— LLM derivation of the per-sim prediction markets.
No pod exists yet, so a failure here bypasses the GPU-phase handler; the
workflow explicitly calls fishcloud.mark_failed before re-raising.
3. Phase 2 — GPU lifecycle
The workflow provisions and drives an ephemeral pod through a set of activities
(saas/workflows/activities/provisioning.py, pipeline.py):
fishcloud.provision_pod— create the GPU pod on the configured provider.fishcloud.wait_for_worker_health— poll until the pod's HTTP worker and vLLM are ready.fishcloud.submit_and_poll—POST /jobto the pod and poll/statusto completion. The pod uploads its rich artifacts directly to MinIO, and the activity persists the report/chat-log/graph/structured results to Postgres at the source (using sync psycopg2), so large payloads never traverse Temporal.
The workflow contains explicit recovery branches: a bad-host swap (one fresh
pod if vLLM never starts), pod-unreachable retries against the same pod (proxy
blips, no swap), and an LLM-circuit-breaker / slow-pod swap. These are bounded
so a permanent outage cannot strand a job. The same-host retries rely on
submit_and_poll being idempotent (it re-checks /status on entry).
4. Finalize
fishcloud.upload_and_finalize (saas/workflows/activities/finalization.py):
- Updates job metadata (pod id, provision/pipeline seconds).
- Requires that the pod uploaded its artifacts to MinIO — a missing upload is
fatal (raises
RuntimeError), because the report flow depends on those artifacts. - Sets
sim_data_available = True, transitions the job toREPORTING, and enqueues the Celery report task.
5. Teardown (guaranteed)
The workflow's finally block terminates whatever pod is still held via
fishcloud.terminate_pod. Bad-host and circuit-breaker swaps null out their pod
references after their in-loop terminate, so the final teardown runs at most
once per successful provision. GPU instances are ephemeral and teardown happens
even on failure — see GPU Runner.
6. Off-pod report (Celery)
generate_report_task in saas/jobs/tasks_report.py runs on the Celery worker,
not on the GPU pod:
- Idempotency guard: if the job is already terminal (
COMPLETED,FAILED, orREFUNDED), it skips. - Builds a
ReportRunnerbacked by the Anthropic client and runs it. - Transient errors retry on an escalating backoff (
30, 120, 300, 900, 1800seconds — a ~55-minute window); permanent errors and exhausted retries mark the jobFAILED. - On success it persists the report markdown and the structured payload
(
adapt_structured), derives a key insight, and uploadsreport.mdto MinIO (non-fatal if MinIO is down — the DB row is authoritative).
Job status progression
JobStatus (saas/jobs/models.py): DRAFT → PENDING → PROVISIONING → RUNNING → REPORTING → COMPLETED, with FAILED reachable from any phase. REFUNDED is a
retained legacy value (see Database Schema).
Related
- System Overview — the services involved.
- Simulation Lifecycle — the conceptual view.
- Temporal — operating the workflow engine.