Added: vault failure diagnostics (commit 48df38d6)
The 502 fix makes vault failures non-fatal, but the causes were nearly invisible. This commit closes the blind spots (logging only, no behavior change):
HCP (hashicorp-vault.ts) — the three silent !res.ok paths now log status + Vault error body:
step=jwt-login-rejected— the dominant current failure (~182/dayVaultJwtLoginError); logsstatus,addr,jwtAuthPath,namespace,used_dashboard_token,body.step=kv-read-failed,step=kv-write-failed.
WorkOS (workos-vault.ts) — describeWorkosError() surfaces request_id / code / errors[] / rawData (which flattenCauseChain dropped) on the opaque 422, plus a logged updateObject failure path.
Datadog queries to use once deployed
service:composio-dashboard env:production @step:jwt-login-rejected
service:composio-dashboard env:production @feature:workos-vault "createObject failed" # → @workos_error.request_id
Root-cause summary (infra-side, not in this PR)
| Cause | Evidence | Owner action |
|---|---|---|
| 05-27 502 spike | 15,244 HCP getaddrinfo ENOTFOUND, all in 05-27 00:00–12:00; DNS resolves now | Resolved (cluster DNS propagation). Monitor. |
| Ongoing HCP fallbacks | ~182/day VaultJwtLoginError (Vault rejecting the JWT), not DNS/timeout | Fix JWT-auth role: confirm HCP_VAULT_USE_DASHBOARD_TOKEN, Vault trusts the dashboard JWKS/issuer, role bound_audiences/user_claim, and the KV policy template matches the path. The new jwt-login-rejected body will show which. |
| WorkOS 422 (broken fallback) | All 422, successful updates → fails on object creation; started 05-27 11:03 |