
Operational Initiative · Resilience

Operational Resilience & Contingency Architecture

How we eliminated automatic denials during credit-engine outages and cut P95 eligibility API latency by 62%.

Itaú Unibanco · Collateral · 2024–2025 · Senior PM

The Critical Problem

What the credit engine did

The credit engine validated whether each customer was eligible to contract Collateral. When the engine went down, 100% of customers were denied.

In the three months prior to the project, engine outages caused ~40% drops in contracting volume on affected days.

Additional problem: latency

The collateral eligibility API had a P95 latency of 600ms, and the overall call took 900ms to complete. This slowed the contracting flow and hurt conversion.

100% of customers denied during engine outages
~40% drop in contracting volume on affected days

The Solution

1. D-1 Eligibility Base

We built a D-1 base: a snapshot of the customers who were eligible on the previous day. If the credit engine fails, we query that snapshot instead, so a customer approved yesterday remains approved today.
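The fallback logic can be sketched in a few lines. This is a minimal illustration, not the production implementation: `EngineUnavailable`, `Decision`, `build_d1_snapshot` and `check_eligibility` are hypothetical names, and the rule that a customer missing from yesterday's snapshot is denied is an assumption, since the case does not say how those customers are handled.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


class EngineUnavailable(Exception):
    """Hypothetical error raised when the credit engine cannot answer."""


@dataclass(frozen=True)
class Decision:
    eligible: bool
    source: str  # "engine" when the live engine answered, "d1_fallback" otherwise


def build_d1_snapshot(yesterday_decisions: dict[str, bool]) -> FrozenSet[str]:
    """Keep only the customers the engine approved on the previous day."""
    return frozenset(cid for cid, ok in yesterday_decisions.items() if ok)


def check_eligibility(
    customer_id: str,
    query_engine: Callable[[str], bool],
    d1_eligible: FrozenSet[str],
) -> Decision:
    """Ask the live engine first; fall back to the D-1 snapshot if it is down."""
    try:
        return Decision(eligible=query_engine(customer_id), source="engine")
    except (EngineUnavailable, TimeoutError):
        # Assumption: a customer missing from yesterday's snapshot is denied;
        # the case study does not specify how those customers are handled.
        return Decision(eligible=customer_id in d1_eligible, source="d1_fallback")


if __name__ == "__main__":
    snapshot = build_d1_snapshot({"c1": True, "c2": False})

    def engine_down(customer_id: str) -> bool:
        raise EngineUnavailable("credit engine outage")

    print(check_eligibility("c1", engine_down, snapshot))
    print(check_eligibility("c2", engine_down, snapshot))
```

Recording the decision `source` makes it easy to audit how many approvals came from the fallback during an outage, which fits the "simple, auditable logic" principle below.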

2. PUC Adapter 2.0

Migrated to the new credit-engine endpoint, which offers stronger resilience, lower latency and better availability, removing the technical bottleneck.

3. Unified Engines

Unified the Increase and Issuance credit engines, enabling faster, more consistent decisions on a cleaner, more performant architecture.

Outcomes

| Metric | Before | After |
|---|---|---|
| Availability during PUC outage | 0% (all denied) | 100% (fallback active) |
| P95 eligibility API latency | 600ms | 229ms (-62%) |
| Overall call latency | 900ms | 550ms (-39%) |
| API /increase (limit increase) | 912ms | 836ms (-8.33%) |
| API /available (limits hub) | 788ms | 714ms (-9.39%) |
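The relative changes in the outcomes table follow from (after − before) / before. A short Python check, using only the before/after values from the table itself:

```python
# Before/after latencies copied from the outcomes table, in milliseconds.
MEASUREMENTS = {
    "P95 eligibility API": (600, 229),
    "Overall call": (900, 550),
    "API /increase": (912, 836),
    "API /available": (788, 714),
}


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; a negative value means latency went down."""
    return (after - before) / before * 100


if __name__ == "__main__":
    for name, (before, after) in MEASUREMENTS.items():
        print(f"{name}: {pct_change(before, after):.2f}%")
```

It reproduces -61.83%, -8.33% and -9.39%, and gives -38.89% for the overall call, which rounds to -39%.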

Key operating principles

🛡️

Resilience is a PM decision

The fallback came from understanding the business impact of each outage, not just its engineering cause.

📐

Elegant beats complex

The D-1 base solved a critical problem with simple, auditable logic.

Performance is product

Every millisecond of latency affects conversion, so performance is as much a PM responsibility as an engineering one.

🔗

External dependencies require a plan

Assuming a third-party dependency will be available 100% of the time is a product risk.

📊

Translate technical into business

-62% latency = more conversion = better results for the bank.
