Producer Send Timeout or High Latency
Producer timeout or unstable send latency usually means the client is waiting too long for broker acknowledgment, leader availability, or network completion.
This document describes the typical causes of producer timeout and latency spikes, how to distinguish broker-side problems from client-side tuning issues, and which parameters matter most during analysis.
Impact
Producer latency problems can affect the service in several ways:
- Higher end-to-end message delay: Even if messages are eventually written, upstream applications experience slower confirmation.
- Retry amplification: When timeouts trigger retries, the cluster can receive additional write pressure while already unhealthy.
- Duplicate write risk: If a write succeeds but its acknowledgment is lost, a retry can create a duplicate record unless the producer is idempotent.
- Downstream instability: Consumers may see bursty traffic after producer latency recovers and backlogged sends are flushed.
Common Causes
- Broker load is high and request queues grow
- The partition leader is unavailable or changes frequently
- acks=all waits for replicas while ISR is unstable
- Network latency or packet loss increases round-trip time
- Message batches are too large
- Producer retry settings mask an underlying broker problem
What to Check
- Review producer logs for timeout, retry, metadata refresh, and connection reset messages.
- Check broker CPU, disk I/O, network, and request queue metrics.
- Verify whether the affected topic has under-replicated partitions or unstable ISR.
- Check partition leader movement during the same time period.
- Review producer settings such as acks, retries, delivery.timeout.ms, request.timeout.ms, linger.ms, and batch.size.
- Confirm whether the problem affects all topics or only a subset of partitions.
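As a starting point for the log review in the first item, the following sketch tallies how many log lines mention each message category. It is only an illustration: the keyword patterns, the function name tally, and the sample lines are assumptions, not real client output, and should be adapted to the wording your producer actually logs.

```python
import re
from collections import Counter

# Keyword patterns for the message categories named in the checklist.
# These patterns are assumptions; adjust them to your client's log wording.
PATTERNS = {
    "timeout": re.compile(r"timeout|timed out|expir", re.IGNORECASE),
    "retry": re.compile(r"retr(y|ies)", re.IGNORECASE),
    "metadata": re.compile(r"metadata", re.IGNORECASE),
    "connection": re.compile(r"connection reset|disconnect", re.IGNORECASE),
}

def tally(lines):
    """Count how many lines mention each category (a line may match several)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Made-up sample lines, for illustration only.
sample = [
    "WARN send failed: request timed out, retrying",
    "INFO metadata update requested",
    "WARN connection reset by peer",
]
print(dict(tally(sample)))
```

Comparing these counts across time windows can show whether timeouts coincide with metadata refreshes or connection resets, which points toward leader movement or network problems rather than client tuning.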
Important Parameters
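The producer settings that matter most here are acks, retries, delivery.timeout.ms, request.timeout.ms, linger.ms, and batch.size. These parameters interact with broker health. For example, raising delivery.timeout.ms can reduce false timeouts during short disruptions, but it will not solve replica instability or overloaded brokers.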
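For reference during analysis, the sketch below collects these settings in a plain Python dictionary keyed by Kafka's configuration names. The values are illustrative assumptions, not defaults or recommendations, and the names producer_config and describe are hypothetical.

```python
# Illustrative producer settings keyed by Kafka's configuration names.
# All values are assumptions for discussion, not tuned recommendations.
producer_config = {
    "acks": "all",                  # wait for all in-sync replicas
    "retries": 5,                   # client-side resend attempts
    "delivery.timeout.ms": 120000,  # upper bound on one send, including retries
    "request.timeout.ms": 30000,    # wait per request before treating it as failed
    "linger.ms": 10,                # how long to wait for a batch to fill
    "batch.size": 65536,            # maximum bytes per partition batch
}

def describe(config):
    """Return one 'name=value' line per setting, sorted by name."""
    return [f"{name}={value}" for name, value in sorted(config.items())]

for line in describe(producer_config):
    print(line)
```

Capturing the effective settings in one place like this makes it easier to compare an affected producer against a healthy one.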
Recommendations
Fix Cluster Instability First
Fix broker or replica instability before relaxing producer timeout settings. If leaders move often or ISR is unstable, producer tuning alone will not solve the issue.
Understand the Cost of acks=all
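acks=all provides stronger durability, but the leader acknowledges a write only after every in-sync replica has received it, so the producer depends on a healthy replication path. If follower replicas are slow, latency increases quickly; if too few replicas remain in sync, writes can fail outright.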
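The effect can be sketched with a deliberately simplified model, in which the acknowledgment waits for the slowest in-sync follower and the write is rejected when the ISR is smaller than min.insync.replicas. The function name, its parameters, and the additive latency model are assumptions for illustration; real broker behavior involves fetch cycles and queueing that this ignores.

```python
def ack_latency_ms(leader_write_ms, follower_lag_ms, min_insync_replicas):
    """Estimate acks=all acknowledgment time under a simplified model.

    follower_lag_ms maps follower id -> replication delay in ms, for followers
    that are currently in the ISR. Returns None when the ISR (leader plus
    followers) is smaller than min.insync.replicas, meaning the broker would
    reject the write instead of acknowledging it.
    """
    isr_size = 1 + len(follower_lag_ms)  # the leader is always in the ISR
    if isr_size < min_insync_replicas:
        return None
    slowest = max(follower_lag_ms.values(), default=0)
    return leader_write_ms + slowest

# One slow follower dominates the acknowledgment time.
print(ack_latency_ms(2, {"broker-2": 3, "broker-3": 40}, min_insync_replicas=2))
# No followers in the ISR: the write cannot be acknowledged.
print(ack_latency_ms(2, {}, min_insync_replicas=2))
```

In this model a single lagging follower sets the latency for every write to that partition, which is why ISR health is worth checking before touching producer settings.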
Tune Timeouts Carefully
Increase delivery.timeout.ms or request.timeout.ms only if broker behavior is otherwise healthy and the workload legitimately needs more time. Do not use very large timeouts to hide cluster problems.
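A small consistency check can catch timeout combinations that do not make sense before they reach production. The sketch below relies on one known constraint of the Java producer, which requires delivery.timeout.ms to be at least linger.ms + request.timeout.ms. The function name and the MAX_REASONABLE_DELIVERY_MS ceiling are hypothetical; pick a ceiling that fits your own workload.

```python
# Hypothetical ceiling used only to flag suspiciously large timeouts.
MAX_REASONABLE_DELIVERY_MS = 600_000

def check_timeouts(config):
    """Return a list of problems found in the timeout settings (empty if none)."""
    delivery = config["delivery.timeout.ms"]
    request = config["request.timeout.ms"]
    linger = config.get("linger.ms", 0)
    problems = []
    # The Java producer requires delivery.timeout.ms >= linger.ms + request.timeout.ms.
    if delivery < linger + request:
        problems.append("delivery.timeout.ms is smaller than linger.ms + request.timeout.ms")
    # Assumed policy check: very large values tend to hide cluster problems.
    if delivery > MAX_REASONABLE_DELIVERY_MS:
        problems.append("delivery.timeout.ms is very large and may hide cluster problems")
    return problems

print(check_timeouts({"delivery.timeout.ms": 20000, "request.timeout.ms": 30000, "linger.ms": 5}))
print(check_timeouts({"delivery.timeout.ms": 120000, "request.timeout.ms": 30000, "linger.ms": 5}))
```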
Tune Batching for the Workload
Batching can improve throughput, but larger batches can increase per-request latency and make timeout behavior harder to diagnose. Tune linger.ms and batch.size according to message size and latency goals.
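To separate normal batching delay from real timeouts, it helps to estimate how long a record can sit in a batch before the send is triggered. A batch is sent when it fills or when linger.ms expires, whichever comes first. The sketch below is a rough estimate under stated assumptions: a steady per-partition rate, uniform record size, no compression, and a sender that is always ready. The function name and parameters are hypothetical.

```python
def batching_delay_ms(batch_size_bytes, linger_ms, record_bytes, records_per_sec):
    """Estimate the longest wait of the first record in a batch, in ms.

    The batch is sent when it fills or when linger.ms expires, whichever is
    earlier. records_per_sec is the rate for a single partition.
    """
    if record_bytes <= 0 or records_per_sec <= 0:
        # No traffic to fill the batch, so only linger.ms can trigger the send.
        return float(linger_ms)
    fill_ms = batch_size_bytes * 1000.0 / (record_bytes * records_per_sec)
    return min(float(linger_ms), fill_ms)

# Low rate: linger.ms triggers the send before the batch fills.
print(batching_delay_ms(16384, 20, 1024, 200))
# High rate: the batch fills before linger.ms expires.
print(batching_delay_ms(16384, 20, 1024, 2000))
```

If observed send latency is far above this estimate plus a normal round trip, the extra time is probably not caused by batching.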
Check Leader Hotspots
If only a few leaders are overloaded, spread traffic across more partitions or rebalance leadership if the platform supports it.
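One quick way to spot skewed leadership is to count how many partitions each broker leads and flag brokers well above the average. The sketch below assumes you have already obtained a partition-to-leader mapping from your cluster tooling. The function name, the input shape, and the skew_factor threshold are hypothetical, and leader count is only a proxy for traffic, since partitions can differ widely in load.

```python
from collections import Counter

def leader_hotspots(partition_leaders, broker_ids, skew_factor=1.5):
    """Return broker ids that lead more than skew_factor times the mean share.

    partition_leaders maps partition number -> leader broker id.
    broker_ids lists every broker, including those leading no partitions.
    """
    if not broker_ids:
        return []
    counts = Counter(partition_leaders.values())
    mean = len(partition_leaders) / len(broker_ids)
    return sorted(b for b in broker_ids if counts[b] > skew_factor * mean)

# Broker 1 leads four of six partitions across three brokers.
leaders = {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3}
print(leader_hotspots(leaders, [1, 2, 3]))
```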
Best Practices
- Monitor producer latency together with ISR health and request queue metrics.
- Keep retry settings, timeout settings, and business retry logic aligned.
- Separate normal batching delay from real timeout behavior when analyzing latency.
- Treat producer latency spikes and under-replicated partitions as related symptoms until proven otherwise.
- Use idempotent producer behavior when duplicate sends would be costly.
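For the last practice, the idempotent producer only works with compatible settings: acks must be all, retries must be greater than zero, and max.in.flight.requests.per.connection must not exceed 5. The sketch below checks a configuration dictionary against those constraints. The function name is hypothetical, and settings that are absent from the dictionary are assumed to be compatible, which may not match the defaults of your client version.

```python
def idempotence_problems(config):
    """Return a list of settings that conflict with an idempotent producer."""
    if str(config.get("enable.idempotence", "false")).lower() != "true":
        return ["enable.idempotence is not set to true"]
    problems = []
    # Missing keys are assumed compatible (an assumption of this sketch).
    if str(config.get("acks", "all")) not in ("all", "-1"):
        problems.append("acks must be all")
    if int(config.get("retries", 1)) <= 0:
        problems.append("retries must be greater than 0")
    if int(config.get("max.in.flight.requests.per.connection", 5)) > 5:
        problems.append("max.in.flight.requests.per.connection must be at most 5")
    return problems

print(idempotence_problems({"enable.idempotence": True, "acks": "1", "retries": 0}))
print(idempotence_problems({"enable.idempotence": True, "acks": "all", "retries": 5}))
```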