Troubleshoot RabbitMQ

Use this guide when a RabbitMQ instance is unhealthy, clients cannot connect, queues are growing unexpectedly, or an upgrade or migration does not behave as planned.

For each problem, start with the symptom, confirm the impact, run the suggested checks, and only then choose a remediation.

Pod is Running but the Cluster is Not Healthy

| Item | Guidance |
| --- | --- |
| Symptom | Pods are Running or Ready, but the workload still fails or the broker node count is lower than expected. |
| Impact | Producers, consumers, or queue replicas can fail or route unevenly. |
| Common causes | Peer discovery problems, storage issues, failed cluster join, or split cluster membership. |
| Checks | `kubectl get rabbitmqcluster <instance-name>`, `kubectl get pods -l app.kubernetes.io/name=<instance-name>`, `kubectl exec <pod> -- rabbitmqctl cluster_status` |
| Recommendations | Compare `status.conditions` with `rabbitmqctl cluster_status`. If Pod readiness and broker membership disagree, treat the cluster as degraded until all expected broker nodes appear in `cluster_status`. |
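
The membership comparison can be scripted. The sketch below assumes the JSON produced by `rabbitmqctl cluster_status --formatter json` and that it contains the keys `running_nodes`, `disk_nodes`, `ram_nodes`, and `partitions`; the function name, sample node names, and `expected_nodes` parameter are hypothetical and should be adapted to the actual output of your broker version.

```python
import json

# Sample shaped like `rabbitmqctl cluster_status --formatter json` output.
# Key names are an assumption; the sample is trimmed to the fields used here.
SAMPLE_STATUS = json.dumps({
    "disk_nodes": ["rabbit@server-0", "rabbit@server-1", "rabbit@server-2"],
    "ram_nodes": [],
    "running_nodes": ["rabbit@server-0", "rabbit@server-1"],
    "partitions": {},
})


def summarize_membership(status_json, expected_nodes):
    """Compare broker membership with the number of nodes you expect."""
    status = json.loads(status_json)
    running = set(status.get("running_nodes", []))
    known = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    partitioned = bool(status.get("partitions"))
    return {
        "running": len(running),
        # Nodes the cluster knows about that are not currently running.
        "not_running": sorted(known - running),
        "partitioned": partitioned,
        # Degraded if fewer nodes run than expected or a partition is reported.
        "degraded": len(running) < expected_nodes or partitioned,
    }


if __name__ == "__main__":
    print(summarize_membership(SAMPLE_STATUS, expected_nodes=3))
```

In practice, the input would be the captured output of the `kubectl exec` command from the Checks row, and `expected_nodes` would match the replica count of the instance.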

Clients Cannot Authenticate or Authorize

| Item | Guidance |
| --- | --- |
| Symptom | Clients report access refused, permission denied, or cannot open channels. |
| Impact | Producers cannot publish and consumers cannot consume. |
| Common causes | Wrong user, wrong password, missing virtual host, or insufficient configure, write, or read permissions. |
| Checks | `rabbitmqadmin list users`, `rabbitmqadmin list vhosts`, `rabbitmqadmin list permissions`, application configuration review |
| Recommendations | Verify that the application uses the intended virtual host and dedicated user, and confirm that permissions match the required resource naming pattern. |
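
RabbitMQ permissions are regular expressions for the configure, write, and read operations, so a naming mismatch can be checked offline. The following is a minimal sketch, assuming permission entries shaped like the management API objects (`user`, `vhost`, `configure`, `write`, `read`). The broker evaluates patterns with Erlang's regular expression engine, so Python's `re` is only an approximation for simple patterns; the user, virtual host, and queue names are hypothetical.

```python
import re


def is_allowed(permission, operation, resource_name):
    """Return True if the permission regex for `operation` matches the name.

    `operation` is one of "configure", "write", or "read". An empty pattern
    is treated like "^$", which grants access to no named resource.
    """
    pattern = permission.get(operation) or "^$"
    return re.search(pattern, resource_name) is not None


# Hypothetical permission entry for a dedicated application user.
ORDERS_PERMISSION = {
    "user": "orders-app",
    "vhost": "orders",
    "configure": "",
    "write": r"^orders\.",
    "read": r"^orders\.",
}

if __name__ == "__main__":
    for op in ("configure", "write", "read"):
        print(op, is_allowed(ORDERS_PERMISSION, op, "orders.created"))
```

A result of `False` for an operation the application needs suggests that the permission pattern, not the password, is the cause of the access refusal.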

TLS Connections Fail

| Item | Guidance |
| --- | --- |
| Symptom | TLS handshake failures, unknown CA errors, or hostname verification errors. |
| Impact | Clients and operational tools cannot connect to RabbitMQ. |
| Common causes | Missing CA trust, wrong SANs in the certificate, disabled plain listeners before clients were updated, or wrong endpoint port. |
| Checks | `kubectl explain rabbitmqcluster.spec.tls`, `kubectl exec <pod> -- rabbitmq-diagnostics listeners`, client-side TLS trust store review |
| Recommendations | Reissue certificates with the required service and external names, distribute the CA certificate, and verify that clients use `amqps://` or HTTPS with the correct port. |
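
To separate trust problems from hostname or port problems, a client-side handshake probe can help. This is a sketch using only the Python standard library; `make_context` and `probe_tls` are hypothetical helper names, the CA file path is whatever your environment uses, and port 5671 is assumed as the AMQPS listener port.

```python
import socket
import ssl


def make_context(ca_file=None):
    """Build a client context that verifies the certificate chain and hostname."""
    return ssl.create_default_context(cafile=ca_file)


def probe_tls(host, port=5671, ca_file=None, timeout=5.0):
    """Attempt a TLS handshake and describe the outcome as a short string."""
    context = make_context(ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return f"ok: {tls.version()}"
    except ssl.SSLCertVerificationError as exc:
        # Covers unknown CA and hostname (SAN) mismatch.
        return f"verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        # Other handshake failures, for example a plain listener on this port.
        return f"tls error: {exc}"
    except OSError as exc:
        # DNS failure, refused connection, or timeout.
        return f"connection error: {exc}"
```

Calling `probe_tls` with the service name that clients actually use, and with the CA file they are given, reproduces what a client sees: a verification failure points at trust or SANs, while a connection error points at the endpoint or port.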

Publishing is Blocked or Slow

| Item | Guidance |
| --- | --- |
| Symptom | Publishers stall, connection throughput drops, or application logs show blocked connections. |
| Impact | Message ingestion falls behind or stops. |
| Common causes | Memory alarms, disk alarms, or queue backlog growth. |
| Checks | `kubectl exec <pod> -- rabbitmq-diagnostics status`, `rabbitmqadmin list queues name messages message_bytes consumers`, platform monitoring dashboards |
| Recommendations | Resolve disk or memory pressure first. Then investigate why consumers are not keeping up and whether queue TTL or length limits should be added. |
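
The "disk or memory pressure first" rule can be expressed as a small check over node metrics. The sketch below assumes node objects shaped like the management API node listing, with the fields `mem_used`, `mem_limit`, `mem_alarm`, `disk_free`, `disk_free_limit`, and `disk_free_alarm`. The function name and both warning thresholds are illustrative choices, not RabbitMQ defaults.

```python
def resource_pressure(node, warn_ratio=0.8, disk_margin=2.0):
    """List resource findings for one node, alarms first, then early warnings."""
    findings = []
    if node.get("mem_alarm"):
        findings.append("memory alarm active")
    if node.get("disk_free_alarm"):
        findings.append("disk alarm active")

    mem_used, mem_limit = node.get("mem_used"), node.get("mem_limit")
    if mem_used is not None and mem_limit and not node.get("mem_alarm"):
        # Warn before the high watermark is reached.
        if mem_used / mem_limit >= warn_ratio:
            findings.append("memory near high watermark")

    disk_free, disk_limit = node.get("disk_free"), node.get("disk_free_limit")
    if disk_free is not None and disk_limit and not node.get("disk_free_alarm"):
        # Warn while free space is within `disk_margin` times the limit.
        if disk_free <= disk_limit * disk_margin:
            findings.append("disk free space near limit")
    return findings
```

An empty result for every node suggests that blocked publishers are more likely caused by backlog or consumer problems than by resource alarms.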

Queues Grow Without Consumers Catching Up

| Item | Guidance |
| --- | --- |
| Symptom | `messages_ready` or `messages_unacknowledged` grows for a sustained period. |
| Impact | Disk usage rises, publish latency can increase, and consumers can fall far behind. |
| Common causes | Consumer outages, low prefetch, slow downstream dependencies, or a retry loop. |
| Checks | `rabbitmqadmin list queues name messages consumers arguments`, application consumer metrics, retry and DLQ topology review |
| Recommendations | Restore or scale consumers, inspect retry logic, and add queue policies such as `message-ttl`, `max-length`, or DLQ routing when the workload requires bounded backlog behavior. |
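
Because the symptom is sustained growth rather than a single high reading, comparing two queue listings taken some minutes apart is more informative than one snapshot. The following sketch assumes each listing is a list of objects with `name`, `messages`, and `consumers` fields, as in the queue check above exported as JSON; the function name and the queue names in the usage example are hypothetical.

```python
def backlog_report(before, after):
    """Return (queue, growth, reason) for queues that grew between snapshots."""
    earlier = {q["name"]: q.get("messages", 0) for q in before}
    report = []
    for queue in after:
        growth = queue.get("messages", 0) - earlier.get(queue["name"], 0)
        if growth > 0:
            if queue.get("consumers", 0) == 0:
                reason = "no consumers"
            else:
                reason = "consumers not keeping up"
            report.append((queue["name"], growth, reason))
    # Largest growth first, so the worst backlog is reviewed first.
    return sorted(report, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    first = [{"name": "orders", "messages": 100, "consumers": 2},
             {"name": "orders.retry", "messages": 10, "consumers": 0}]
    second = [{"name": "orders", "messages": 400, "consumers": 2},
              {"name": "orders.retry", "messages": 900, "consumers": 0}]
    for row in backlog_report(first, second):
        print(row)
```

A queue that grows with zero consumers points at a consumer outage or a retry loop feeding an unconsumed queue, while growth with active consumers points at prefetch or downstream latency.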

Upgrade or Migration is Stuck

| Item | Guidance |
| --- | --- |
| Symptom | A version upgrade or Shovel migration does not complete as expected. |
| Impact | The environment remains in a prolonged change window. |
| Common causes | Cluster health issues before the change, incompatible plugins, insufficient storage, or unreachable source or destination brokers. |
| Checks | `kubectl get rabbitmqcluster <instance-name> -o yaml`, `kubectl exec <pod> -- rabbitmq-plugins list -e`, `kubectl exec <pod> -- rabbitmqctl shovel_status`, `kubectl exec <pod> -- rabbitmq-diagnostics status` |
| Recommendations | Reconfirm baseline health before continuing, verify network reachability and plugin compatibility, and stop the change if the cluster cannot maintain healthy membership. |
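
During a Shovel migration, it helps to list only the shovels that are not running. This is a rough sketch that assumes shovel status has been exported as a list of objects with `name`, `vhost`, and `state` fields, and that a healthy shovel reports the state `running`; both the field names and the state values are assumptions to confirm against your broker's output, and the function and shovel names are hypothetical.

```python
def stalled_shovels(shovels):
    """Return (vhost, name, state) for every shovel not in the running state."""
    return [
        (shovel.get("vhost", "/"), shovel["name"], shovel.get("state", "unknown"))
        for shovel in shovels
        if shovel.get("state") != "running"
    ]


if __name__ == "__main__":
    sample = [
        {"name": "migrate-orders", "vhost": "orders", "state": "running"},
        {"name": "migrate-billing", "vhost": "billing", "state": "terminated"},
    ]
    print(stalled_shovels(sample))
```

A shovel that stays out of the running state usually means the source or destination broker is unreachable or rejecting the connection, which is the reachability check named in the recommendations.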

Management API or UI is Unavailable

| Item | Guidance |
| --- | --- |
| Symptom | The management endpoint times out or returns errors. |
| Impact | `rabbitmqadmin`, the management UI, and operational automation cannot use the HTTP API. |
| Common causes | Service exposure problems, TLS mismatch, management listener disabled, or the broker is overloaded. |
| Checks | `kubectl get svc <instance-name>`, `kubectl get endpoints <instance-name>`, `kubectl exec <pod> -- rabbitmq-diagnostics listeners` |
| Recommendations | Confirm that the service type and port mapping match the access method, and verify whether TLS-only listeners are enabled. |
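
A quick probe can distinguish an unreachable endpoint from one that is reachable but rejecting requests. The sketch below uses only the Python standard library and assumes the default management ports (15672 for HTTP, 15671 for HTTPS) and the `/api/overview` path; the helper names, host, and credentials are hypothetical, and a service that remaps ports needs the `port` argument.

```python
import base64
import urllib.error
import urllib.request


def management_url(host, use_tls=False, port=None, path="/api/overview"):
    """Build the management API URL for the chosen access method."""
    scheme = "https" if use_tls else "http"
    if port is None:
        port = 15671 if use_tls else 15672
    return f"{scheme}://{host}:{port}{path}"


def probe_management(url, user, password, timeout=5.0):
    """Send one authenticated GET and describe the outcome as a short string."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return f"ok: HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The listener answered, so exposure works; check credentials or load.
        return f"reachable but HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # DNS, refused connection, or TLS mismatch.
        return f"unreachable: {exc.reason}"
    except OSError as exc:
        # Timeouts raised while reading the response.
        return f"unreachable: {exc}"
```

If the plain HTTP URL is unreachable but the HTTPS URL responds, TLS-only listeners are probably enabled and clients such as `rabbitmqadmin` need to be pointed at the TLS port.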