Incident Summary:
During the incident window, one or more interfaces participating in a Link Aggregation Group (LAG) became unstable due to intermittent link state changes, commonly referred to as interface flapping. The affected interfaces transitioned rapidly and repeatedly between up and down states.
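As an illustration only, flapping of this kind can be recognised by counting link state transitions inside a sliding time window. The sketch below is not the detection logic used on the affected devices; the function name `is_flapping`, the 60-second window and the threshold of 5 transitions are all assumptions chosen for the example.

```python
from collections import deque


def is_flapping(transition_times, window_seconds=60.0, threshold=5):
    """Return True if at least `threshold` link state transitions
    occur within any span of `window_seconds`.

    `transition_times` is an iterable of timestamps in seconds.
    The default window and threshold are illustrative assumptions.
    """
    recent = deque()
    for t in sorted(transition_times):
        recent.append(t)
        # Discard transitions that have fallen out of the window.
        while t - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False
```

Five transitions spread over 20 seconds would be reported as flapping with these defaults, whereas the same number spread over several minutes would not.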
Root Cause:
The flapping condition was caused by a degraded physical link within the LAG. This instability led to repeated link renegotiations, disrupting normal traffic flow and preventing the system from maintaining a stable aggregated interface.
Impact:
The LAG could not consistently deliver or receive traffic due to frequent interface transitions.
MAC address flapping occurred at the switch level, causing instability in the forwarding tables and intermittent packet loss.
Although firewall state synchronisation and heartbeat traffic were handled by a separate, unaffected LAG, the affected LAG was responsible for distribution network connectivity. Its failure caused a complete service outage, as traffic could not be routed correctly through the firewall cluster despite the HA state remaining intact.
An issue connecting to our remote out-of-band (OOB) management system delayed our engineering response to this incident.
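The MAC address flapping listed above means the same MAC address was learned on different ports in quick succession. A minimal sketch of how such moves could be counted from a stream of learning events follows. The event format (MAC address, port name) and the port names `ae0` and `ae1` are hypothetical and are not taken from the affected switches.

```python
from collections import defaultdict


def count_mac_moves(learn_events):
    """Count how many times each MAC address moves between ports.

    `learn_events` is an ordered iterable of (mac, port) tuples,
    a hypothetical representation of switch MAC learning events.
    Only MAC addresses that moved at least once are returned.
    """
    last_port = {}
    moves = defaultdict(int)
    for mac, port in learn_events:
        # A move is a learning event on a different port than last time.
        if mac in last_port and last_port[mac] != port:
            moves[mac] += 1
        last_port[mac] = port
    return dict(moves)
```

A MAC address with a high move count over a short period is a candidate for the forwarding table instability described in this section.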
Remedial Actions:
The faulty physical link was isolated, and both the optical transceivers and cables were replaced to restore stable connectivity. Behaviours observed during the incident are currently under review by our engineering team to fully understand the contributing factors, particularly the firewall cluster’s response to the external-facing LAG failure.
We will schedule a follow-up maintenance window to reduce the likelihood of recurrence. This will include a review of HA link monitoring configurations and firewall failover logic, and an audit of all LAGs across redundant systems to ensure they are functionally isolated, resilient, and correctly monitored.
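To give a sense of what a LAG audit of this kind might check, here is a rough sketch under stated assumptions. The `Lag` record, its fields and the three checks are hypothetical; the real audit criteria will be defined by our engineering team and may differ.

```python
from dataclasses import dataclass


@dataclass
class Lag:
    """Hypothetical inventory record for one LAG."""
    name: str
    members_up: int      # member links currently up
    members_total: int   # member links configured
    min_links: int       # active links required for the LAG to stay up
    monitored: bool      # whether HA or monitoring tracks this LAG


def audit_lags(lags):
    """Return a list of (lag_name, finding) tuples for LAGs that
    fail any of the illustrative checks below."""
    findings = []
    for lag in lags:
        if not lag.monitored:
            findings.append((lag.name, "not monitored"))
        if lag.members_total < 2:
            findings.append((lag.name, "no member redundancy"))
        if lag.members_up < lag.min_links:
            findings.append((lag.name, "below minimum active links"))
    return findings
```

For example, a two-member LAG with both links down and no monitoring would produce two findings, while a healthy, monitored two-member LAG would produce none.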
This issue is separate from the incident on Thursday 3rd April, which is currently being reviewed by Juniper TAC.