“It was the worst network outage…I ever seen”
I got bit by probably the worst outage ever this weekend. Chain of events:
- An F5 BigIP is plugged into a Cisco 3750 switch with default configuration
- Thanks to DTP, a trunk is formed and an STP TCN on VLAN 1 is generated
- The R-PVST+ BPDU is sent to a pair of Cisco 6509 core switches
- The core switches each pass the BPDU to a stack of Dell/Force10 S4810s, which are running 802.1w and are dual-connected via 2×10 Gbps LACP etherchannels
- The Dells pass the BPDUs to their access switches, which are all connected via 2×10 Gbps LACP etherchannels
- Every single Cisco in the data center converts the BPDUs back to R-PVST+, sees the “duplicate” (note the quotes here) BPDUs come across the channel with different bridge IDs, concludes the etherchannel is misconfigured, and disables its 2×10 Gbps uplink
Ouch. Ouch. Ouch. Ouch.
While not the longest outage I’ve experienced, this has probably been the worst. There are lots of things to fix, starting with DTP, which should be disabled on the 3750s by setting “switchport mode access” on all access layer ports. But as for the main culprit, it really looks like Cisco’s etherchannel misconfiguration guard needs some re-examination and should be used with extreme caution in mixed environments.
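As a rough sketch of the DTP fix on a 3750, something like the following should work. The interface range is hypothetical and needs to match the actual access ports, and “switchport nonegotiate” is an optional extra that explicitly stops DTP frames from being sent:

```
! Assumed port range, adjust to the real access layer ports
interface range GigabitEthernet1/0/1 - 48
 ! Force the port to access mode so DTP can never form a trunk
 switchport mode access
 ! Optional: stop sending DTP frames altogether
 switchport nonegotiate
```

With the port hard-set to access mode, a device like the BigIP can no longer negotiate a trunk just by being plugged in.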
At a minimum, enable error recovery, so that if the guard does kick in, the outage is only temporary:
errdisable recovery cause channel-misconfig (STP)
And lower the recovery interval from the default of 5 minutes (300 seconds) to the shortest time possible, 30 seconds:
errdisable recovery interval 30
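To check that recovery is actually armed, these two show commands are a reasonable starting point (exact output varies by platform and IOS version):

```
! Lists each errdisable cause, whether timer recovery is enabled, and the interval
show errdisable recovery
! Lists ports currently err-disabled and the reason
show interfaces status err-disabled
```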
If running LACP or PAgP for all aggregate links, it’s fine to disable the feature altogether:
no spanning-tree etherchannel guard misconfig
This is because, should there be a mismatch, the affected port will be removed from the bundle and set to independent mode, at which point Spanning-Tree will kick in and keep things loop-free until the configuration can be corrected.
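For reference, a minimal sketch of an LACP-negotiated uplink on the Cisco side might look like this. The interface names and channel-group number are made up for illustration:

```
! Assumed uplink ports and channel-group number
interface range TenGigabitEthernet1/1 - 2
 ! "active" means LACP is negotiated; "on" would bundle unconditionally
 channel-group 1 mode active
```

Afterwards, “show etherchannel summary” shows the state of each member port. Bundled ports are normally flagged P and stand-alone ports I, so a port that has fallen out of the bundle is easy to spot.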