IOS-XE 16.6 (Everest) software upgrade on Cisco 3650 & 3850 series switches

Had a 3650 switch stack I needed to upgrade from IOS 16.6.5 to 16.6.8.  The software upgrade procedure is quite different from the 3550/3560/3750s I’d done in the past.

First step is simply to download the .bin file to the first switch’s onboard flash drive:

Switch# copy http://mysite/cat3k_caa-universalk9.16.06.08.SPA.bin flash:

I then verify the signature:

Switch#verify flash:cat3k_caa-universalk9.16.06.08.SPA.bin
Starting image verification
Digital signature successfully verified in package cat3k_caa-rpbase.16.06.08.SPA.pkg
Digital signature successfully verified in package cat3k_caa-srdriver.16.06.08.SPA.pkg
Digital signature successfully verified in package cat3k_caa-webui.16.06.08.SPA.pkg
Digital signature successfully verified in package cat3k_caa-rpcore.16.06.08.SPA.pkg
Digital signature successfully verified in package cat3k_caa-guestshell.16.06.08.SPA.pkg
Digital signature successfully verified in file flash:cat3k_caa-universalk9.16.06.08.SPA.bin

The actual command to start the upgrade is quite a mouthful:

Switch#request platform software package install switch all file flash:cat3k_caa-universalk9.16.06.08.SPA.bin new auto-copy

--- Starting install local lock acquisition on switch 1 ---
Finished install local lock acquisition on switch 1
....
SUCCESS: Software provisioned. New software will load on reboot.
[2]: Finished install successful on switch 2
Checking status of install on [1 2]
[1 2]: Finished install in switch 1 2
SUCCESS: Finished install: Success on [1 2]

This distributes the software as .pkg files on each stack member.
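If you script these upgrades, it’s worth checking the installer output before rebooting anything. Below is a minimal Python sketch, assuming the final status line always has the same shape as the one above. The function name `install_succeeded` and the idea of feeding it captured CLI text are purely illustrative, not something Cisco provides.

```python
import re

def install_succeeded(log_text, expected_members):
    """Return True if the final SUCCESS line lists every expected stack member.

    Assumes the installer ends with a line shaped like
    'SUCCESS: Finished install: Success on [1 2]' (taken from the output above).
    """
    match = re.search(r"SUCCESS: Finished install: Success on \[([\d ]+)\]", log_text)
    if match is None:
        return False
    members = {int(n) for n in match.group(1).split()}
    return set(expected_members) <= members

# Tail of the installer output shown in this post.
sample_log = """\
SUCCESS: Software provisioned. New software will load on reboot.
[2]: Finished install successful on switch 2
Checking status of install on [1 2]
[1 2]: Finished install in switch 1 2
SUCCESS: Finished install: Success on [1 2]
"""

print(install_succeeded(sample_log, [1, 2]))
```

Passing a member number that is missing from the final line, for example `[1, 2, 3]`, makes the check fail, which is what you want before reloading a stack.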

To proceed with the actual upgrade I simply rebooted the entire stack. Ideally I would have done a rolling upgrade similar to how I did 3750s a few years ago, but could not determine if this is supported on the 3650/3850.

Other useful commands:

To copy the image on the fly during the upgrade process:

Switch# request platform software package expand switch all file http://mysite.com/mypath/cat3k_caa-universalk9.16.12.04.SPA.bin to flash: auto-copy

To clean up old packages to free up disk space:

Switch# request platform software package clean switch all file flash:

Jumbo Frames on Nexus 93180YC-EX, 5672UP, and perhaps others

Cisco’s documentation implies that to enable jumbo frames on the 5K and 9K line, one must simply set mtu 9216 on the physical and logical L1/L2 interfaces:

Configure and Verify Maximum Transmission Unit on Cisco Nexus Platforms

However, having worked with the 5672UP previously and the 93180YC-EX currently, I can tell you that both are actually based on the obscure 6K line.

And, per the 5672UP documentation, in order to get jumbo frames, you must do this additional step:

policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos jumbo

After applying this, also set mtu 9216 on the L3 SVIs:

Switch(config)#interface Vlan200
  no shutdown
  mtu 9216

Switch# show interface vl200
Vlan200 is up, line protocol is up, autostate enabled
Hardware is EtherSVI, address is 70ea.1a44.d0a7
Internet Address is 192.168.200.1/24
MTU 9216 bytes, BW 1000000 Kbit, DLY 10 usec,

Switch# show interface et1/17
Ethernet1/17 is up
admin state is up, Dedicated Interface
Belongs to Po17
Hardware: 100/1000/10000/25000 Ethernet, address: 70ea.1a44.d0b8 (bia 70ea.1a44.d0b8)
Description: Storage Filer
MTU 9216 bytes, BW 10000000 Kbit, DLY 10 usec
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, medium is broadcast
Port mode is trunk
full-duplex, 10 Gb/s, media type is 10G
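To check a batch of interfaces rather than eyeballing each one, the MTU line can be pulled out of the show output with a few lines of Python. This is only a sketch: it assumes the "MTU N bytes" wording seen in the output above, and the helper name `interface_mtu` is made up for illustration. It confirms the configured value only, not that jumbo frames actually pass end to end.

```python
import re

JUMBO_MTU = 9216  # value configured in this post

def interface_mtu(show_output):
    """Extract the MTU in bytes from 'show interface' text, or None if absent."""
    match = re.search(r"MTU (\d+) bytes", show_output)
    return int(match.group(1)) if match else None

# Abbreviated copies of the two outputs shown above.
svi_output = """\
Vlan200 is up, line protocol is up, autostate enabled
MTU 9216 bytes, BW 1000000 Kbit, DLY 10 usec,
"""
port_output = """\
Ethernet1/17 is up
MTU 9216 bytes, BW 10000000 Kbit, DLY 10 usec
"""

for name, text in (("Vlan200", svi_output), ("Ethernet1/17", port_output)):
    status = "ok" if interface_mtu(text) == JUMBO_MTU else "check me"
    print(f"{name}: {status}")
```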

 

93180YC – Jumbo Frames

 

Monitoring CPU & Memory in IOS-XE

ios-xe_cpu

One important thing to understand in IOS-XE is the different numbers that can be returned when checking CPU and memory statistics.  There are some very down-in-the-weeds docs on this, but the simplest way to break it down is process vs. platform.  Process is essentially control plane, while platform is data plane.

CPU

Processor CPU

CLI command: show processes cpu

SNMP OIDs:

1.3.6.1.4.1.9.2.1.56.0 = 5 second
1.3.6.1.4.1.9.2.1.57.0 = 1 minute
1.3.6.1.4.1.9.2.1.58.0 = 5 minute

Platform CPU

CLI command: show processes cpu platform

SNMP OIDs:

1.3.6.1.4.1.9.9.109.1.1.1.1.3.7 = 5 second
1.3.6.1.4.1.9.9.109.1.1.1.1.4.7 = 1 minute
1.3.6.1.4.1.9.9.109.1.1.1.1.5.7 = 5 minute

Note – Most platforms will be multi-core.

Memory

Processor Memory

CLI command: show processes memory

SNMP OIDs:

1.3.6.1.4.1.9.9.48.1.1.1.5.1 = Memory Used
1.3.6.1.4.1.9.9.48.1.1.1.6.1 = Memory Free

Platform Memory

CLI command: show platform resources

SNMP OIDs:

1.3.6.1.4.1.9.9.109.1.1.1.1.12.7 = Memory Used
1.3.6.1.4.1.9.9.109.1.1.1.1.13.7 = Memory Free
1.3.6.1.4.1.9.9.109.1.1.1.1.27.7 = Memory Committed
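Most graphing tools want a percentage rather than raw used/free counters. Here’s a rough Python sketch of that calculation for the platform memory OIDs. The OIDs are copied from the list above, but the sample numbers are invented and the trailing .7 index is simply what this post uses, so treat it as an assumption and confirm it with an SNMP walk on your own hardware.

```python
# OIDs copied from the list above. The trailing ".7" index is taken from this
# post and is an assumption: confirm it with an SNMP walk on your own device.
PLATFORM_MEMORY_OIDS = {
    "used": "1.3.6.1.4.1.9.9.109.1.1.1.1.12.7",
    "free": "1.3.6.1.4.1.9.9.109.1.1.1.1.13.7",
    "committed": "1.3.6.1.4.1.9.9.109.1.1.1.1.27.7",
}

def memory_used_percent(used, free):
    """Return used memory as a percentage of used + free (unit-independent)."""
    total = used + free
    if total <= 0:
        raise ValueError("used + free must be positive")
    return 100.0 * used / total

# Hypothetical poll results, NOT taken from a real switch.
sample = {"used": 3_000_000, "free": 1_000_000}
print(f"{memory_used_percent(sample['used'], sample['free']):.1f}% used")
```

Because only the ratio matters, the same function works for the processor memory pair (Memory Used / Memory Free) as well.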

Cacti Templates

These were written for Cacti 0.8.8f

https://spaces.hightail.com/space/FoUD1PvlXA

 

LACP with Palo Alto Firewalls

Today’s task was to get LACP working on a Palo Alto, so traffic and fault tolerance could be spread across multiple members of a Cisco 3750X switch stack.  The default settings on the Palo Alto surprised me a bit, as I was expecting it to default to active and enable fast timers, but this was easy to set:

paloalto_lacp_fast.png

Unfortunately during testing, it still took a good minute for failover to work.  This is because the standby unit disables interfaces until going active, so there’s a delay of 30-40 seconds for LACP bundling plus an additional 25-50 seconds for Spanning-Tree.  Working around Spanning-Tree was easy: just use Edge port aka PortFast.  Note it should be enabled at the channel level and ‘trunk’ must be added for it to work on trunk ports:

interface Port-channel4
 description Palo Alto Firewall - LACP
 switchport trunk encapsulation dot1q
 switchport mode trunk
 logging event trunk-status
 logging event bundle-status
 spanning-tree portfast trunk
!

Speeding up LACP took a bit more research.  Apparently, only data center grade Cisco switches like the Catalyst 6500 and Nexus line support LACP 1-second fast timers out of the box.  The Catalyst 3750 however will support fast timers on the bleeding edge 15.2(4)E train.

Upon testing, the failover downtime due to LACP bundling is now under 10 seconds:

Jul 20 17:58:22 PST: %EC-5-UNBUNDLE: Interface Gi4/1/1 left the port-channel Po31
Jul 20 17:58:22 PST: %EC-5-UNBUNDLE: Interface Gi3/1/1 left the port-channel Po31
Jul 20 17:58:30 PST: %EC-5-BUNDLE: Interface Gi3/1/2 joined port-channel Po32
Jul 20 17:58:32 PST: %EC-5-BUNDLE: Interface Gi4/1/2 joined port-channel Po32
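To put a number on the failover time without counting timestamps by hand, the log lines can be parsed with a short script. A minimal Python sketch follows. It assumes all events fall on the same day and that the %EC-5-UNBUNDLE / %EC-5-BUNDLE messages are the only ones of interest. The function name `failover_gaps` is made up for this example.

```python
import re
from datetime import datetime

# Log lines copied from the failover test above.
LOG = """\
Jul 20 17:58:22 PST: %EC-5-UNBUNDLE: Interface Gi4/1/1 left the port-channel Po31
Jul 20 17:58:22 PST: %EC-5-UNBUNDLE: Interface Gi3/1/1 left the port-channel Po31
Jul 20 17:58:30 PST: %EC-5-BUNDLE: Interface Gi3/1/2 joined port-channel Po32
Jul 20 17:58:32 PST: %EC-5-BUNDLE: Interface Gi4/1/2 joined port-channel Po32
"""

EVENT = re.compile(r"(\d\d:\d\d:\d\d) \w+: %EC-5-(UNBUNDLE|BUNDLE):")

def failover_gaps(log_text):
    """Return (seconds until first member bundles, seconds until last member bundles).

    Only the time of day is parsed, so this assumes every event is on the same day.
    """
    downs, ups = [], []
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        (downs if match.group(2) == "UNBUNDLE" else ups).append(stamp)
    start = min(downs)
    return ((min(ups) - start).total_seconds(),
            (max(ups) - start).total_seconds())

first_up, all_up = failover_gaps(LOG)
print(f"first member bundled after {first_up:.0f}s, all members after {all_up:.0f}s")
```

For the log above this works out to 8 seconds for the first member and 10 seconds for the second.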

 

 

 

Internal NICs on UCS E-Series Servers

General rules:

  1. The router’s ‘ucse1/0/0 + ucse1/0/1’ ports will map to GE0 and GE1 on the blade
  2. Bridging must be used to join the physical interfaces to the same broadcast domain
  3. Separate bridges must be used for each VLAN
  4. To use a native vlan, use “encapsulation untagged”

In our case we wanted to use tagging in the connection to the switch, but put the blade’s GE0 and GE1 interfaces on vlan 123 without tagging:

bridge-domain 123
!
interface Port-channel1
 no ip address
 no negotiation auto
 service instance 123 ethernet
 encapsulation dot1q 123
 rewrite ingress tag pop 1 symmetric
 snmp ifindex persist
 bridge-domain 123
 !
interface GigabitEthernet0/0/1
 description Switch:1
 no ip address
 negotiation auto
 channel-group 1 mode active
interface GigabitEthernet0/0/2
 description Switch:2
 no ip address
 negotiation auto
 channel-group 1 mode active
!
interface ucse1/0/0
 description UCS E-Series Blade:GE0
 no ip address
 no negotiation auto
 switchport mode trunk
 service instance 123 ethernet
 encapsulation untagged
 bridge-domain 123
 !
interface ucse1/0/1
 description UCS E-Series Blade:GE1
 no ip address
 no negotiation auto
 switchport mode trunk
 service instance 123 ethernet
 encapsulation untagged
 bridge-domain 123
 !

 

How many 3750-E switches can an RPS 2300 back up?

The Q&A sheet mentions that with dual 1150W power supplies, an RPS can back up 1 or 2 3750-E switches.  This assumes the switches also have 1150W power supplies installed.

But what if they’re only using 750W?  The RPS would have a total of 2300W of power, while three of the switches would only require 2250W.  So it should be able to back up three switches, right?

Nope.  You can only back up two.  So there’s really no point in installing 1150W power supplies in the RPS.

Switch#show env rps
DCOut State Connected Priority BackingUp WillBackup Portname SW#
----- ------- --------- -------- --------- ---------- --------------- ---
 1 Active Yes 6 Yes Yes FDO1525Y1T5 1
 2 Active No 6 No No <> -
 3 Active Yes 6 Yes Yes FDO1417R07E 3
 4 Active No 6 No No <> -
 5 Active Yes 6 No No FDO1406R1KU 5
 6 Active No 6 No No <> -

Yet another reason why the RPS sucks and StackPower on the 3750X and 3850 series is so much better.
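The wattage math versus what the RPS actually does can be laid out in a few lines of Python. This is just a sketch of the arithmetic from this post: the two-switch cap is what I observed in the show env rps output, not a formula from Cisco, and the function name is invented for illustration.

```python
RPS_SUPPLY_WATTS = 1150      # each RPS 2300 power supply in this setup
RPS_SUPPLY_COUNT = 2
OBSERVED_BACKUP_LIMIT = 2    # cap observed in this post, not derived from wattage

def switches_backed_up(switch_supply_watts):
    """Return (count the wattage alone would allow, count actually backed up)."""
    total_watts = RPS_SUPPLY_WATTS * RPS_SUPPLY_COUNT
    by_wattage = total_watts // switch_supply_watts
    return by_wattage, min(by_wattage, OBSERVED_BACKUP_LIMIT)

for watts in (1150, 750):
    naive, actual = switches_backed_up(watts)
    print(f"{watts}W switches: wattage suggests {naive}, RPS backs up {actual}")
```

With 750W supplies the wattage suggests three switches, but the observed limit still holds it to two.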

Wimpy Buffers on Cisco 3750/3560 switches

Pretty much anyone who’s worked with Cisco switches is familiar with the 3750 series and its sister series, the 3560.  These switches started out as 100Mb some 15 years ago, went to Gigabit with the G series, 10 Gb with the E series, and finally 10 Gb SFPs and StackPower with the X series in 2010.  In 2013, the 3560 and 3750 series rather abruptly went end of sale, in favor of the 3650 and 3850 series, respectively.  Cisco did however continue to sell their lower-end cousin, the Layer 2 only 2960 series.

3560 & 3750s are deployed most commonly in campus and enterprise wiring closets, but it’s not uncommon to see them as top of rack switches in the data center.  The 3750s are especially popular in this regard because they’re stackable.  In addition to managing multiple switches via a single IP, they can connect to the core/distribution layer via aggregate uplinks, which saves cabling mess and port cost.

Unfortunately, I was reminded recently that the 3750s come with a huge caveat: small buffer sizes.  What’s really shocking is that as Cisco added horsepower in terms of bps and pps with the E and X series, they kept the buffer sizes exactly the same: 2MB per 24 ports.  In comparison, a WS-X6748-GE-TX blade on a 6500 has 1.3 MB per port.  That’s roughly 15x as much.  When a 3750 is handling high bandwidth flows, you’ll almost always see output queue drops:

 

Switch#show mls qos int gi1/0/1 stat
  cos: outgoing 
-------------------------------

  0 -  4 :  3599026173            0            0            0            0  
  5 -  7 :           0            0      2867623  
  output queues enqueued: 
 queue:    threshold1   threshold2   threshold3
-----------------------------------------------
 queue 0:           0           0           0 
 queue 1:  3599026173           0     2867623 
 queue 2:           0           0           0 
 queue 3:           0           0           0 

  output queues dropped: 
 queue:    threshold1   threshold2   threshold3
-----------------------------------------------
 queue 0:           0           0           0 
 queue 1:    29864113           0         171 
 queue 2:           0           0           0 
 queue 3:           0           0           0 
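The buffer comparison and the counters above can be worked through numerically. The Python sketch below uses only figures quoted in this post (2MB per 24 ports, 1.3 MB per port, and the queue 1 counters). Treating dropped divided by enqueued as the drop rate is my assumption about what the counters mean, and the function names are illustrative.

```python
def per_port_buffer_mb(total_mb, ports):
    """Average buffer per port when a pool is shared evenly across ports."""
    return total_mb / ports

# Figures quoted in this post.
cat3750_per_port_mb = per_port_buffer_mb(2.0, 24)   # about 0.083 MB per port
x6748_per_port_mb = 1.3
buffer_ratio = x6748_per_port_mb / cat3750_per_port_mb

def drop_percent(dropped, enqueued):
    """Dropped frames as a percentage of enqueued frames (assumed meaning of the counters)."""
    return 100.0 * dropped / enqueued

# Queue 1 counters from the show output above (threshold1 + threshold3).
queue1_enqueued = 3599026173 + 2867623
queue1_dropped = 29864113 + 171
queue1_drop_pct = drop_percent(queue1_dropped, queue1_enqueued)

print(f"3750 buffer per port: {cat3750_per_port_mb * 1000:.0f} KB")
print(f"6748 has {buffer_ratio:.1f}x the per-port buffer")
print(f"queue 1 drop rate: {queue1_drop_pct:.2f}%")
```

Under those assumptions the 6748 has a bit over 15 times the per-port buffer, and queue 1 is dropping a little under 1% of what it enqueues.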

There is a partial workaround for this shortcoming: enabling QoS and tinkering with queue settings.  When enabling QoS, the input queue goes 90/10 while the output queue goes 25/25/25/25.  If the majority of traffic is CoS 0 (which is normal for a data center), the buffer settings for output queue #2 can be pushed way up.

mls qos queue-set output 1 threshold 2 3200 3200 50 3200
mls qos queue-set output 1 buffers 5 80 5 10
mls qos

Note here that queue-set 1 is the “default” set applied to all ports.  If you want to do some experimentation first, modify queue-set 2 and apply this to a test port with the “queue-set 2” command.  Also note that while the queues are called 1-2-3-4 in configuration mode, they’ll show up as 0-1-2-3 respectively in the show commands.  So, clearly the team writing the configuration and writing the show output weren’t on the same page.  That’s Cisco for you.
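Since the off-by-one between config and show output is easy to trip over, here’s a small Python sketch that parses the buffers line used above and prints each queue under both names. It only splits the command text. It doesn’t validate the values against what IOS accepts, and the helper names are invented for illustration.

```python
def buffer_shares(command):
    """Parse 'mls qos queue-set output N buffers a b c d' into {config_queue: percent}."""
    tokens = command.split()
    start = tokens.index("buffers") + 1
    shares = [int(t) for t in tokens[start:start + 4]]
    return {queue: pct for queue, pct in enumerate(shares, start=1)}

def show_label(config_queue):
    """Config mode numbers queues 1-4; the show commands label them 0-3."""
    return config_queue - 1

shares = buffer_shares("mls qos queue-set output 1 buffers 5 80 5 10")
for queue, pct in shares.items():
    print(f"config queue {queue} = 'queue {show_label(queue)}' in show output: {pct}% of buffers")
```

This makes it obvious that the 80% share lands on config queue 2, which is the "queue 1" carrying nearly all the traffic in the statistics above.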

Bottom line: don’t expect more than 200 Mbps per port when deploying a 3560 or 3750 to a server farm.  I’m able to work with them for now, but will probably have to look at something beefier long term.  Since we have Nexus 5548s and 5672s at the distribution layer, migrating to the Nexus 2248 fabric extenders is the natural path here.  I have worked with the 4948s in the past but was never a big fan due to the high cost and non-stackability.  End of row 6500 has always been my ideal deployment scenario for a Data Center, but the reality is sysadmins love top of rack because they see it as “plug-n-play”, and ironically fall under the misconception that having a dedicated switch makes over-subscription less likely.

Why EtherChannel Misconfig Guard is a disaster in mixed vendor STP environments

large-marge-01

“It was the worst network outage…I ever seen”

I got bit by probably the worst outage ever this weekend.   Chain of events:

  1. F5 BigIP is plugged in to Cisco 3750 switch with default configuration
  2. Thanks to DTP, trunk is formed and STP TCN on vlan 1 is generated
  3. R-PVST+ BPDU is sent to pair of Cisco 6509 core switches
  4. Core switches each pass the BPDU to a stack of Dell/Force10 S4810s, which are running 802.1w and are dual-connected via 2×10 Gbps LACP etherchannels
  5. The Dells pass the BPDUs to their access switches, which are all connected via 2×10 Gbps LACP etherchannel.
  6. Converting the BPDUs back to R-PVST+, every single Cisco in the data center sees the “duplicate” (note the quotes here) BPDU come across the channel with different bridge IDs, concludes the etherchannel is misconfigured, and disables its 2x10Gbps uplink

Ouch.  Ouch.  Ouch.  Ouch.

While not the longest outage I’ve experienced, this has probably been the worst.  There are lots of things to fix, namely DTP should be disabled on 3750s with “switchport mode access” for all access layer ports.  But for the main culprit, it really looks like Cisco’s etherchannel misconfiguration guard needs some re-examination and should be used with extreme caution in mixed environments.

At a minimum, enable error recovery, so that if it does kick, it’s only temporary:

errdisable recovery cause channel-misconfig (STP)

And lower the interval from the default of 5 minutes to the shortest time possible:

errdisable recovery interval 30

If running LACP or PAgP for all aggregate links, it’s fine to disable the feature altogether:

no spanning-tree etherchannel guard misconfig

This is because, should there be a mismatch, the offending port will be removed from the bundle and set to independent mode, at which point Spanning-Tree will kick in and make all well until the configuration can be corrected.

48 Port Copper Modules for 6500 Series

WS-X6148 & WS-X6548

  • No Jumbo Frame Support
  • 1MB Buffer per 8 ports (average 128KB per port)
  • Not supported with newest Supervisors (VS-S720-10G & VS-SUP2T-10G)

WS-X6148A & WS-X6748

  • Jumbo Frame Support
  • 2.67MB or 1.3MB Buffer per port
  • Supported across all Supervisors

In short, the 6148 and 6548s should only be used in low-traffic environments, with SUP32 and SUP720 (non VSS) Supervisor cards.  The 6148A and 6748 are for high-traffic environments, such as server farms.