For the last couple of months I’ve been rolling out Cisco CSR1000v virtual routers in Amazon Web Services as a way to connect virtual private clouds to the parent company’s Cisco DMVPN.  The first four routers were launched in January 2019 on software version 16.9.1.  I was a bit nervous using a bleeding-edge version of IOS-XE, but they went in fine and had zero issues.  Since the throughput requirement is not more than 100 Mbps, t2.medium instances were selected, as they can supposedly do up to 250 Mbps.

We launched the second four in February, by which point 16.9.2 was the default version.  Everything seemed fine until DMVPN & EIGRP were configured.  A few days after doing so, two of the four routers running 16.9.2 had lost network connectivity on one of their interfaces.  Pings would fail; the ARP table was populated, but nothing was reachable.  A shut/no shut on the interface didn’t help.  AWS CloudWatch & CloudTrail didn’t show much, other than traffic on one of the router’s interfaces suddenly dropping from a small, steady rate (200 Kbps or so) to absolute zero.  The only solution: reboot the router.
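For context, this is roughly the sequence I’d run from the affected router when an interface went dark; the interface name and gateway address below are placeholders, not the real ones:

Router# show ip interface brief | include GigabitEthernet2
Router# show ip arp GigabitEthernet2
! ARP entries for the subnet are present...
Router# ping 10.0.1.1 source GigabitEthernet2
! ...but pings to the gateway (or anything else) fail
Router# configure terminal
Router(config)# interface GigabitEthernet2
Router(config-if)# shutdown
Router(config-if)# no shutdown
! bouncing the interface makes no difference; only a reload recovers it
Router(config-if)# end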

Per usual, Cisco TAC was useless.  The routers had their IP addresses pulled via DHCP with the default 1-hour lease time, so after an outage, logs would show a DHCP renewal failure 1-59 minutes after the EIGRP neighbors dropped.  I switched to static IPs and saw the same behavior, but TAC continued to ask for DHCP debugging commands, seemingly unable to get it through their thick skulls that of course the DHCP renewal was failing…because there was no connectivity.  In frustration, I attempted to open an entirely new case while running static IPs, which they promptly classified as a duplicate.
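For reference, the DHCP-to-static change itself is trivial; the address below is made up, and on AWS it has to match the private IP already assigned to the ENI:

Router# configure terminal
Router(config)# interface GigabitEthernet1
Router(config-if)# no ip address dhcp
Router(config-if)# ip address 10.0.1.10 255.255.255.0
! must match the ENI's assigned private IP; the VPC still routes per the ENI
Router(config-if)# end
Router# copy running-config startup-config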

Thankfully, browsing Reddit once again paid off.  A user mentioned trying to roll out CSR1000vs on t2.mediums with DMVPN configured and having nothing but problems.  When asked for details, he described not being able to keep them online for more than a few hours, with the root cause being CPU spikes triggered by route changes.  Per Amazon’s advice, he rebuilt the instances as c4.larges, and they were 100% stable.

Since I could only replicate this behavior in 16.9.2 and not 16.9.1, I concluded that software played a role in this problem as well.  So I tried downgrading to 16.6.5, as that’s overall been the most stable version of IOS-XE I’ve worked with.  To my surprise, this made the problem much worse, and the routers would lose connectivity within as little as 10 minutes of uptime.  In desperation, I bumped it up to 16.7.3…and?  All was well.
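For anyone repeating the version shuffle, moving between releases on the CSR1000v is just a matter of repointing the boot statement and reloading; the image filename below is illustrative and assumes the new .bin has already been copied to bootflash:

Router# configure terminal
Router(config)# no boot system
Router(config)# boot system flash bootflash:csr1000v-universalk9.16.07.03.SPA.bin
Router(config)# exit
Router# copy running-config startup-config
Router# reload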

It all clicked together when I read the IOS-XE 16.7 release notes:

Features—Cisco IOS XE Fuji 16.7.1

The following new software features are supported on the Cisco CSR 1000v for Cisco IOS XE Fuji 16.7.1.

  • Templates for vCPU allocation and distribution—including the Control Plane heavy template.
WELL THEN!!!!

Sure enough, while subtle, you can definitely see changes to CPU utilization when migrating from 16.6 to 16.7:

[csr1000v_cpu_166_vs_167.png: CSR1000v CPU utilization, 16.6 vs 16.7]

I’ve been running 16.7.3 on the t2.mediums for a full week now with absolutely no problems at all.
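If you want to see the same effect on your own routers, the stock IOS-XE CPU views are enough to compare before and after an upgrade:

Router# show processes cpu sorted | exclude 0.00
! 5-second / 1-minute / 5-minute utilization plus the busiest processes
Router# show processes cpu history
! ASCII graphs of CPU over the last 60 seconds, 60 minutes, and 72 hours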

Summary: Cisco’s claim that a t2.medium in AWS can do up to 250 Mbps, while accurate, fails to take into account control plane limitations.

The crash was traced to bug

It makes sense that this bug would be hit in any scenario where the CSR1000v is on the same subnet as an AWS instance, because the default MTU of AWS instance interfaces is 9001 bytes.

As a workaround, Cisco lowered the MTU at the vNIC level in IOS-XE from 9238 bytes in 16.9.1 to 1500 bytes in 16.9.2:

Router#show platform software system all
VNIC Details
============
Name               Mac Address     Status   Platform MTU
GigabitEthernet1   12bf.4f69.eb56  UP       1500
GigabitEthernet2   1201.c32b.1d16  UP       1500

So this would cause the CSR1000v to fragment packets larger than 1500 bytes.
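As a general aside (not part of the fix here), the usual way to keep large packets from being fragmented across a DMVPN tunnel is to clamp the tunnel MTU and TCP MSS; the values below are the commonly used ones for GRE over IPsec:

Router# configure terminal
Router(config)# interface Tunnel0
Router(config-if)# ip mtu 1400
Router(config-if)# ip tcp adjust-mss 1360
! leaves headroom for GRE + IPsec overhead so transit TCP traffic stays under the path MTU
Router(config-if)# end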
