The last 6 months of my life have been a living hell thanks to Cisco.  I was assigned a relatively simple project: configuring some CSR1000v virtual routers in Amazon Web Services for DMVPN spoke termination.  It has been miserable thanks to the unprecedented number of bugs in a product that’s over 5 years old.

Adding to the confusion, the primary issue is specific to the t2.medium instance type.  Since these are by far the cheapest instances and support up to 250 Mbps of throughput (more than adequate for most organizations), I suspect I’m not the first to encounter problems, but I am the first to find the root cause.

Problem #1 – They crash!

We launched the first batch of CSR1000vs in January 2019.  I had meant to tell the person launching them to use version 16.6, since that was the company standard for IOS-XE, but they defaulted to 16.9.1.  Everything seemed fine at first, as basic configuration and connectivity tests were successful.  But a few days after launching some test instances, I noticed the routers had rebooted unexpectedly.  Fortunately this was easy to diagnose, since there were crash files I could provide to TAC.  The bug was this:

CSCvk42631 – CSR1000v constantly crashes when processing packets larger than 9K

Since most instances launched in AWS have a 9001-byte MTU by default, going with this version was a non-starter.  No problem, 16.9.2 was already available, so that’s an easy fix…or so I thought.

Problem #2 – They stop passing traffic

I copied over 16.9.2 and 16.6.5 software images.  Initially I just booted 16.9.2 to verify they no longer crashed.  Welp, now we have a different problem.

The systems team said they had lost connectivity in the middle of the day.  I tried logging in to the CSR1000v via its external interface and could not access it.  However, I was able to access the peer router and then hop in via the private interface.  I saw that the external interface had “no ip address assigned”, which would indicate it had lost connectivity to the DHCP server and failed the renewal.
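For anyone chasing the same symptom, the checks that pointed at the failed renewal were nothing exotic; the commands below are the generic versions of what I ran from the peer router’s session:

Router#show ip interface brief
Router#show dhcp lease
Router#show logging | include DHCP

An up/up interface with no address, plus renewal failures in the log, matched exactly what I was seeing.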

Around this time I started scanning Reddit.  A user mentioned trying to roll out CSR1000vs on t2.mediums with DMVPN configured and having similar stability woes.  Cisco was of little use, but Amazon support blamed it on CPU micro-bursts likely caused by route changes.  Switching the instances from t2.medium to c4.large resolved the issue.  This really didn’t make sense to me, but I could see cases where bursts were triggering near-exhaustion of CPU credits.

I was now caught between a rock and a hard place, so I had to re-upgrade to 16.9.2 and try to work around the freeze-ups using IP SLA monitors and an EEM script to auto-reboot upon detection.
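For reference, the workaround looked roughly like this; the probe target, timers, and applet name below are placeholders rather than the exact values we deployed:

! Ping the VPC gateway (placeholder address) every 10 seconds
ip sla 10
 icmp-echo 10.13.22.1 source-interface GigabitEthernet1
 frequency 10
ip sla schedule 10 life forever start-time now
!
! Track reachability of the probe, waiting 60 seconds before declaring it down
track 10 ip sla 10 reachability
 delay down 60
!
! When the track goes down, log a message and reload the router
event manager applet FREEZE-RELOAD
 event track 10 state down
 action 1.0 syslog msg "External connectivity lost - reloading"
 action 2.0 reload

Crude, but better than waiting for someone to notice a dead router and reboot it by hand.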

Update April 26th 2019: We have a bug ID!  CSCvp48213

Update May 24th 2019: We have a root cause.  It’s a problem related to Generic Receive Offload (GRO) being enabled at the Linux kernel level.  A fix should be coming in June/July.

Problem #3 – Internal traffic is fine, but Internet traffic is painfully slow

After working through the above two problems, I finally got time to try upgrading to 16.10.1b.  Surprisingly, this fixed the stability problem.  No more crashes, no more freeze-ups.  But upon doing more detailed testing, we found a new issue – when using the CSR1000v for internet access (via NAT), speeds were horribly slow.  I would get 125 KBps at best despite having verified the 100 Mbps license was installed and “platform hardware throughput level” was configured and enabled.

ubuntu@ip-10-13-22-161:~$ curl ftp://ftp.freebsd.org/pub/FreeBSD/releases/i386/i386/ISO-IMAGES/11.2/FreeBSD-11.2-RELEASE-i386-bootonly.iso > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  261M  100  261M    0     0  30733      0  2:28:55  2:28:55 --:--:-- 92708

This very much seemed like a licensing issue, since 125 KBps = 1 Mbps, which is the speed set in evaluation mode.  But speeds going over an IPsec VPN tunnel were up to 50 Mbps, which made no sense.
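For completeness, this is roughly how the license side was verified; these are standard CSR1000v commands rather than anything specific to this problem:

! Configure the purchased throughput tier (100 Mbps)
Router(config)#platform hardware throughput level MB 100
! Confirm the throughput level actually in effect
Router#show platform hardware throughput level

Both checked out, which is why the 1 Mbps-like behavior was so confusing.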

I opened another TAC case and got nowhere.  Once again I was on my own, so I desperately started reading release notes and found this:

CSCvn52259 – CSR1Kv:Throughput/BW level not set correctly when upgrade to 16.10.1 .bin image with AWS/Azure PAYG (https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvn52259)

We’re using BYOL rather than PAYG, but this seemed very suspicious since I had indeed upgraded from 16.9.x to 16.10.1.  Sure enough, I did another upgrade to 16.11.1a and speeds jumped to 100 Mbps!

I suspect this has something to do with Smart Licensing, since it is mandatory as of 16.10.
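If you want to rule the licensing state in or out yourself, the standard status commands are below; nothing here is specific to this bug:

Router#show license status
Router#show license summary
Router#show platform hardware throughput level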

I started looking at CloudWatch for any clues, and this brought me back to the discussion around T2 instances and CPU bursting.

[Graph: t2.medium CPUCreditBalance from CloudWatch]

I have no idea why the CPU credits dropped 90%, but perhaps the throttling that kicks in when the balance is exhausted may explain it.  It’s easy to switch the CPU credits to T2 Unlimited, but even after a reboot there was no behavioral change.  Speeds were still horrifically slow.
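For anyone wanting to poke at the same thing, both the credit balance and the credit mode can be handled from the AWS CLI; the instance ID and time window below are placeholders:

# Pull the CPUCreditBalance metric in 5-minute datapoints
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2019-05-01T00:00:00Z \
  --end-time 2019-05-02T00:00:00Z \
  --period 300 \
  --statistics Average

# Switch the instance to T2 Unlimited so it can burst past its credit balance
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications InstanceId=i-0123456789abcdef0,CpuCredits=unlimited

As noted above, Unlimited didn’t change the behavior for me, but it at least removes credit exhaustion as a variable.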

Starting from scratch, I re-launched t2.medium, c4.large, and c5.large instances and applied a very simple config with just NAT and static routes, then applied a 100 Mbps throughput license.  Testing from an m2.micro instance running Ubuntu 16.04, these were the speed test results I got:

# Test download via CSR1000v on t2.medium
ubuntu@ip-10-13-22-161:~$ curl ftp://ftp.freebsd.org/pub/FreeBSD/releases/i386/i386/ISO-IMAGES/11.2/FreeBSD-11.2-RELEASE-i386-bootonly.iso > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  261M  100  261M    0     0  30733      0  2:28:55  2:28:55 --:--:-- 92708

# Test download via CSR1000v on c4.large
ubuntu@ip-10-13-22-161:~$ curl ftp://ftp.freebsd.org/pub/FreeBSD/releases/i386/i386/ISO-IMAGES/11.2/FreeBSD-11.2-RELEASE-i386-bootonly.iso > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  261M  100  261M    0     0  9993k      0  0:00:26  0:00:26 --:--:-- 11.2M

# Test download via CSR1000v on c5.large
ubuntu@ip-10-13-22-161:~$ curl ftp://ftp.freebsd.org/pub/FreeBSD/releases/i386/i386/ISO-IMAGES/11.2/FreeBSD-11.2-RELEASE-i386-bootonly.iso > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  261M  100  261M    0     0  10.4M      0  0:00:25  0:00:25 --:--:-- 11.2M

I now have yet another TAC case open to diagnose the speeds, but I have a feeling this will be another 4-6 week ordeal to get it fixed.  So for now I’ll simply be re-deploying the CSR1000vs on c5.large instances (c5 is actually a bit cheaper than c4, probably due to being on the KVM hypervisor vs. Xen).


For the last couple of months I’ve been rolling out Cisco CSR1000v virtual routers in Amazon Web Services as a way to connect virtual private clouds to the parent company’s Cisco DMVPN.  The first four routers were launched in January 2019 on software version 16.9.1.  I was a bit nervous using a bleeding-edge version of IOS-XE, but they went fine and had zero issues.  Since the throughput requirement is no more than 100 Mbps, t2.medium instances were selected, as they can supposedly do up to 250 Mbps.

We launched the second four in February, by which point 16.9.2 was the default version.  Everything seemed fine until DMVPN & EIGRP were configured.  A few days after doing so, two of the four routers running 16.9.2 had lost network connectivity on one of their interfaces.  Pings would fail; the ARP table was populated, but nothing was reachable.  A shut/no shut of the interface didn’t help.  AWS CloudWatch & CloudTrail didn’t show much, other than traffic on one of the router’s interfaces suddenly dropping from a small amount of steady traffic (200 Kbps or so) to absolute zero.  The only solution was to reboot the router.

Per usual, Cisco TAC was useless.  The routers had their IP addresses pulled via DHCP with the default 1-hour lease time, so after an outage, logs would show a DHCP renewal failure 1-59 minutes after the EIGRP neighbors dropped.  I switched to static IPs and saw the same behavior, but TAC continued to ask for DHCP debugging commands, seemingly unable to get it through their thick skulls that of course the DHCP renewal was failing…because there was no connectivity.  In frustration, I attempted to open an entirely new case for the routers running static IPs, which they promptly classified as a duplicate.
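The static configuration itself is nothing special; the address below is a placeholder for whatever private IP AWS assigned to the ENI, since in a VPC you have to use the assigned address rather than picking your own:

interface GigabitEthernet1
 ! Replace DHCP with the private IP AWS assigned to this ENI
 no ip address dhcp
 ip address 10.13.22.10 255.255.255.0

Same failure mode either way, which is the point TAC never seemed to absorb.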

Thankfully, browsing Reddit once again paid off.  A user mentioned trying to roll out CSR1000vs on t2.mediums with DMVPN configured and having nothing but problems.  When asked for details, he described not being able to keep them online for more than a few hours, with the root cause being CPU spikes caused by route changes.  Per Amazon’s advice, he rebuilt the instances as c4.larges, and they were 100% stable.

Since I could only replicate this behavior in 16.9.2 and not 16.9.1, I concluded that software played a role in this problem as well.  So I tried downgrading to 16.6.5, as that’s overall been the most stable version of IOS-XE I’ve worked with.  To my surprise, this made the problem much worse, and the routers would lose connectivity with as little as 10 minutes of uptime.  In desperation, I bumped them up to 16.7.3…and?  All was well.

It all clicked together when I read the IOS-XE 16.7 release notes:

Features—Cisco IOS XE Fuji 16.7.1

The following new software features are supported on the Cisco CSR 1000v for Cisco IOS XE Fuji 16.7.1.

  • Templates for vCPU allocation and distribution—including the Control Plane heavy template.
WELL THEN!!!!  Sure enough, while subtle, you can definitely see changes to CPU utilization when migrating from 16.6 to 16.7:

[Graph: CSR1000v CPU utilization, 16.6 vs. 16.7]

I’ve been running 16.7.3 on the t2.mediums for a full week now with absolutely no problems at all.

Summary: Cisco’s claim that a t2.medium in AWS can do up to 250 Mbps, while accurate, fails to take into account control plane limitations.

The crash was traced to bug CSCvk42631 (CSR1000v constantly crashes when processing packets larger than 9K).

 

It makes sense that this bug would be hit in any scenario where the CSR1000v is on the same subnet as an AWS instance, because by default the MTU of the instance interfaces is 9001 bytes.
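You can see that default from any Linux instance in the VPC; on the Ubuntu test box it looks something like this (interface name and MAC are placeholders):

ubuntu@ip-10-13-22-161:~$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0a:1b:2c:3d:4e:5f brd ff:ff:ff:ff:ff:ff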

As a workaround, Cisco lowered the MTU at the vNIC level in IOS-XE from 9238 bytes in 16.9.1 to 1500 bytes in 16.9.2:

Router#show platform software system all
VNIC Details
============
Name              Mac Address     Status  Platform MTU
GigabitEthernet1  12bf.4f69.eb56  UP      1500
GigabitEthernet2  1201.c32b.1d16  UP      1500

So this would cause the CSR1000v to fragment packets greater than 1500 bytes.
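The vNIC MTU change above is Cisco’s workaround, but if you want to keep tunneled TCP traffic from being fragmented in general, the usual belt-and-suspenders config is to clamp the tunnel MTU and MSS.  The values below are the typical ones for GRE over IPsec, not something I tuned for this deployment:

interface Tunnel100
 ! Leave headroom for GRE + IPsec overhead
 ip mtu 1400
 ! Clamp TCP MSS so end hosts never send segments that would need fragmenting
 ip tcp adjust-mss 1360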
