Cacti Device Templates for CheckPoint R81.10

I desperately need to get some graphs on connections for Checkpoint after being unable to activate the monitoring blade for a cloud deployment with a PAYG license. Good ol’ Cacti was the quickest way to do accomplish that.

Sample graphs:

Advertisement

Using GCP Ops Agent to view Squid Logs

The VMs were deployed via Terraform using instance templates, managed instance groups, and an internal TCP/UDP load balancer with a forwarding rule for port 3128. Debian 11 (Bullseye) was selected as the OS because it has a low memory footprint while still offering an nice pre-packaged version of Squid version 4.

The first problem is the older stackdriver agent isn’t compatible with Debian 11. So I had to install the newer one. I chose to just add these lines to my startup script, pulling the script directly from a bucket to avoid the requirement of Internet access:

gsutil cp gs://public-j5-org/add-google-cloud-ops-agent-repo.sh /tmp/
bash /tmp/add-google-cloud-ops-agent-repo.sh --also-install

After re-deploying the VMs, I ssh’d in and verified the Ops agent was installed and running:

sudo systemctl status google-cloud-ops-agent"*"

google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
     Active: active (running) since Fri 2023-02-10 22:18:17 UTC; 18min ago
    Process: 4317 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/>
   Main PID: 4350 (otelopscol)
      Tasks: 7 (limit: 1989)
     Memory: 45.7M
        CPU: 1.160s

After waiting a couple minutes, I still didn’t see anything, so I downloaded and ran their diagnostic script:

gsutil cp gs://public-j5-org/diagnose-agents.sh /tmp/ && bash /tmp/diagnose-agents.sh

This was confusing because while it didn’t show any errors, the actual log was dumped to disk in a sub-directory of /var/tmp/google-agents/. and did indicate a problem in the agent-info.txt file:

API Check - Result: FAIL, Error code: LogApiPermissionErr, Failure:
 Service account is missing the roles/logging.logWriter role., Solution: Add the roles/logging.logWriter role to the Google Cloud service account., Res
ource: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/authorization#create-service-account

And this made sense, because in order for Ops Agent to function, it needs these two IAM roles enabled for the service account:

  • Monitoring > Monitoring Metric Writer.
  • Logging > Logs Writer.

Here’s a Terraform snippet that will do that:

# Add required IAM permissions for Ops Agents
locals {
  roles = ["logging.logWriter", "monitoring.metricWriter"]
}
resource "google_project_iam_member" "default" {
  for_each = var.service_account_email != null ? toset(local.roles) : {}
  project  = var.project_id
  member   = "serviceAccount:${var.service_account_email}"
  role     = "roles/${each.value}"
}

Within a few minutes of adding these, data started showing up in the graphs.

EEM Script to Generate Show Tech & Auto Reboot a router

While working through my CSR1000v stability woes, I had the need to automatically generate a “show tech” and then reboot a router after an IP SLA failure was detected.  It seemed fairly easy but I could never get the show tech fully completed before the EMM script would stop running, and the reboot command never worked either.

Posting on Reddit paid off as user caught the problem: EEM scripts by default can only run for 20 seconds.  Since a “show tech” can take longer than this, the subsequent steps may never be processed.  The solution is increase the runtime to say 60 seconds to guarantee the show tech completes:

! Create and run IP SLA monitor to ping default gateway every 5 seconds
ip sla 1
 icmp-echo 10.0.0.1 source-interface GigabitEthernet1
 threshold 50
 timeout 250
 frequency 5
!
ip sla schedule 1 life forever start-time now
!
! Create track object that will mark down after 3 failures
track 1 ip sla 1
 delay down 15 up 30
!
! Create EMM script to take action when track state is down
event manager session cli username "ec2-user"
event manager applet GatewayDown authorization bypass
 event track 1 state down maxrun 60
  action 100 cli command "en"
  action 101 cli command "term len 0"
  action 110 syslog priority notifications msg "Interface Gi1 stopped passing traffic. Generating diag info"
  action 300 cli command "delete /force bootflash:sh_tech.txt"
  action 350 cli command "show tech-support | redirect bootflash:sh_tech.txt"
  action 400 syslog priority alerts msg "Show tech completed. Rebooting now!"
  action 450 wait 5
  action 500 reload

Monitoring CPU & Memory in IOS-XE

ios-xe_cpu

One important thing to understanding in IOS-XE is the different numbers that can be returned when checking CPU and memory statistics.  There’s some very down in the weeds docs on this, but the simplest way to break it down is process vs. platform.  Processes is essentially control plane, while platform is data plane.

CPU

Processor CPU

CLI command: show processes cpu

SNMP OIDs:

1.3.6.1.4.1.9.2.1.56.0 = 5 second
1.3.6.1.4.1.9.2.1.57.0 = 1 minute
1.3.6.1.4.1.9.2.1.58.0 = 5 minute

Platform CPU

CLI command: show processes cpu platform

SNMP OIDs:

1.3.6.1.4.1.9.9.109.1.1.1.1.3.7 = 5 second
1.3.6.1.4.1.9.9.109.1.1.1.1.4.7 = 1 minute
1.3.6.1.4.1.9.9.109.1.1.1.1.5.7 = 5 minute

Note – Most platforms will be multi-core.

Memory

Processor Memory

CLI command: show processes memory

SNMP OIDs:

1.3.6.1.4.1.9.9.48.1.1.1.5.1 = Memory Used
1.3.6.1.4.1.9.9.48.1.1.1.6.1 = Memory Free

Platform Memory

CLI command: show platform resources

SNMP OIDs:

1.3.6.1.4.1.9.9.109.1.1.1.1.12.7 = Memory Used
1.3.6.1.4.1.9.9.109.1.1.1.1.13.7 = Memory Free
1.3.6.1.4.1.9.9.109.1.1.1.1.27.7 = Memory Committed

Cacti Templates

These were written for Cacti 0.8.8f

https://spaces.hightail.com/space/FoUD1PvlXA

 

Cacti 1.0 to 1.1 upgrade: MySQL TimeZone Database is not populated

Give the cacti user permission to read the internal MySQL table for time zone names:

[j5@linux ~]$ mysql -u root -p mysql
mysql> grant select on mysql.time_zone_name to cactiuser@'%';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> quit

To populate MySQL with some Timezone information:

[j5@linux ~]$ mysql -u root -p mysql < /usr/share/mysql/mysql_test_data_timezone.sql 
Enter password:

Now there’s at least some stuff there:

mysql> select * from time_zone_name;
+--------------------+--------------+
| Name | Time_zone_id |
+--------------------+--------------+
| MET | 1 |
| UTC | 2 |
| Universal | 2 |
| Europe/Moscow | 3 |
| leap/Europe/Moscow | 4 |
| Japan | 5 |
+--------------------+--------------+
6 rows in set (0.00 sec)

 

Cacti: MySQL table is marked as crashed and should be repaired

Had to do several reboots of the Cacti VM tonight to do some NFS mount fixes, and noticed graphs weren’t updating and the device list was returning zero rows.  Immediately my thought was database, and this was confirmed in cacti.log

2017-09-13 22:00:00 - DBCALL ERROR: SQL Assoc Failed!, Error:145, SQL:"SELECT status, COUNT(*) as cnt FROM `host` GROUP BY status"
2017-09-13 22:00:00 - DBCALL ERROR: SQL Assoc Failed!, Error: Table './cacti/host' is marked as crashed and should be repaired

Also in var/log/mysqld.log:

170913 22:03:00 [ERROR] /usr/libexec/mysqld: Table './cacti/host' is marked as crashed and should be repaired

This blog pointed me to the easy fix:

mysqlcheck -u cactiuser -p --auto-repair --databases cacti