Migrating a CheckPoint Management Server in GCP from R80.40 to R81.10

Here’s an outline of the process:

  • Launch a new R81.10 VM and create /var/log/mdss.json with the hostname and new IP address
  • On the old R80.40 VM, perform an export (this will result in services being stopped for ~ 15 minutes)
  • On the new R81.10 VM, perform an import. This will take about 30 minutes
  • If using BYOL, re-issue the license with the new IP address

Performing Export on old R80.40 Server

On the old R80.40 server, in GAIA, navigate to Maintenance -> System Backups. If not done already, run a backup. This will give a rough idea of how long the export job will take and the approximate file size including logs.

For me, the export size can be assumed to be just under 1.2 GB. Then go to the CLI and enter expert mode. First, run migrate_server verify:

expert

cd $FWDIR/scripts

./migrate_server verify -v R81.10
The verify operation finished successfully.

Now actually do the export. Mine took about 15 minutes and resulted in a 1.1 GB file when including logs.

./migrate_server export -v R81.10 -l /var/log/export.tgz

The export operation will eventually stop all Check Point services (cpstop; cpwd_admin kill). Do you want to continue (yes/no) [n]? yes

Exporting the Management Database
Operation started at Thu Jan  5 16:20:33 UTC 2023

[==================================================] 100% Done

The export operation completed successfully. Do you wish to start Check Point services (yes/no) [y]? y
Starting Check Point services ...
The export operation finished successfully. 
Exported data to: /var/log/export.tgz.

Then copy the image to something offsite using SCP or SFTP.

ls -la /var/log/export.tgz 
-rw-rw---- 1 admin root 1125166179 Jan  5 17:36 /var/log/export.tgz

scp /var/log/export.tgz billy@10.1.2.6:

Setting up the new R81.10 Server

After launching the VM, SSH in and set an admin user password and expert mode password. Then save config:

set user admin password

set expert-password

save config

Login to the Web GUI and start the setup wizard. This is pretty much just clicking through a bunch of “Next” buttons. It is recommended to enable NTP though, and to uncheck “Gateway” if this is a management-only server.

When the setup wizard has concluded, download and install SmartConsole, then the latest Hotfix.

Once rebooted, login via CLI, go to expert mode, and create a /var/log/mdss.json file that has the name of the Management server (as it appears in SmartConsole) and the new server’s internal IP address. Mine looks like this:

[{"name":"checkpoint-mgr","newIpAddress4":"10.22.33.44"}]

It’s not a bad idea to paste this in to a JSON Validator to ensure the syntax is proper. Also note the square outer brackets, even though there’s only one entry in the array.
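As an alternative to an online validator, here is a minimal Python sketch that builds and checks the file contents. The function names build_mdss_entries and validate_mdss are hypothetical helpers of mine, and the checks only cover the two keys shown in this post, not everything migrate_server may accept.

```python
import ipaddress
import json


def build_mdss_entries(name, new_ipv4):
    """Return the structure used in this post: a list with one object per server."""
    return [{"name": name, "newIpAddress4": new_ipv4}]


def validate_mdss(text):
    """Parse mdss.json text and check the shape shown in this post.

    Assumption: only the "name" and "newIpAddress4" keys are checked.
    """
    data = json.loads(text)  # raises json.JSONDecodeError on bad syntax
    if not isinstance(data, list):
        raise ValueError("top level must be a JSON array (square brackets)")
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError("each array element must be a JSON object")
        for key in ("name", "newIpAddress4"):
            if key not in entry:
                raise ValueError(f"missing key: {key}")
        ipaddress.IPv4Address(entry["newIpAddress4"])  # raises ValueError if invalid
    return data


text = json.dumps(
    build_mdss_entries("checkpoint-mgr", "10.22.33.44"),
    separators=(",", ":"),
)
print(text)
validate_mdss(text)
```

Generating the file with json.dumps() rather than typing it by hand also avoids the stray quote or comma problem entirely.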

Importing the Database

Now we’re ready to copy the exported file from the R80.40 server. /var/log typically has the most room, so that’s a good location. Then run the import command. For me, this took around 20-30 minutes.

scp billy@10.1.2.6:export.tgz /var/log/

cd $FWDIR/scripts
./migrate_server import -v R81.10 -l /var/log/export.tgz

Importing the Management Database
Operation started at Thu Jan  5 16:51:22 GMT 2023

The import operation finished successfully.

If a “Failed to import” message appears, check the /var/log/mdss.json file again. Make sure the brackets, quotes, commas, and colons are in the proper place.

After giving the new server a reboot for good measure, login to CLI and verify services are up and running. Note it takes 2-3 minutes for the services to be fully running:

cd $FWDIR/scripts
./cpm_status.sh 
Check Point Security Management Server is during initialization

./cpm_status.sh 
Check Point Security Management Server is running and ready
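Rather than re-running the script by hand, the wait can be sketched as a small polling loop. This is only an illustration: wait_until_ready is a hypothetical helper, the status source is passed in as a function (on a real server it might wrap a call to cpm_status.sh), and the timeout and interval defaults are my own guesses rather than Check Point recommendations.

```python
import time

READY_TEXT = "running and ready"  # substring of the "ready" message shown above


def wait_until_ready(get_status, timeout_s=300, interval_s=10, sleep=time.sleep):
    """Poll get_status() until its output contains READY_TEXT.

    Returns True when ready, False if timeout_s elapses first.
    """
    waited = 0
    while True:
        if READY_TEXT in get_status():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Demo with canned output instead of a real server.
outputs = iter([
    "Check Point Security Management Server is during initialization",
    "Check Point Security Management Server is running and ready",
])
result = wait_until_ready(lambda: next(outputs), sleep=lambda s: None)
print(result)
```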

I then tried to login via R81.10 SmartConsole and got this message:

This is expected. The /var/log/mdss.json only manages the connection to the gateways, it doesn’t have anything to do with licensing for the management server itself. And, I would guess that doing the import results in the 14 day trial license being overridden. Just to confirm that theory, I launched a PAYG VM, re-did the migration, and no longer saw this error.

Updating the Management Server License

Login to User Center -> Assets/Info -> Product Center, locate the license, change the IP address, and install the new license. Since SmartConsole won’t load, this must be done via CLI.

cplic put 10.22.33.44 never XXXXXXX

I then rebooted and waited 2-3 minutes for services to fully start. At this point, I was able to login to SmartConsole and see the gateways, but they all showed red. This is also expected: to make them green, policy must be installed.

I first did a database install for the management server itself (Menu -> Install Database), which was successful. Then tried a policy install on the gateways and got a surprise – the policy push failed, complaining of

From the Management Server, I tried a basic telnet test for port 18191 and it did indeed fail:

telnet 10.22.33.121 18191
Trying 10.22.33.121..
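If telnet isn’t installed, roughly the same test can be done with a few lines of Python. This is a sketch, not a diagnostic tool: probe_tcp is a hypothetical helper, and the three result strings are only rough hints (a refusal usually means the host answered, a timeout often means packets were dropped somewhere along the way).

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call for the gateway in this post (not executed here):
# print(probe_tcp("10.22.33.121", 18191))
```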

At first I thought the issue was firewall rules, but concluded that the port 18191 traffic was reaching the gateway but being rejected, which indicates a SIC issue. Sure enough, a quick Google pointed me to this:

Policy installation fails with “TCP connection failure port=18191”

Indeed, the CheckPoint deployment template for GCP uses “member-a” and “member-b” as the hostname suffix for the gateways, but we give them a slightly different name in order to be consistent with our internal naming scheme.

The fix is to change the hostname in the CLI to match the gateway name configured in SmartConsole:

cp-cluster-member-a> set hostname newhostname
cp-cluster-member-01> set domainname mydomain.org
cp-cluster-member-01> save config

After that, the telnet test to port 18191 was successful, and SmartConsole indicated some communication:

Now I have to reset SIC on both gateways:

cp-cluster-member-01> cpconfig
This program will let you re-configure
your Check Point products configuration.

Configuration Options:
----------------------
(1)  Licenses and contracts
(2)  SNMP Extension
(3)  PKCS#11 Token
(4)  Random Pool
(5)  Secure Internal Communication
(6)  Disable cluster membership for this gateway
(7)  Enable Check Point Per Virtual System State
(8)  Enable Check Point ClusterXL for Bridge Active/Standby
(9)  Hyper-Threading
(10) Check Point CoreXL
(11) Automatic start of Check Point Products

(12) Exit

Enter your choice (1-12) :5



Configuring Secure Internal Communication...
============================================
The Secure Internal Communication is used for authentication between
Check Point components

Trust State: Trust established

 Would you like re-initialize communication? (y/n) [n] ? y

Note: The Secure Internal Communication will be reset now,
and all Check Point Services will be stopped (cpstop).
No communication will be possible until you reset and
re-initialize the communication properly!
Are you sure? (y/n) [n] ? y
Enter Activation Key: 
Retype Activation Key: 
initial_module:
Compiled OK.
initial_module:
Compiled OK.

Hardening OS Security: Initial policy will be applied
until the first policy is installed

The Secure Internal Communication was successfully initialized

Configuration Options:
----------------------
(1)  Licenses and contracts
(2)  SNMP Extension
(3)  PKCS#11 Token
(4)  Random Pool
(5)  Secure Internal Communication
(6)  Disable cluster membership for this gateway
(7)  Enable Check Point Per Virtual System State
(8)  Enable Check Point ClusterXL for Bridge Active/Standby
(9)  Hyper-Threading
(10) Check Point CoreXL
(11) Automatic start of Check Point Products

(12) Exit

Enter your choice (1-12) :12

Thank You...
cpwd_admin: 
Process AUTOUPDATER terminated 
cpwd_admin: 
Process DASERVICE terminated 

The services will restart, which triggers a failover. At this point, I went in to SmartConsole, edited the member, reset SIC, re-entered the key, and initialized. The policy pushes were then successful and everything was green. The last remaining issue was an older R80.30 cluster complaining of the IDS module not responding. This resolved itself the next day.

Re-sizing the Disk of a CheckPoint R80.40 Management Server in GCP

Breaking down the problem

As we enter the last year of support for CheckPoint R80.40, it’s time to finally get all management servers upgraded to R81.10 (if not done already). But I ran in to a problem when creating a snapshot on our management server in GCP:

This screen didn’t quite make sense because it says 6.69 GB are free, but the root partition actually shows 4.4 GB:

[Expert@chkpt-mgr:0]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/vg_splat-lv_current  20961280 16551092   4410188  79% /
/dev/sda1                          297485    27216    254909  10% /boot
tmpfs                             7572656     3856   7568800   1% /dev/shm
/dev/mapper/vg_splat-lv_log      45066752 27846176  17220576  62% /var/log

As it turns out, the 6 GB mentioned is completely un-partitioned space set aside for GAIA internals:

[Expert@chkpt-mgr:0]# lvm_manager -l

Select action:

1) View LVM storage overview
2) Resize lv_current/lv_log Logical Volume
3) Quit
Select action: 1

LVM overview
============
                  Size(GB)   Used(GB)   Configurable    Description         
    lv_current    20         16         yes             Check Point OS and products
    lv_log        43         27         yes             Logs volume         
    upgrade       22         N/A        no              Reserved for version upgrade
    swap          8          N/A        no              Swap volume size    
    free          6          N/A        no              Unused space        
    -------       ----                                                      
    total         99         N/A        no              Total size  

This explains why the disk space is always inadequate – 20 GB for root, 43 GB for log, 22 GB for “upgrade” (which can’t be used in GCP), 8 GB swap, and the remaining 6 GB set aside for snapshots (which is too small to be of use).
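To make the breakdown concrete, here is the arithmetic as a short Python sketch using the figures from the lvm_manager output above. The variable names are mine.

```python
# Sizes in GB, copied from the lvm_manager overview above.
volumes_gb = {
    "lv_current": 20,  # root
    "lv_log": 43,
    "upgrade": 22,     # reserved, not usable in GCP
    "swap": 8,
    "free": 6,         # all that is left for snapshots
}

total_gb = sum(volumes_gb.values())
usable_gb = volumes_gb["lv_current"] + volumes_gb["lv_log"]
root_free_gb = volumes_gb["lv_current"] - 16  # 16 GB used on lv_current

print(total_gb)      # matches the "total" row
print(usable_gb)     # space that actually holds the OS and logs
print(root_free_gb)  # roughly what df shows as available on /
```

So out of 99 GB, only 63 GB is doing useful work, and the root partition has about 4 GB left.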

To create enough space for a snapshot we have only one solution: expand the disk size.

List of Steps

After first taking a Disk Snapshot of the disk in GCP, I followed these steps:

! On VM, in expert mode:
rm /etc/autogrow
shutdown -h now

! Use gcloud to increase disk size to 160 GB
gcloud compute disks resize my-vm-name --size 160 --zone us-central1-c

! Start VM up again
gcloud compute instances start my-vm-name --zone us-central1-c

After bootup, I ran parted -l and verified partition #4 had been added:

[Expert@ckpt:0]# parted -l

Model: Google PersistentDisk (scsi)
Disk /dev/sda: 172GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name       Flags
 1      17.4kB  315MB   315MB   ext3                       boot
 2      315MB   8902MB  8587MB  linux-swap(v1)
 3      8902MB  107GB   98.5GB                             lvm
 4      107GB   172GB   64.4GB                  Linux LVM  lvm


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_splat-lv_log: 46.2GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags: 

Number  Start  End     Size    File system  Flags
 1      0.00B  46.2GB  46.2GB  xfs


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_splat-lv_current: 21.5GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags: 

Number  Start  End     Size    File system  Flags
 1      0.00B  21.5GB  21.5GB  xfs
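In case the 172GB figure looks odd after resizing to 160: GCP disk sizes are binary gigabytes (GiB), while parted reports decimal GB. A quick Python check of the numbers follows. The 100 GiB original disk size is my assumption, inferred from partition 3 ending at 107GB.

```python
GIB = 2**30  # bytes in one binary gigabyte


def gib_to_gb(gib):
    """Convert binary gigabytes (GiB) to decimal gigabytes (GB)."""
    return gib * GIB / 10**9


new_disk_gb = gib_to_gb(160)       # size passed to gcloud
old_disk_gb = gib_to_gb(100)       # assumed original disk size
added_gb = gib_to_gb(160 - 100)    # space that became partition 4

print(round(new_disk_gb))   # 172, the /dev/sda size in parted
print(round(old_disk_gb))   # 107, where partition 3 ends
print(round(added_gb, 1))   # 64.4, the size of partition 4
```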

Then converted the partition to an empty volume and gave it to GAIA:

pvcreate /dev/sda4 -ff
vgextend vg_splat /dev/sda4

After all this, lvm_manager shows the free disk space is being seen:

[Expert@ckpt:0]# lvm_manager

Select action:

1) View LVM storage overview
2) Resize lv_current/lv_log Logical Volume
3) Quit

Select action: 1

LVM overview
============
                  Size(GB)   Used(GB)   Configurable    Description         
    lv_current    20         8          yes             Check Point OS and products
    lv_log        43         4          yes             Logs volume         
    upgrade       22         N/A        no              Reserved for version upgrade
    swap          8          N/A        no              Swap volume size    
    free          126        N/A        no              Unused space        
    -------       ----                                                      
    total         219        N/A        no              Total size 

Creating a snapshot in GAIA is no longer a problem:

Sorting IP addresses in Python

Trying to sort by IP addresses using their regular string values doesn’t work very well:

sorted(['192.168.1.1','192.168.2.1','192.168.11.1','192.168.12.1'])
['192.168.1.1', '192.168.11.1', '192.168.12.1', '192.168.2.1']

Good news is the solution isn’t difficult. Just use ipaddress.ip_address() inside a lambda key function and convert the result to an integer:

import ipaddress

ips = ['192.168.1.1','192.168.2.1','192.168.11.1','192.168.12.1']

sorted(ips, key=lambda i: int(ipaddress.ip_address(i)))

Results in the following:

['192.168.1.1', '192.168.2.1', '192.168.11.1', '192.168.12.1']

In a more complex example where the data is in a list of dictionaries:

import ipaddress

ips = [
  {'address': "192.168.0.1"},
  {'address': "100.64.0.1"},
  {'address': "10.0.0.1"},
  {'address': "198.18.0.1"},
  {'address': "172.16.0.1"},
]

addresses = sorted(ips, key=lambda x: int(ipaddress.ip_address(x['address'])))

print(addresses)

Results in the following:

[{'address': '10.0.0.1'}, {'address': '100.64.0.1'}, {'address': '172.16.0.1'}, {'address': '192.168.0.1'}, {'address': '198.18.0.1'}]

To get the IPs with the highest first, just add reverse=True to the sorted() call.
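For completeness, here is a short sketch of the reverse sort, plus a variation that skips the int() conversion. Address objects from the ipaddress module are directly comparable, so the function itself can serve as the key.

```python
import ipaddress

ips = ['192.168.1.1', '192.168.2.1', '192.168.11.1', '192.168.12.1']

# Highest address first, using the same integer key as above.
descending = sorted(ips, key=lambda i: int(ipaddress.ip_address(i)), reverse=True)

# Variation: address objects compare numerically, so no lambda is needed.
ascending = sorted(ips, key=ipaddress.ip_address)

print(descending)
print(ascending)
```

One caveat with the variation: comparing an IPv4 address object with an IPv6 one raises a TypeError, so it only suits lists that contain a single address family.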

Rancid: no matching key exchange method found. Their offer: diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1

Time to move Rancid to a newer VM again, this time it’s Ubuntu 20. Hit a snag when I tried a test clogin run:

$ clogin myrouter
Unable to negotiate with 1.2.3.4 port 22: no matching key exchange method found.  Their offer: diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1

OpenSSH removed SHA-1 from the defaults a while back, which makes sense since the migration to SHA-2 began several years ago. So looks like SSH is trying to use SHA-2 but the Cisco Router is defaulting to SHA-1, and something has to give in order for negotiation to succeed.

My first thought was to tell the Cisco router to use SHA-2, and this is possible for the MAC setting:

Router(config)#ip ssh server algorithm mac ?
  hmac-sha1      HMAC-SHA1 (digest length = key length = 160 bits)
  hmac-sha1-96   HMAC-SHA1-96 (digest length = 96 bits, key length = 160 bits)
  hmac-sha2-256  HMAC-SHA2-256 (digest length = 256 bits, key length = 256 bits)
  hmac-sha2-512  HMAC-SHA2-512 (digest length = 512 bits, key length = 512 bits)

Router(config)#ip ssh server algorithm mac hmac-sha2-256 hmac-sha2-512
Router(config)#do sh ip ssh | inc MAC       
MAC Algorithms:hmac-sha2-256,hmac-sha2-512

But not for key exchange, which apparently only supports SHA-1:

Router(config)#ip ssh server algorithm kex ?
  diffie-hellman-group-exchange-sha1  DH_GRPX_SHA1 diffie-hellman key exchange algorithm
  diffie-hellman-group14-sha1         DH_GRP14_SHA1 diffie-hellman key exchange algorithm

Thus, the only option is to change the setting on the client. SSH has CLI options for Cipher and Mac:

-c : sets cipher (encryption) list.

-m: sets mac (authentication) list

But Key Exchange has no dedicated flag. It can be set with the generic -o option, or in the client configuration file (/etc/ssh/ssh_config, not the server’s sshd_config) with this line:

KexAlgorithms +diffie-hellman-group14-sha1

I wanted to change the setting only for Rancid and not SSH in general, hoping that Cisco adds SHA-2 key exchange soon. I found out it is possible to set SSH options in the .cloginrc file. The solution is this:

add  sshcmd  *  {ssh\  -o\ KexAlgorithms=+diffie-hellman-group14-sha1}

Clogin is now successful:

$ clogin myrouter
spawn ssh -oKexAlgorithms=+diffie-hellman-group14-sha1 -c aes128-ctr,aes128-cbc,3des-cbc -x -l myusername myrouter
Password:
Router#_

By the way, I stayed away from diffie-hellman-group-exchange-sha1 as it’s considered insecure, whereas diffie-hellman-group14-sha1 is deprecated but still widely deployed and still “strong enough”, probably thanks to its 2048-bit key length.
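The selection logic can be sketched in Python: parse the “Their offer” list out of the error message and pick only from algorithms deemed acceptable. The helper names and the ALLOWED list are hypothetical. They encode my preference above and are not any kind of OpenSSH policy.

```python
import re

# Legacy algorithms I am willing to re-enable, in order of preference.
# diffie-hellman-group-exchange-sha1 is deliberately left out.
ALLOWED = ["diffie-hellman-group14-sha1"]


def parse_offer(error_line):
    """Extract the server's offered KEX algorithms from the ssh error text."""
    match = re.search(r"Their offer:\s*(\S+)", error_line)
    return match.group(1).split(",") if match else []


def kex_option(error_line, allowed=ALLOWED):
    """Return an ssh -o option for the first allowed algorithm offered, else None."""
    offered = parse_offer(error_line)
    for algorithm in allowed:
        if algorithm in offered:
            return f"-oKexAlgorithms=+{algorithm}"
    return None


error = (
    "Unable to negotiate with 1.2.3.4 port 22: no matching key exchange method "
    "found.  Their offer: diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1"
)
option = kex_option(error)
print(option)
```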

Sidenote: this only affects Cisco IOS-XE devices. The Cisco ASA ships with this in the default configuration:

ssh key-exchange group dh-group14-sha256

CheckPoint Dedicated Management Route

New feature (finally!) in R80.30 is the ability to enable Management Data Plane Separation, which provides a separate route table for the management interface and all management-related functions (policy installation, SSH, SNMP, syslog, GAIA portal, etc).

Let’s assume the interface “Mgmt” has already been set as the management interface with IP address 192.168.1.100 and wants default gateway 192.168.1.1, and “eth5” has been setup as the dedicated sync interface:

set mdps mgmt plane on
set mdps mgmt resource on
set mdps interface Mgmt management on
set mdps interface eth5 sync on
add mdps route 0.0.0.0/0 nexthop 192.168.1.1
save config
reboot

After the box comes up you can verify the management route has been set by going in to expert mode and running the “mplane” command to enter management space:

> expert
[Expert@MyCheckPoint:0]# mplane
Context set to Management Plane
[Expert@MyCheckPoint:1]# netstat -rn
Kernel IP routing table
Destination  Gateway       Genmask         Flags MSS Window irtt Iface
169.254.0.0  0.0.0.0       255.255.255.252 U     0   0      0    eth5
192.168.1.0  0.0.0.0       255.255.255.0   U     0   0      0    Mgmt
0.0.0.0      192.168.1.1   0.0.0.0         UGD   0   0      0    Mgmt

Routes from the main route table relating to management can then be deleted, which makes the data plane route table much cleaner:

[Expert@MyCheckpoint:1]# dplane
Context set to Data Plane

[Expert@MyCheckPoint:0]# netstat -rn
Kernel IP routing table
Destination   Gateway       Genmask         Flags MSS Window irtt Iface
203.0.113.32  0.0.0.0       255.255.255.224 U     0   0      0    bond1.11
192.168.222.0 0.0.0.0       255.255.255.0   U     0   0      0    bond1.22
0.0.0.0       203.0.113.33  0.0.0.0         UGD   0   0      0    bond1.11
192.168.0.0   192.168.222.1 255.255.0.0     UGD   0   0      0    bond1.22
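If this check needs to be repeated across several gateways, the default route can be pulled out of saved netstat -rn output with a few lines of Python. This is a sketch under the assumption that the output has the eight-column layout shown above. default_route is a hypothetical helper.

```python
def default_route(netstat_output):
    """Return (gateway, interface) for the 0.0.0.0/0 route, or None if absent."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: Destination Gateway Genmask Flags MSS Window irtt Iface
        if len(fields) >= 8 and fields[0] == "0.0.0.0" and fields[2] == "0.0.0.0":
            return fields[1], fields[-1]
    return None


# Management plane output copied from the mplane example above.
mgmt_plane = """Kernel IP routing table
Destination  Gateway       Genmask         Flags MSS Window irtt Iface
169.254.0.0  0.0.0.0       255.255.255.252 U     0   0      0    eth5
192.168.1.0  0.0.0.0       255.255.255.0   U     0   0      0    Mgmt
0.0.0.0      192.168.1.1   0.0.0.0         UGD   0   0      0    Mgmt
"""

route = default_route(mgmt_plane)
print(route)
```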

Upgrading Checkpoint Management Server in AWS from R80.20 to R80.30

Unfortunately it is not possible to simply upgrade an existing CheckPoint management server in AWS.  A new one must be built, with the database manually exported from the old instance and imported to the new one.

There is a CheckPoint Knowledge base article, but I found it to have several errors and also be confusing on which version of tools should be used.

Below is the process I used to go from R80.20 to R80.30

Login to the old R80.20 server.  Download and extract the R80.30 tools:

cd /home/admin
tar -zxvf Check_Point_R80.30_Gaia_SecurePlatform_above_R75.40_Migration_tools.tgz

Run the export job to create an archive of the database:

./migrate export --exclude-licenses /tmp/R8020Backup.tgz

Copy this .tgz file to the new R80.30 management server in /tmp

On the new management server, run the import job:

cd $FWDIR/bin/upgrade_tools
./migrate import /tmp/R8020Backup.tgz 
The import operation will eventually stop all Check Point services (cpstop)
Do you want to continue? (y/n) [n]? y

After a few minutes, the operation will complete and you’ll be prompted to start services again.

Finish by upgrading SmartConsole to R80.30 and connect to the new R80.30 server.  I’ve noticed it to be very slow, but it will eventually connect and all the old gateways and policies will be there.

Cisco IOS-XE SCP Server with RADIUS authentication

I’ve been wanting to try out SCP to copy IOS images to routers for a while, as I figured it would be faster and cleaner than FTP/TFTP.  There are essentially four tricks to getting it working:

  1. Having the correct AAA permissions
  2. Understanding the SCP syntax and file systems
  3. Making the scp command from the router VRF aware, if required
  4. 16.6.7 or 16.9.4 or newer code.  Performance on older IOS-XE versions is terrible

First, SSH has to be enabled and of course the SCP server must be activated

ip ssh version 2
ip scp server enable

After doing so, verify the router is accessible via SSH.  If not, try generating a fresh key:

Router(config)#crypto key generate rsa modulus 2048

Now on to the AAA configuration.  The important step is to have accounts automatically go to their privilege level 15 without manually entering enable mode.  This is done with the “aaa authorization exec” command:

aaa new-model
!
username admin privilege 15 password 7 XXXXXXX
!
aaa group server radius MyRadiusServer
 server-private 10.1.1.100 auth-port 1812 acct-port 1813 key 7 XXXXXXXX
 ip vrf forwarding MyVRF
!
aaa authentication login default local group MyRadiusServer
aaa authentication enable default none
aaa authorization config-commands
aaa authorization exec default local group MyRadiusServer if-authenticated

The RADIUS server will also need this vendor-specific attribute in the policy:

Vendor: Cisco
Name: Cisco-AV-Pair
Value: priv-lvl=15

If I SSH to the router using a RADIUS account, I should automatically see enable mode:

$ ssh billy@10.1.1.1
Password: 
Router#show privilege
Current privilege level is 15

I can now upload IOS images to a router with IP address 10.1.1.1 like this:

scp csr1000v-universalk9.16.06.06.SPA.bin billy@10.1.1.1:bootflash:/csr1000v-universalk9.16.06.06.SPA.bin
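The destination syntax is the part that trips people up: user@host, a colon, the IOS file system name (which itself ends in a colon), then a slash and the file name. A small Python sketch that assembles the argument list for several routers might look like this. scp_upload_args is a hypothetical helper, and it only builds the command, it doesn’t run it.

```python
import posixpath


def scp_upload_args(image_path, user, router, filesystem="bootflash:"):
    """Build the scp argument list for uploading an image to an IOS-XE router."""
    filename = posixpath.basename(image_path)
    return ["scp", image_path, f"{user}@{router}:{filesystem}/{filename}"]


args = scp_upload_args("csr1000v-universalk9.16.06.06.SPA.bin", "billy", "10.1.1.1")
print(" ".join(args))
```

The list form can be handed to subprocess.run() as-is, which avoids any shell quoting issues with the colons.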

If copying images from the router where the egress interface is on a VRF, the source interface must be specified:

ip ssh source-interface GigabitEthernet0

And simply use the IOS copy command:

copy scp://billy@10.1.1.2:/csr1000v-universalk9.16.06.06.SPA.bin bootflash:

Note that scp’s performance in IOS-XE 16.6.5 was very poor, but excellent in 16.6.7 and 16.9.4.

EEM Script to Generate Show Tech & Auto Reboot a router

While working through my CSR1000v stability woes, I had the need to automatically generate a “show tech” and then reboot a router after an IP SLA failure was detected.  It seemed fairly easy but I could never get the show tech fully completed before the EEM script would stop running, and the reboot command never worked either.

Posting on Reddit paid off as a user caught the problem: EEM scripts by default can only run for 20 seconds.  Since a “show tech” can take longer than this, the subsequent steps may never be processed.  The solution is to increase the runtime to, say, 60 seconds to guarantee the show tech completes:

! Create and run IP SLA monitor to ping default gateway every 5 seconds
ip sla 1
 icmp-echo 10.0.0.1 source-interface GigabitEthernet1
 threshold 50
 timeout 250
 frequency 5
!
ip sla schedule 1 life forever start-time now
!
! Create track object that will mark down after 3 failures
track 1 ip sla 1
 delay down 15 up 30
!
! Create EEM script to take action when track state is down
event manager session cli username "ec2-user"
event manager applet GatewayDown authorization bypass
 event track 1 state down maxrun 60
  action 100 cli command "en"
  action 101 cli command "term len 0"
  action 110 syslog priority notifications msg "Interface Gi1 stopped passing traffic. Generating diag info"
  action 300 cli command "delete /force bootflash:sh_tech.txt"
  action 350 cli command "show tech-support | redirect bootflash:sh_tech.txt"
  action 400 syslog priority alerts msg "Show tech completed. Rebooting now!"
  action 450 wait 5
  action 500 reload
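The timers in this config interact, so here is the arithmetic as a short Python sketch. The 45 second runtime is a made-up figure for illustration only. Measure how long “show tech” actually takes on your platform before picking a maxrun value.

```python
# Values taken from the IP SLA, track, and applet configuration above.
frequency_s = 5        # ip sla "frequency 5"
delay_down_s = 15      # track "delay down 15"
default_maxrun_s = 20  # EEM default runtime mentioned above
maxrun_s = 60          # "maxrun 60" on the event line

# Number of probe intervals that fit inside the down delay.
probes_before_down = delay_down_s // frequency_s


def fits_in_maxrun(estimated_runtime_s, limit_s):
    """True if the applet's estimated runtime is within the EEM limit."""
    return estimated_runtime_s <= limit_s


# Hypothetical: assume show tech plus the 5 second wait needs 45 seconds.
estimated_s = 45

print(probes_before_down)
print(fits_in_maxrun(estimated_s, default_maxrun_s))
print(fits_in_maxrun(estimated_s, maxrun_s))
```

With these assumed numbers the applet would be cut off under the 20 second default but finishes comfortably with maxrun 60.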

Cisco ASA: Forcing local authentication for serial console

One of the root problems of administrative access to the ASA platform is there’s no easy way to bypass a broken AAA server.

Cisco IOS has this:

aaa authentication enable default group radius none

But the ASA equivalent has no “none” option, so most people will configure this:

aaa authentication enable console RADIUS LOCAL

Now the problem here is if the user authenticates locally and the Radius server is still marked “up”, they’ll be forced to authenticate through it for enable mode.  This creates two problematic scenarios:

  1. The Radius server is reachable, but the username does not exist
  2. The Radius server is marked up but is actually unreachable, misconfigured, or horked in some way

The latter case occurred during our last two ASA outages.  It was especially frustrating because I had configured serial consoles to both ASAs, only to be unable to get to enable mode to force a reboot/failover and recover from the outage without having to drive to the data center.

A reddit user pointed me to this command:

aaa authorization exec LOCAL auto-enable

Which should in theory force accounts using local authentication to bypass the enable prompt assuming they’re set to priv 15.  But after having no luck with it and escalating through Cisco I discovered this command does not work with serial console logins.  So, I was back to square one.

The solution I settled on was to simply force local for both serial console authentication and enable mode:

aaa authentication serial console LOCAL
aaa authentication enable console LOCAL

Unfortunately the catch-22 revealed itself again, as this broke enable mode for Radius users, since they did not have local accounts.  So I added these lines to try and bypass enable for Radius users:

aaa authentication ssh console RADIUS LOCAL
aaa authorization exec authentication-server auto-enable

Now I see them passing authentication on the Radius server, but the ASA rejecting them with this error:

%ASA-3-113021: Attempted console login failed user 'bob' did NOT have appropriate Admin Rights.

I had already configured priv-lvl=15 in the Radius server’s policy, so not sure what else it could need.  Turns out it also needed this attribute set:

Service-Type: Administrative

After this, everything is happy.  SSH users get auto-enabled via RADIUS and can still fall back to local (in theory) if the server is down.  But if that’s broken, I can console in with a local username/password and enter enable mode.