Google Cloud Internal HTTP(S) Load Balancers now have global access support

Previously, the Envoy-based Internal HTTP(S) load balancers could only be accessed from within the same region. For orgs that leverage multiple regions and send traffic across them, this limitation was a real pain point, and one that AWS ALBs don’t have. So, I’m glad to see it’s now offered:

Oddly, the radio button only shows up during ILB creation. To modify an existing one, use this gcloud command:

gcloud compute forwarding-rules update NAME --allow-global-access

Or, in Terraform:

resource "google_compute_forwarding_rule" "default" {
  allow_global_access   = true
}

It’s also important to be aware that global access must be enabled on the HTTP(S) ILB if it is accessed from another load balancer via PSC. If not, you’ll get this error message:

 Error 400: Invalid value for field 'resource.backends[0]': '{  "resourceGroup": "projects/myproject/regions/us-west1/networkEndpointGroups/psc-backend", ...'. Global L7 Private Service Connect consumers require the Private Service Connect producer load balancer to have AllowGlobalAccess enabled., invalid

Authenticating to Google Cloud Platform via OAuth2 with Python

For most of my troubleshooting tools, I want to avoid the security concerns that come with managing service accounts. Using my own account also lets me access multiple projects. To do the authentication in Python, I’d originally installed google-api-python-client and then authenticated using credentials=None:

from googleapiclient.discovery import build

try:
    resource_object = build('compute', 'v1', credentials=None)
except Exception as e:
    quit(e)

This call was a bit slow (2-3 seconds) and I was wondering if there was a faster way. The answer is ‘yes’ – just use OAuth2 tokens instead. Here’s how.

If not done already, generate a login session via this CLI command:

gcloud auth application-default login

You can then view its access token with this CLI command:

gcloud auth application-default print-access-token

You should see a string back that’s around 200 characters long. Now we’re ready to try this out with Python. First, install the oauth2client package:

pip3 install oauth2client

Now the actual Python code to get that same access token:

from oauth2client.client import GoogleCredentials

try:
    creds = GoogleCredentials.get_application_default()
except Exception as e:
    quit(e)

print("Access Token:", creds.get_access_token().access_token)

This took around 150-300 ms to execute, which is quite a bit faster and perfectly reasonable.

If using raw HTTP calls via requests, aiohttp, or http.client, set a header with ‘Authorization’ as the key and ‘Bearer <ACCESS_TOKEN>’ as the value.
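
As a minimal sketch of that header, assuming the token has already been obtained, here is how it could look with the standard library’s urllib.request. The function name build_request, the project ID my-project, and the choice of the Compute Engine zones endpoint are placeholders for illustration only. Nothing is sent over the network until urlopen() is called.

```python
import urllib.request


def build_request(url: str, access_token: str) -> urllib.request.Request:
    """Return a GET request that carries the OAuth2 token as a Bearer header."""
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers)


# Placeholder values: substitute a real project ID and a real access token.
url = "https://compute.googleapis.com/compute/v1/projects/my-project/zones"
req = build_request(url, "ACCESS_TOKEN")

# Show the header that would be sent.
print(req.get_header("Authorization"))

# To actually make the call: urllib.request.urlopen(req)
```

The same key and value work unchanged as a headers dictionary for requests or aiohttp.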

Using an SMTP Smart Host with Sendmail on FreeBSD 12

If you’ve got an ISP that forces all outbound e-mail to go via their servers, the solution is SMTP smart hosting, where one SMTP server uses another SMTP server as a relay or proxy, essentially acting more like a Mail User Agent (client) than a Mail Transfer Agent (server). On FreeBSD, the default mail server is sendmail, and it’s a little unclear how to set this up.

For me, I just decided to really start from scratch with a custom config file with the SMART_HOST setting. Here’s the file I created, saving it as /etc/mail/custom.mc:

divert(0)
VERSIONID(`$FreeBSD: releng/12.2/etc/sendmail/freebsd.mc 363465 2020-07-24 00:22:33Z gshapiro $')
OSTYPE(freebsd5)
FEATURE(access_db, `hash -o -T<TMPF> /etc/mail/access')
FEATURE(mailertable, `hash -o /etc/mail/mailertable')
FEATURE(virtusertable, `hash -o /etc/mail/virtusertable')
DOMAIN(generic)
DAEMON_OPTIONS(`Name=IPv4, Family=inet')
DAEMON_OPTIONS(`Name=IPv6, Family=inet6, Modifiers=O')
MASQUERADE_AS(`freebsd.mydomain.com')
FEATURE(`masquerade_envelope')
define(`SMART_HOST', `smtp.mydomain.com')
MAILER(local)
MAILER(smtp)

Then ran a few commands to build the config file using m4 and restart sendmail:

cd /etc/mail
cp sendmail.cf sendmail.cf.bak
m4 /usr/share/sendmail/cf/m4/cf.m4 custom.mc > sendmail.cf
touch local-host-names
/etc/rc.d/sendmail restart

Since I’m sending the e-mails via Python, I used this test script:

import smtplib

subject = "Test"
sender = "me@freebsd.mydomain.com"
recipient = "me@gmail.com"
smtp_host = "127.0.0.1"

message = f"From: {sender}\nTo: {recipient}\nSubject: {subject}\n\n"

try:
    server = smtplib.SMTP(smtp_host, port=25)
    server.ehlo()
    server.sendmail(sender, recipient, message)
    server.quit()
except Exception as e:
    quit(e)
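
The hand-built header string above is fine for a quick test. As a slightly more robust sketch, the standard library’s email.message.EmailMessage can build the headers instead, which avoids mistakes with line endings and blank lines. The function names build_message and send_via_local_sendmail and the body text are my own for illustration. The send step is left commented out so the snippet can run on a machine without sendmail.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str = "") -> EmailMessage:
    """Assemble a plain-text message with properly formatted headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_local_sendmail(msg: EmailMessage, host: str = "127.0.0.1", port: int = 25) -> None:
    """Hand the message to the local sendmail, which relays it to the smart host."""
    with smtplib.SMTP(host, port=port) as server:
        server.send_message(msg)


msg = build_message("me@freebsd.mydomain.com", "me@gmail.com", "Test", "Smart host test")
print(msg["From"], "->", msg["To"])

# Uncomment on a host that is running sendmail:
# send_via_local_sendmail(msg)
```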

Using GCP Ops Agent to view Squid Logs

The VMs were deployed via Terraform using instance templates, managed instance groups, and an internal TCP/UDP load balancer with a forwarding rule for port 3128. Debian 11 (Bullseye) was selected as the OS because it has a low memory footprint while still offering a nice pre-packaged version of Squid 4.

The first problem is that the older Stackdriver agent isn’t compatible with Debian 11, so I had to install the newer Ops Agent. I chose to just add these lines to my startup script, pulling the script directly from a bucket to avoid the requirement of Internet access:

gsutil cp gs://public-j5-org/add-google-cloud-ops-agent-repo.sh /tmp/
bash /tmp/add-google-cloud-ops-agent-repo.sh --also-install

After re-deploying the VMs, I ssh’d in and verified the Ops agent was installed and running:

sudo systemctl status google-cloud-ops-agent"*"

google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
     Active: active (running) since Fri 2023-02-10 22:18:17 UTC; 18min ago
    Process: 4317 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/>
   Main PID: 4350 (otelopscol)
      Tasks: 7 (limit: 1989)
     Memory: 45.7M
        CPU: 1.160s

After waiting a couple minutes, I still didn’t see anything, so I downloaded and ran their diagnostic script:

gsutil cp gs://public-j5-org/diagnose-agents.sh /tmp/ && bash /tmp/diagnose-agents.sh

This was confusing because, while it didn’t show any errors on screen, the actual log was dumped to disk in a sub-directory of /var/tmp/google-agents/, and the agent-info.txt file there did indicate a problem:

API Check - Result: FAIL, Error code: LogApiPermissionErr, Failure: Service account is missing the roles/logging.logWriter role., Solution: Add the roles/logging.logWriter role to the Google Cloud service account., Resource: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/authorization#create-service-account
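
Since the failure only shows up in a file, a few lines of Python can pull the failed checks out of it. This is a rough sketch that assumes every failing check follows the same "Result: FAIL, Error code: ..." layout as the one line shown above; I haven’t verified that against other kinds of failures. The function name failed_checks is my own.

```python
import re

# Assumed layout, based on the single failure line seen in agent-info.txt.
FAIL_PATTERN = re.compile(r"Result: FAIL, Error code: (\w+)")


def failed_checks(text: str) -> list:
    """Return the error codes of every check reported as FAIL."""
    return FAIL_PATTERN.findall(text)


sample = "API Check - Result: FAIL, Error code: LogApiPermissionErr, Failure:"
print(failed_checks(sample))

# Against the real file, pass in its contents, for example:
# failed_checks(open("agent-info.txt").read())
```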

And this made sense, because in order for Ops Agent to function, it needs these two IAM roles enabled for the service account:

  • Monitoring > Monitoring Metric Writer.
  • Logging > Logs Writer.

Here’s a Terraform snippet that will do that:

# Add required IAM permissions for Ops Agents
locals {
  roles = ["logging.logWriter", "monitoring.metricWriter"]
}
resource "google_project_iam_member" "default" {
  for_each = var.service_account_email != null ? toset(local.roles) : toset([])
  project  = var.project_id
  member   = "serviceAccount:${var.service_account_email}"
  role     = "roles/${each.value}"
}

Within a few minutes of adding these, data started showing up in the graphs.

Migrating a CheckPoint Management Server in GCP from R80.40 to R81.10

Here’s an outline of the process:

  • Launch a new R81.10 VM and create /var/log/mdss.json with the hostname and new IP address
  • On the old R80.40 VM, perform an export (this will result in services being stopped for ~ 15 minutes)
  • On the new R81.10 VM, perform an import. This will take about 30 minutes
  • If using BYOL, re-issue the license with the new IP address

Performing Export on old R80.40 Server

On the old R80.40 server, in GAIA, navigate to Maintenance -> System Backups. If not done already, run a backup. This will give a rough idea of how long the export job will take and the approximate file size including logs.

So for me, the export size can be assumed to be just under 1.2 GB. Then go to the CLI and enter expert mode. First, run migrate_server verify:

expert

cd $FWDIR/scripts

./migrate_server verify -v R81.10
The verify operation finished successfully.

Now actually do the export. Mine took about 15 minutes and resulted in a 1.1 GB file when including logs.

./migrate_server export -v R81.10 -l /var/log/export.tgz

The export operation will eventually stop all Check Point services (cpstop; cpwd_admin kill). Do you want to continue (yes/no) [n]? yes

Exporting the Management Database
Operation started at Thu Jan  5 16:20:33 UTC 2023

[==================================================] 100% Done

The export operation completed successfully. Do you wish to start Check Point services (yes/no) [y]? y
Starting Check Point services ...
The export operation finished successfully. 
Exported data to: /var/log/export.tgz.

Then copy the file to somewhere offsite using SCP or SFTP.

ls -la /var/log/export.tgz 
-rw-rw---- 1 admin root 1125166179 Jan  5 17:36 /var/log/export.tgz

scp /var/log/export.tgz billy@10.1.2.6:

Setting up the new R81.10 Server

After launching the VM, SSH in and set an admin user password and expert mode password. Then save config:

set user admin password

set expert-password

save config

Log in to the Web GUI and start the setup wizard. This is pretty much just clicking through a bunch of “Next” buttons. It is recommended to enable NTP though, and to uncheck “Gateway” if this is a management-only server.

When the setup wizard has concluded, download and install SmartConsole, then the latest Hotfix.

Once rebooted, log in via CLI, go to expert mode, and create a /var/log/mdss.json file that has the name of the Management server (as it appears in SmartConsole) and the new server’s internal IP address. Mine looks like this:

[{"name":"checkpoint-mgr","newIpAddress4":"10.22.33.44"}]

It’s not a bad idea to paste this in to a JSON Validator to ensure the syntax is proper. Also note the square outer brackets, even though there’s only one entry in the array.
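
If you’d rather not paste internal names and IPs into a website, the same check can be done locally. Here is a minimal sketch that only verifies what is visible in my example: valid JSON, an outer array, and the two fields name and newIpAddress4. It makes no claim to cover Check Point’s full schema, and the function name validate_mdss is my own.

```python
import ipaddress
import json


def validate_mdss(text: str) -> list:
    """Parse mdss.json content and raise ValueError if it looks wrong."""
    data = json.loads(text)  # json.JSONDecodeError (a ValueError) on bad syntax
    if not isinstance(data, list):
        raise ValueError("top level must be an array (square outer brackets)")
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError("each array entry must be an object")
        for field in ("name", "newIpAddress4"):
            if field not in entry:
                raise ValueError(f"missing field: {field}")
        # Raises a ValueError subclass if the address is malformed.
        ipaddress.IPv4Address(entry["newIpAddress4"])
    return data


sample = '[{"name":"checkpoint-mgr","newIpAddress4":"10.22.33.44"}]'
print(validate_mdss(sample))

# On the server itself: validate_mdss(open("/var/log/mdss.json").read())
```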

Importing the Database

Now we’re ready to copy the exported file from the R80.40 server. /var/log typically has the most room, so that’s a good location. Then run the import command. For me, this took around 20-30 minutes.

scp billy@10.1.2.6:export.tgz /var/log/

cd $FWDIR/scripts
./migrate_server import -v R81.10 -l /var/log/export.tgz

Importing the Management Database
Operation started at Thu Jan  5 16:51:22 GMT 2023

The import operation finished successfully.

If a “Failed to import” message appears, check the /var/log/mdss.json file again. Make sure the brackets, quotes, commas, and colons are in the proper place.

After giving the new server a reboot for good measure, login to CLI and verify services are up and running. Note it takes 2-3 minutes for the services to be fully running:

cd $FWDIR/scripts
./cpm_status.sh 
Check Point Security Management Server is during initialization

./cpm_status.sh 
Check Point Security Management Server is running and ready

I then tried to login via R81.10 SmartConsole and got this message:

This is expected. The /var/log/mdss.json file only manages the connection to the gateways; it doesn’t have anything to do with licensing for the management server itself. And I would guess that doing the import results in the 14-day trial license being overridden. Just to confirm that theory, I launched a PAYG VM, re-did the migration, and no longer saw this error.

Updating the Management Server License

Login to User Center -> Assets/Info -> Product Center, locate the license, change the IP address, and install the new license. Since SmartConsole won’t load, this must be done via CLI.

cplic put 10.22.33.44 never XXXXXXX

I then rebooted and waited 2-3 minutes for services to fully start. At this point, I was able to log in to SmartConsole and see the gateways, but they all showed red. This is also expected – to make them green, policy must be installed.

I first did a database install for the management server itself (Menu -> Install Database), which was successful. Then tried a policy install on the gateways and got a surprise – the policy push failed, complaining of a connection failure.

From the Management Server, I tried a basic telnet test for port 18191 and it did indeed fail:

telnet 10.22.33.121 18191
Trying 10.22.33.121..
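
The same test can be scripted, which is handy when checking several gateways. This is a minimal sketch using only the standard library; the function name probe_tcp and the three-second timeout are my own choices for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered and actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call, using the gateway address from the telnet test:
# print(probe_tcp("10.22.33.121", 18191))
```

As a rough rule of thumb, "refused" suggests the traffic arrived and was rejected, while "timeout" more often points at something silently dropping it along the way.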

At first I thought the issue was firewall rules, but concluded that the port 18191 traffic was reaching the gateway but being rejected, which indicates a SIC issue. Sure enough, a quick Google pointed me to this:

Policy installation fails with “TCP connection failure port=18191”

Indeed, the CheckPoint deployment template for GCP uses “member-a” and “member-b” as the hostname suffix for the gateways, but we give them a slightly different name in order to be consistent with our internal naming scheme.

The fix is to change the hostname in the CLI to match the gateway name configured in SmartConsole:

cp-cluster-member-a> set hostname newhostname
cp-cluster-member-01> set domainname mydomain.org
cp-cluster-member-01> save config

After that, the telnet test to port 18191 was successful, and SmartConsole indicated some communication:

Looking long term, a better solution is to just leave the cluster members with the default ‘member-a/b’ hostnames and configure SmartConsole to match.

Now I have to reset SIC on both gateways:

cp-cluster-member-01> cpconfig
This program will let you re-configure
your Check Point products configuration.

Configuration Options:
----------------------
(1)  Licenses and contracts
(2)  SNMP Extension
(3)  PKCS#11 Token
(4)  Random Pool
(5)  Secure Internal Communication
(6)  Disable cluster membership for this gateway
(7)  Enable Check Point Per Virtual System State
(8)  Enable Check Point ClusterXL for Bridge Active/Standby
(9)  Hyper-Threading
(10) Check Point CoreXL
(11) Automatic start of Check Point Products

(12) Exit

Enter your choice (1-12) :5



Configuring Secure Internal Communication...
============================================
The Secure Internal Communication is used for authentication between
Check Point components

Trust State: Trust established

 Would you like re-initialize communication? (y/n) [n] ? y

Note: The Secure Internal Communication will be reset now,
and all Check Point Services will be stopped (cpstop).
No communication will be possible until you reset and
re-initialize the communication properly!
Are you sure? (y/n) [n] ? y
Enter Activation Key: 
Retype Activation Key: 
initial_module:
Compiled OK.
initial_module:
Compiled OK.

Hardening OS Security: Initial policy will be applied
until the first policy is installed

The Secure Internal Communication was successfully initialized

Configuration Options:
----------------------
(1)  Licenses and contracts
(2)  SNMP Extension
(3)  PKCS#11 Token
(4)  Random Pool
(5)  Secure Internal Communication
(6)  Disable cluster membership for this gateway
(7)  Enable Check Point Per Virtual System State
(8)  Enable Check Point ClusterXL for Bridge Active/Standby
(9)  Hyper-Threading
(10) Check Point CoreXL
(11) Automatic start of Check Point Products

(12) Exit

Enter your choice (1-12) :12

Thank You...
cpwd_admin: 
Process AUTOUPDATER terminated 
cpwd_admin: 
Process DASERVICE terminated 

The services will restart, which triggers a failover. At this point, I went into SmartConsole, edited the member, reset SIC, re-entered the key, and initialized. The policy pushes were then successful and everything was green. The last remaining issue was an older R80.30 cluster complaining of the IDS module not responding. This resolved itself the next day.

Reading TOML in Python

Last year I started hearing more about TOML, a configuration file format (in the same family as YAML, JSON, and XML) that reminds me of configparser. The nice thing about TOML is that its syntax and formatting closely resemble Python and Terraform, so it feels very natural to use and is worth learning.

To install the package via pip:

sudo pip3 install tomli

On FreeBSD, use a package:

pkg install py39-tomli

Create some basic TOML in a “test.toml” file:

[section_a]
key1 = "value_a1"
key2 = "value_a2"

[section_b]
name = "Frank"
age = 51
alive = true

[section_c]
pi = 3.14

Now some Python code to read it:

import tomli
from pprint import pprint

with open("test.toml", mode="rb") as fp:
    pprint(tomli.load(fp))

When run, produces the following:

{'section_a': {'key1': 'value_a1', 'key2': 'value_a2'},
 'section_b': {'age': 51, 'alive': True, 'name': 'Frank'},
 'section_c': {'pi': 3.14}}
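
Note the values come back as native Python types (int, bool, float) rather than strings. Also, Python 3.11 and later ship a tomllib module in the standard library with the same load() and loads() functions, so a sketch that works with or without the tomli package could look like this. The inline TOML is just a trimmed copy of the test file above.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # same API on older Python versions

doc = """
[section_b]
name = "Frank"
age = 51
alive = true

[section_c]
pi = 3.14
"""

# loads() parses a string; load() expects a file opened in binary mode.
data = tomllib.loads(doc)

for section, values in data.items():
    for key, value in values.items():
        print(section, key, type(value).__name__, value)
```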

Re-sizing the Disk of a CheckPoint R80.40 Management Server in GCP

Breaking down the problem

As we enter the last year of support for CheckPoint R80.40, it’s time to finally get all management servers upgraded to R81.10 (if not done already). But I ran into a problem when creating a snapshot on our management server in GCP:

This screen didn’t quite make sense because it says 6.69 GB are free, but the root partition actually shows 4.4 GB:

[Expert@chkpt-mgr:0]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/vg_splat-lv_current  20961280 16551092   4410188  79% /
/dev/sda1                          297485    27216    254909  10% /boot
tmpfs                             7572656     3856   7568800   1% /dev/shm
/dev/mapper/vg_splat-lv_log      45066752 27846176  17220576  62% /var/log

As it turns out, the 6 GB mentioned is completely un-partitioned space set aside for GAIA internals:

[Expert@chkpt-mgr:0]# lvm_manager -l

Select action:

1) View LVM storage overview
2) Resize lv_current/lv_log Logical Volume
3) Quit
Select action: 1

LVM overview
============
                  Size(GB)   Used(GB)   Configurable    Description         
    lv_current    20         16         yes             Check Point OS and products
    lv_log        43         27         yes             Logs volume         
    upgrade       22         N/A        no              Reserved for version upgrade
    swap          8          N/A        no              Swap volume size    
    free          6          N/A        no              Unused space        
    -------       ----                                                      
    total         99         N/A        no              Total size  

This explains why the disk space is always inadequate – 20 GB for root, 43 GB for log, 22 GB for “upgrade” (which can’t be used in GCP), 8 GB swap, and the remaining 6 GB set aside for snapshots (which is too small to be of use).

To create enough space for a snapshot we have only one solution: expand the disk size.

List of Steps

After first taking a Disk Snapshot of the disk in GCP, I followed these steps:

! On VM, in expert mode:
rm /etc/autogrow
shutdown -h now

! Use gcloud to increase disk size to 160 GB
gcloud compute disks resize my-vm-name --size 160 --zone us-central1-c

! Start VM up again
gcloud compute instances start my-vm-name --zone us-central1-c

After bootup, I ran parted -l and verified partition #4 had been added:

[Expert@ckpt:0]# parted -l

Model: Google PersistentDisk (scsi)
Disk /dev/sda: 172GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name       Flags
 1      17.4kB  315MB   315MB   ext3                       boot
 2      315MB   8902MB  8587MB  linux-swap(v1)
 3      8902MB  107GB   98.5GB                             lvm
 4      107GB   172GB   64.4GB                  Linux LVM  lvm


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_splat-lv_log: 46.2GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags: 

Number  Start  End     Size    File system  Flags
 1      0.00B  46.2GB  46.2GB  xfs


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_splat-lv_current: 21.5GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags: 

Number  Start  End     Size    File system  Flags
 1      0.00B  21.5GB  21.5GB  xfs
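
In case the 172GB figure looks off after asking for 160: as far as I can tell, the sizes here are in binary units (GiB) on the GCP side, while parted prints decimal GB. A quick arithmetic sketch of the conversion follows. The 100 GiB original disk size is my inference from partition 3 ending at 107GB, not something I looked up.

```python
GIB = 2**30  # bytes in one binary gigabyte
GB = 10**9   # bytes in one decimal gigabyte, as printed by parted


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return size_gib * GIB / GB


print(round(gib_to_gb(160)))           # whole disk after resize -> 172
print(round(gib_to_gb(100)))           # assumed original disk   -> 107
print(round(gib_to_gb(160 - 100), 1))  # new partition #4        -> 64.4
```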

Then converted the partition to an empty volume and gave it to GAIA:

pvcreate /dev/sda4 -ff
vgextend vg_splat /dev/sda4

After all this, lvm_manager shows the free disk space is being seen:

[Expert@ckpt:0]# lvm_manager

Select action:

1) View LVM storage overview
2) Resize lv_current/lv_log Logical Volume
3) Quit

Select action: 1

LVM overview
============
                  Size(GB)   Used(GB)   Configurable    Description         
    lv_current    20         8          yes             Check Point OS and products
    lv_log        43         4          yes             Logs volume         
    upgrade       22         N/A        no              Reserved for version upgrade
    swap          8          N/A        no              Swap volume size    
    free          126        N/A        no              Unused space        
    -------       ----                                                      
    total         219        N/A        no              Total size 

Creating a snapshot in GAIA is no longer a problem:

Sorting IP addresses in Python

Trying to sort by IP addresses using their regular string values doesn’t work very well:

sorted(['192.168.1.1','192.168.2.1','192.168.11.1','192.168.12.1'])
['192.168.1.1', '192.168.11.1', '192.168.12.1', '192.168.2.1']

The good news is the solution isn’t difficult. Just use ip_address() in a lambda to convert each IP to its integer value:

import ipaddress

ips = ['192.168.1.1','192.168.2.1','192.168.11.1','192.168.12.1']

sorted(ips, key=lambda i: int(ipaddress.ip_address(i)))

Results in the following:

['192.168.1.1', '192.168.2.1', '192.168.11.1', '192.168.12.1']

In a more complex example where the data is in a list of dictionaries:

import ipaddress

ips = [
  {'address': "192.168.0.1"},
  {'address': "100.64.0.1"},
  {'address': "10.0.0.1"},
  {'address': "198.18.0.1"},
  {'address': "172.16.0.1"},
]

addresses = sorted(ips, key=lambda x: int(ipaddress.ip_address(x['address'])))

print(addresses)

Results in the following:

[{'address': '10.0.0.1'}, {'address': '100.64.0.1'}, {'address': '172.16.0.1'}, {'address': '192.168.0.1'}, {'address': '198.18.0.1'}]

To get the IPs with the highest first, just add reverse=True to the sorted() call.
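
A short sketch of the descending sort, reusing the first list from above. It also shows a small variation: the address objects returned by ip_address() can be compared directly, so the int() conversion is optional as long as every entry is the same IP version. With a mix of IPv4 and IPv6 the objects refuse to compare and raise TypeError, so keep the int() form in that case.

```python
import ipaddress

ips = ['192.168.1.1', '192.168.2.1', '192.168.11.1', '192.168.12.1']

# Highest address first, using the integer form as before.
descending = sorted(ips, key=lambda i: int(ipaddress.ip_address(i)), reverse=True)
print(descending)

# Same ordering without int(): address objects of one version are comparable.
descending_objects = sorted(ips, key=ipaddress.ip_address, reverse=True)
print(descending_objects == descending)
```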

Benchmarking Ampere’s ARM CPU in Google Cloud Platform

While creating an instance today I noticed GCP offers ARM based CPUs made by Ampere, a company based in Santa Clara with a large office in Portland. The monthly cost runs about $30/mo for a single CPU with 4 GB RAM – a bit pricier than comparable N1, but slightly less than a comparable T2D, which is the ultra-fast AMD EPYC Milan platform.

Since I mostly run basic Debian packages and Python scripts, CPU platform really wasn’t an issue, so I was curious to have a quick bake-off using a basic 16-thread sysbench test to mimic a light to moderate load. Here are the results:

t2a-standard-1

These are based on Ampere Altra and cost $29/mo in us-central1

CPU speed:
    events per second:  3438.95

General statistics:
    total time:                          10.0024s
    total number of events:              34401

Latency (ms):
         min:                                    0.28
         avg:                                    4.63
         max:                                   80.31
         95th percentile:                       59.99
         sum:                               159394.13

Threads fairness:
    events (avg/stddev):           2150.0625/4.94
    execution time (avg/stddev):   9.9621/0.03

t2d-standard-1

These are based on the new 3rd gen AMD Milan platform and cost $32/mo in us-central1

CPU speed:
    events per second:  3672.67

General statistics:
    total time:                          10.0027s
    total number of events:              36738

Latency (ms):
         min:                                    0.27
         avg:                                    4.34
         max:                                  100.28
         95th percentile:                       59.99
         sum:                               159498.26

Threads fairness:
    events (avg/stddev):           2296.1250/3.24
    execution time (avg/stddev):   9.9686/0.02

n1-standard-1

These are based on the older Intel Skylake platform and cost $25/mo in us-central1

Prime numbers limit: 10000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:   913.60

General statistics:
    total time:                          10.0072s
    total number of events:              9144

Latency (ms):
         min:                                    1.08
         avg:                                   17.45
         max:                                   89.10
         95th percentile:                       61.08
         sum:                               159544.06

Threads fairness:
    events (avg/stddev):           571.5000/1.00
    execution time (avg/stddev):   9.9715/0.03

n2d-custom-2-4096

These are based on 2nd generation AMD EPYC Rome and cost $44/mo in us-central1

CPU speed:
    events per second:  1623.41

General statistics:
    total time:                          10.0046s
    total number of events:              16243

Latency (ms):
         min:                                    0.89
         avg:                                    9.82
         max:                                   97.24
         95th percentile:                       29.19
         sum:                               159485.50

Threads fairness:
    events (avg/stddev):           1015.1875/3.13
    execution time (avg/stddev):   9.9678/0.02

n2-custom-2-4096

These are based on Intel Cascade Lake and cost $50/mo in us-central1

CPU speed:
    events per second:  1942.56

General statistics:
    total time:                          10.0036s
    total number of events:              19435

Latency (ms):
         min:                                    1.01
         avg:                                    8.21
         max:                                   57.04
         95th percentile:                       29.19
         sum:                               159499.92

Threads fairness:
    events (avg/stddev):           1214.6875/8.62
    execution time (avg/stddev):   9.9687/0.02

e2-medium

The CPU platform for these is chosen based on availability; they have 1-2 shared CPU cores and cost $25/mo in us-central1

CPU speed:
    events per second:  1620.67

General statistics:
    total time:                          10.0055s
    total number of events:              16217

Latency (ms):
         min:                                    0.85
         avg:                                    9.84
         max:                                   65.18
         95th percentile:                       29.19
         sum:                               159647.07

Threads fairness:
    events (avg/stddev):           1013.5625/3.43
    execution time (avg/stddev):   9.9779/0.02

Summary

Ampere’s ARM CPUs offered slightly lower performance than the latest goodies from AMD. They did, however, beat them in the bang-for-buck ratio thanks to costing $3 less per month to run ($29 vs. $32).

But the key takeaway is that both platforms completely blow away the older CPU platforms from Intel. Here are some nice little charts visualizing the numbers.
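
To put a number on the bang-for-buck comparison, here is a quick sketch that divides each machine type’s events per second by its monthly price, using only the figures listed above. The metric itself (events/sec per dollar per month) is just my own rough yardstick.

```python
# (events per second, monthly cost in USD) taken from the results above
results = {
    "t2a-standard-1": (3438.95, 29),
    "t2d-standard-1": (3672.67, 32),
    "n1-standard-1": (913.60, 25),
    "n2d-custom-2-4096": (1623.41, 44),
    "n2-custom-2-4096": (1942.56, 50),
    "e2-medium": (1620.67, 25),
}

# Events per second for every dollar spent per month.
value = {name: eps / cost for name, (eps, cost) in results.items()}

for name, score in sorted(value.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {score:6.1f}")
```

By this measure the T2A comes out at roughly 118.6 versus 114.8 for the T2D, with everything else well behind.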

Migrating Terraform State Files to Workspaces in an AWS S3 Bucket

Just as I did with GCP a few weeks ago, I needed to circle back and migrate my state files to a cloud storage bucket. This was done mainly to centralize the storage location automatically and thus lower the chance of a state file being lost or corrupted.

Previously, I’d been separating the state files using the -state parameter. I then use a different input file and state file for each environment like this:

terraform apply -var-file=env1.tfvars -state=env1.tfstate
terraform apply -var-file=env2.tfvars -state=env2.tfstate
terraform apply -var-file=env3.tfvars -state=env3.tfstate

To instead store the state files in an AWS S3 bucket, create a backend.tf file with this content:

terraform {
  backend "s3" {
    bucket               = "my-bucket-name"
    workspace_key_prefix = "tf-state"
    key                  = "terraform.tfstate"
    region               = "us-west-1"
  }
}

This will use a bucket named ‘my-bucket-name’ in AWS region us-west-1. Each workspace will store its state file in tf-state/<WORKSPACE_NAME>/terraform.tfstate

Note: if workspace_key_prefix is not specified, the directory ‘env:’ will be created and used.
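
To make the layout concrete, here is a tiny sketch of how the object path is put together for each workspace. The function state_object_key is purely illustrative and not part of Terraform. It mirrors my understanding of the S3 backend: the default workspace uses the key as-is, while named workspaces get the prefix and workspace name in front.

```python
def state_object_key(workspace: str,
                     key: str = "terraform.tfstate",
                     workspace_key_prefix: str = "env:") -> str:
    """Return the S3 object key where a workspace's state file lives."""
    if workspace == "default":
        # The default workspace is stored directly at the configured key.
        return key
    return f"{workspace_key_prefix}/{workspace}/{key}"


for ws in ("env1", "env2", "env3"):
    print(state_object_key(ws, workspace_key_prefix="tf-state"))

# Without workspace_key_prefix, the 'env:' directory is used instead.
print(state_object_key("env1"))
```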

Since the backend has changed, I have to run this:

terraform init -reconfigure

I then have to copy the local state files to the correct location that the workspace will be using. This is easiest done with the AWS CLI tool, which will automatically create the sub-directory if it doesn’t exist.

aws s3 cp env1.tfstate s3://my-bucket-name/tf-state/env1/terraform.tfstate
aws s3 cp env2.tfstate s3://my-bucket-name/tf-state/env2/terraform.tfstate
aws s3 cp env3.tfstate s3://my-bucket-name/tf-state/env3/terraform.tfstate

I then create a workspace for each state file:

$ terraform workspace new env1
Created and switched to workspace "env1"!

Now I’m ready to run the applies and verify the state matches the input:

$ terraform apply -var-file=env1.tfvars

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

$ terraform workspace new env2
Created and switched to workspace "env2"!

$ terraform apply -var-file=env2.tfvars

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Doing it in the opposite order

An alternate way to do this migration is to enable workspaces first, then migrate the backend to S3.

$ terraform workspace new env1
Created and switched to workspace "env1"!

$ mv env1.tfstate terraform.tfstate.d/env1/terraform.tfstate

$ terraform apply -var-file=env1.tfvars

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Then create the backend.tf file and run terraform init -reconfigure. You’ll then be prompted to move the state files to S3:

$ terraform init -reconfigure
Initializing modules...

Initializing the backend...
Do you want to migrate all workspaces to "s3"?

Enter a value: yes

$ terraform apply -var-file=env1.tfvars

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Either way, the state files have to be individually migrated to the storage bucket.