Last year I started hearing more about TOML, a configuration file format (in the same space as YAML, JSON, and XML) that reminds me of configparser. The nice thing about TOML is that its syntax and formatting closely resemble Python and Terraform, so it feels very natural to use and is worth learning.
To install the tomli package via pip:
sudo pip3 install tomli
On FreeBSD, use a package:
pkg install py39-tomli
Create some basic TOML in a “test.toml” file:
[section_a]
key1 = "value_a1"
key2 = "value_a2"
[section_b]
name = "Frank"
age = 51
alive = true
[section_c]
pi = 3.14
Now some Python code to read it:
import tomli
from pprint import pprint
with open("test.toml", mode="rb") as fp:
    pprint(tomli.load(fp))
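tomli parses the file into an ordinary Python dictionary (one nested dict per table), so the values come back as native Python types and can be accessed with normal dict syntax. For example, reusing the test.toml above:
import tomli

with open("test.toml", mode="rb") as fp:
    data = tomli.load(fp)

print(data["section_b"]["name"])   # Frank
print(data["section_b"]["alive"])  # True (a Python bool)
print(data["section_c"]["pi"])     # 3.14 (a Python float)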
[Expert@cp-member-a:0]# $FWDIR/scripts/google_ha_test.py
GCP HA TESTER: started
GCP HA TESTER: checking access scopes...
GCP HA TESTER: ERROR
Expecting value: line 1 column 1 (char 0)
Got this message when trying to test a CheckPoint R81.10 cluster build in a new environment. Obviously, this error message is not at all helpful in determining what the problem is. So I wrote a little debug script to try and isolate the issue:
import traceback
import gcp as _gcp
global api
api = _gcp.GCP('IAM', max_time=20)
metadata = api.metadata()[0]
project = metadata['project']['projectId']
zone = metadata['instance']['zone'].split('/')[-1]
name = metadata['instance']['name']
print("Got metadata: project = {}, zone = {}, name = {}\n".format(project, zone, name))
path = "/projects/{}/zones/{}/instances/{}".format(project, zone, name)
try:
    head, res = api.rest("GET", path, query=None, body=None, aggregate=False)
except Exception as e:
    print(traceback.format_exc())
Running the script, I now see an exception when trying to make the initial API call:
[Expert@cp-cluster-member-a:0]# cd $FWDIR/scripts
[Expert@cp-cluster-member-a:0]# python3 ./debug.py
Got metadata: project = myproject, zone = us-central1-b, name = cp-member-a
Traceback (most recent call last):
  File "debug.py", line 18, in <module>
    head, res = api.rest(method,path,query=None,body=None,aggregate=False)
  File "/opt/CPsuite-R81.10/fw1/scripts/gcp.py", line 327, in rest
    max_time=self.max_time, proxy=self.proxy)
  File "/opt/CPsuite-R81.10/fw1/scripts/gcp.py", line 139, in http
    headers['_code']), headers, repr(response))
gcp.HTTPException: Unexpected HTTP code: 403
This at least indicates the connection to the API is OK and it’s some type of permissions issue with the account.
The CheckPoints have always been really tough to troubleshoot in this regard, so to keep things simple I deploy them with the default service account for the project. It's not explicitly called out anywhere obvious, but that account needs the Editor role on the project.
I was able to re-enable Editor permissions for the default service account with this Terraform code:
# Set Project ID via input variable
variable "project_id" {
description = "GCP Project ID"
type = string
}
# Get the default service account info for this project
data "google_compute_default_service_account" "default" {
project = var.project_id
}
# Enable editor role for this service account
resource "google_project_iam_member" "default_service_account_editor" {
project = var.project_id
member = "serviceAccount:${data.google_compute_default_service_account.default.email}"
role = "roles/editor"
}
I have a script doing real-time log analysis, where about 25 log files are stored in a Google Cloud Storage bucket. The files are always small (1-5 MB each) but the script was taking over 10 seconds to run, resulting in slow page load times and poor user experience. Performance analysis showed that most of the time was spent on the storage calls, with high overhead of requesting individual files.
I started thinking the best way to improve performance was to make the storage calls asynchronously and download the files in parallel. This requires a library capable of making async calls; after lots of Googling and trial and error I found a Stack Overflow post that mentioned gcloud AIO Storage. It worked very well, and after implementing it I'm seeing a 125% speed improvement!
Here's a rundown of the steps I took to get async downloads working with GCS.
1) Install gcloud AIO Storage:
pip install gcloud-aio-storage
2) In the Python code, start with some imports:
import asyncio
from gcloud.aio.auth import Token
from gcloud.aio.storage import Storage
3) Create a function to read multiple files from the same bucket:
async def IngestLogs(bucket_name, file_names, key_file=None):
    SCOPES = ["https://www.googleapis.com/auth/cloud-platform.read-only"]
    token = Token(service_file=key_file, scopes=SCOPES)
    async with Storage(token=token) as client:
        tasks = (client.download(bucket_name, _) for _ in file_names)
        blobs = await asyncio.gather(*tasks)
    await token.close()
    return blobs
It’s important to note that ‘blobs’ will be a list, with each element representing a binary version of the file.
4) Create some code to call the async function. The decode() function will convert each blob to a string.
def main():
    bucket_name = "my-gcs-bucket"
    file_names = {
        'file1': "path1/file1.abc",
        'file2': "path2/file2.def",
        'file3': "path3/file3.ghi",
    }
    key = "myproject-123456-mykey.json"

    blobs = asyncio.run(IngestLogs(bucket_name, file_names.values(), key_file=key))
    for blob in blobs:
        # Print the first 75 characters of each blob
        print(blob.decode('utf-8')[0:75])

if __name__ == "__main__":
    main()
I track the load times via New Relic Synthetics, and it showed a 300% performance improvement!
Now you’re ready to write some Python code. Start with a couple imports:
#!/usr/bin/env python3
from googleapiclient import discovery
from google.oauth2 import service_account
By default, the Compute Engine default service account for the VM (or App Engine) will be used for authentication. Alternately, a service account can be specified with its JSON key file:
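A minimal sketch of building the API client both ways, where the resource_object name matches the examples below and the key filename and scope are just placeholders:
# Option 1: rely on the default service account (application default credentials)
resource_object = discovery.build('compute', 'v1')

# Option 2: authenticate explicitly with a service account key file
credentials = service_account.Credentials.from_service_account_file(
    "myproject-123456-mykey.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
resource_object = discovery.build('compute', 'v1', credentials=credentials)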
All API calls require that the project ID (not the name) be provided as a parameter. I will set it like this:
PROJECT_ID = "myproject-1234"
With the connection to the API established, you can now run some commands. The resource object has a method for each resource type, and each of those typically provides a list() method to list the items in the project. The execute() at the end is required to actually make the call.
It's important to note that list().execute() returns a dictionary; the actual list of items lives under the 'items' key. I'll use the dictionary's get() method to retrieve the value for 'items', falling back to an empty list if 'items' doesn't exist. Here's an example:
_ = resource_object.firewalls().list(project=PROJECT_ID).execute()
firewall_rules = _.get('items', [])

print(len(firewall_rules), "firewall rules in project", PROJECT_ID)
for firewall_rule in firewall_rules:
    print(" -", firewall_rule['name'])
The API reference guide has a complete list of everything that's available. Here are some examples:
Some calls require the region name as a parameter. To get a list of all regions:
_ = resource_object.regions().list(project=PROJECT_ID).execute()
regions = [region['name'] for region in _.get('items', [])]
Then iterate through each region. For example, to list all subnets:
for region in regions:
    _ = resource_object.subnetworks().list(project=PROJECT_ID, region=region).execute()
    print("Reading subnets for region", region, "...")
    subnets = _.get('items', [])
    for subnet in subnets:
        print(" -", subnet['name'], subnet['ipCidrRange'])
As I explore different ways of doing web programming in Python with different frameworks, I keep finding the need to examine HTTP server variables, specifically the server hostname, path, and query string. The method for doing this varies quite a bit by framework.
I'd heard about netaddr a few weeks ago and had made a note to try it out. What I learned today is that a similar library called ipaddress is included with Python 3 and offers most of netaddr's functionality, just with different syntax.
It is very handy for subnetting. Here's some basic code that takes the 198.18.128.0/18 CIDR block and splits it into four /20s:
#!/usr/bin/env python
import ipaddress
cidr = "198.18.128.0/18"
subnet_size = "/20"
[network_address, prefix_len] = cidr.split('/')
power = int(subnet_size[1:]) - int(prefix_len)
subnets = list(ipaddress.ip_network(cidr).subnets(power))
print("{} splits in to {} {}s:".format(cidr, len(subnets), subnet_size))
for _ in range(len(subnets)):
print(" Subnet #{} = {}".format(_+1, subnets[_]))
I finally began migrating some old scripts from PHP to Python late last year, and while I was happy to put my PHP days behind me, I noticed the script execution times were disappointing. On average, a Python CGI script would run 20-80% slower than an equivalent PHP script. At first I chalked it up to slower libraries, but even basic scripts that didn't rely on a database or anything fancy still seemed to incur a performance hit.
Yesterday I happened to come across mention of WSGI, which is essentially a Python-specific replacement for CGI. I realized the overhead of CGI probably explained why my Python scripts were slower than PHP. So I wanted to give WSGI a spin and see if it could help.
Like PHP, WSGI support comes from an Apache module (mod_wsgi) that is not included in many pre-packaged versions of Apache, so the first step is to install it.
On Debian/Ubuntu:
sudo apt-get install libapache2-mod-wsgi-py3
The install process should auto-activate the module.
cd /etc/apache2/mods-enabled/
ls -la wsgi*
lrwxrwxrwx 1 root root 27 Mar 23 22:13 wsgi.conf -> ../mods-available/wsgi.conf
lrwxrwxrwx 1 root root 27 Mar 23 22:13 wsgi.load -> ../mods-available/wsgi.load
On FreeBSD, the module does not get auto-activated and must be loaded via a config file:
sudo pkg install ap24-py37-mod_wsgi
# Create /usr/local/etc/apache24/Includes/wsgi.conf
# or similar, and add this line:
LoadModule wsgi_module libexec/apache24/mod_wsgi.so
Like CGI, the directory with the WSGI script will need special permissions. As a security best practice, it’s a good idea to have scripts located outside of any DocumentRoot, so the scripts can’t accidentally get served as plain files.
<Directory "/var/www/scripts">
    Require all granted
</Directory>
As for the WSGI script itself, it's similar to AWS Lambda in that it uses a pre-defined entry-point function. However, it returns an array of bytes rather than a dictionary. Here's a simple one that will just spit out the host, path, and query string as JSON:
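A minimal sketch of such a script, using the standard application(environ, start_response) entry point and saved as /var/www/scripts/myapp.wsgi to match the alias below:
#!/usr/bin/env python3
import json

def application(environ, start_response):
    # Pull the server variables of interest out of the WSGI environ dict
    data = {
        'host': environ.get('HTTP_HOST', ''),
        'path': environ.get('PATH_INFO', ''),
        'query_string': environ.get('QUERY_STRING', ''),
    }
    body = json.dumps(data).encode('utf-8')

    # A WSGI app returns an iterable of byte strings
    start_response('200 OK', [('Content-Type', 'application/json'),
                              ('Content-Length', str(len(body)))])
    return [body]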
The last step is to route certain paths to the WSGI script. This is done in the Apache VirtualHost configuration:
WSGIPythonPath /var/www/scripts
<VirtualHost *:80>
    ServerName python.mydomain.com
    ServerAdmin nobody@mydomain.com
    DocumentRoot /home/www/html

    Header set Access-Control-Allow-Origin "*"
    Header set Access-Control-Allow-Methods "*"
    Header set Access-Control-Allow-Headers "Origin, X-Requested-With, Content-Type, Accept, Authorization"

    WSGIScriptAlias /myapp /var/www/scripts/myapp.wsgi
</VirtualHost>
Upon migrating a test URL from CGI to WSGI, the page load time dropped significantly.
The improvement comes from a 50-90% reduction in “wait” and “receive” times, as measured by ThousandEyes.
Next I'd like to look at more advanced Python web frameworks like Flask, Bottle, WheezyWeb, and Tornado. Django is of course a popular option too, but I know from experience it won't be the fastest. Flask isn't the fastest either, but it is the framework for Google SAE, which I plan to learn after mastering AWS Lambda.
GCP HTTP/HTTPS Load Balancers offer great performance, and today I learned of a cool, almost-hidden feature: the ability to stamp requests with custom headers containing client GeoIP info. Here's a Terraform config snippet:
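A sketch along these lines, using the custom_request_headers argument on the backend service (the resource names and the referenced health check and instance group are placeholders):
resource "google_compute_backend_service" "web_backend" {
  name          = "web-backend"
  protocol      = "HTTP"
  health_checks = [google_compute_health_check.default.id]

  backend {
    group = google_compute_instance_group_manager.web.instance_group
  }

  # Stamp each request with the client's GeoIP country code and city
  custom_request_headers = [
    "X-Client-Geo-Location: {client_region},{client_city}"
  ]
}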
This will cause the Backend Service to stamp all HTTP requests with a custom header called “X-Client-Geo-Location” with the country abbreviation and city. It can then be parsed on the server to get this information for the client without having to rely on messy X-Forwarded-For parsing and GeoIP lookups.
Here’s a Python example that redirects the user to UK or Australia localized websites:
#!/usr/bin/env python3
import os
try:
    client_location = os.environ.get('HTTP_X_CLIENT_GEO_LOCATION', None)
    if client_location:
        [country, city] = client_location.split(',')
        websites = {'UK': "www.foo.co.uk", 'AU': "www.foo.au"}
        if country in websites:
            local_website = websites[country]
        else:
            local_website = "www.foo.com"
        print("Status: 301\nLocation: https://{}\n".format(local_website))
except Exception as e:
    print("Status: 500\nContent-Type: text/plain\n\n{}".format(e))