A weird, ugly error message when running google_ha_test.py

[Expert@cp-member-a:0]# $FWDIR/scripts/google_ha_test.py
GCP HA TESTER: started
GCP HA TESTER: checking access scopes...
GCP HA TESTER: ERROR 

Expecting value: line 1 column 1 (char 0)

Got this message when trying to test a CheckPoint R81.10 cluster build in a new environment. Obviously, this error message is not at all helpful in determining what the problem is. So I wrote a little debug script to try and isolate the issue:

import traceback
import gcp as _gcp

api = _gcp.GCP('IAM', max_time=20)
metadata = api.metadata()[0]

project = metadata['project']['projectId']
zone = metadata['instance']['zone'].split('/')[-1]
name = metadata['instance']['name']

print("Got metadata: project = {}, zone = {}, name = {}\n".format(project, zone, name))
path = "/projects/{}/zones/{}/instances/{}".format(project, zone, name)

try:
    head, res = api.rest("GET", path, query=None, body=None, aggregate=False)
except Exception as e:
    print(traceback.format_exc())

Running the script, I now see an exception when trying to make the initial API call:

[Expert@cp-cluster-member-a:0]# cd $FWDIR/scripts
[Expert@cp-cluster-member-a:0]# python3 ./debug.py

Got metadata: project = myproject, zone = us-central1-b, name = cp-member-a

Traceback (most recent call last):
  File "debug.py", line 18, in <module>
    head, res = api.rest(method,path,query=None,body=None,aggregate=False)
  File "/opt/CPsuite-R81.10/fw1/scripts/gcp.py", line 327, in rest
    max_time=self.max_time, proxy=self.proxy)
  File "/opt/CPsuite-R81.10/fw1/scripts/gcp.py", line 139, in http
    headers['_code']), headers, repr(response))
gcp.HTTPException: Unexpected HTTP code: 403

This at least indicates the connection to the API is OK and it’s some type of permissions issue with the account.
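
To confirm it was a scope or permissions problem, the access scopes granted to the instance can be read straight from the metadata server. Here's a quick standard-library sketch; the URL is GCP's documented metadata endpoint:

import urllib.request

# Ask the GCE metadata server which scopes the default service account was granted
url = ("http://metadata.google.internal/computeMetadata/v1/"
       "instance/service-accounts/default/scopes")
req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
print(urllib.request.urlopen(req, timeout=5).read().decode())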

The CheckPoints have always been really tough to troubleshoot in this regard, so to keep things simple, I deploy them with the default service account for the project. It's not explicitly called out in the docs, but that account needs broad permissions on the project, and in this new environment its usual Editor role had been removed.

I was able to re-enable Editor permissions for the default service account with this Terraform code:

# Set Project ID via input variable
variable "project_id" {
  description = "GCP Project ID"
  type = string
}
# Get the default service account info for this project
data "google_compute_default_service_account" "default" {
  project = var.project_id
}
# Enable editor role for this service account
resource "google_project_iam_member" "default_service_account_editor" {
  project = var.project_id
  member  = "serviceAccount:${data.google_compute_default_service_account.default.email}"
  role    = "roles/editor"
}
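
After applying, the new binding can be double-checked from Python with the Cloud Resource Manager API. A minimal sketch, assuming Application Default Credentials and a placeholder project ID:

from googleapiclient import discovery

PROJECT_ID = "myproject"  # placeholder: your project ID

# Fetch the project's IAM policy and show who currently holds roles/editor
crm = discovery.build('cloudresourcemanager', 'v1')
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
for binding in policy.get('bindings', []):
    if binding['role'] == 'roles/editor':
        print("Editor members:", binding['members'])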

Making Async Calls to Google Cloud Storage

I have a script doing real-time log analysis, where about 25 log files are stored in a Google Cloud Storage bucket. The files are always small (1-5 MB each) but the script was taking over 10 seconds to run, resulting in slow page load times and poor user experience. Performance analysis showed that most of the time was spent on the storage calls, with high overhead of requesting individual files.

I started thinking the best way to improve performance was to make the storage calls asynchronously, downloading the files in parallel. This requires a library capable of making such calls; after lots of Googling and trial and error I found a Stack Overflow post that mentioned gcloud-aio-storage. It worked very well, and after implementation I'm seeing a 125% speed improvement!

Here’s a rundown of the steps I took to get async working with GCS.

1) Install gcloud AIO Storage:

pip install gcloud-aio-storage

2) In the Python code, start with some imports

import asyncio
from gcloud.aio.auth import Token
from gcloud.aio.storage import Storage

3) Create a function to read multiple files from the same bucket:

async def IngestLogs(bucket_name, file_names, key_file=None):

    SCOPES = ["https://www.googleapis.com/auth/cloud-platform.read-only"]
    token = Token(service_file=key_file, scopes=SCOPES)

    async with Storage(token=token) as client:
        # Queue up all the downloads, then run them concurrently
        tasks = (client.download(bucket_name, _) for _ in file_names)
        blobs = await asyncio.gather(*tasks)
    await token.close()

    return blobs

It’s important to note that ‘blobs’ will be a list, with each element holding one file’s contents as bytes.

4) Create some code to call the async function. The decode() method converts each blob from bytes to a string.


def main():

    bucket_name = "my-gcs-bucket"
    file_names = {
        'file1': "path1/file1.abc",
        'file2': "path2/file2.def",
        'file3': "path3/file3.ghi",
    }
    key = "myproject-123456-mykey.json"

    blobs = asyncio.run(IngestLogs(bucket_name, file_names.values(), key_file=key))

    for blob in blobs:
        # Print the first 75 characters of each blob
        print(blob.decode('utf-8')[0:75])

if __name__ == "__main__":
    main()

I track the load times via New Relic Synthetics, which showed a 300% performance improvement!

Using GCP Python SDK for Network Tasks

Last week, I finally got around to hitting the GCP API directly with Python. It's pretty easy to do in hindsight. The steps are below.


If not already done, install pip. On Debian 10, the command is this:

sudo apt install python3-pip

Then of course install the Python packages for GCP:

sudo pip3 install google-api-python-client google-cloud-storage

Now you’re ready to write some Python code. Start with a couple imports:

#!/usr/bin/env python3 

from googleapiclient import discovery
from google.oauth2 import service_account

By default, the default compute service account for the VM or App Engine will be used for authentication. Alternately, a service account can be specified with the key's JSON file:

KEY_FILE = '../mykey.json'
creds = service_account.Credentials.from_service_account_file(KEY_FILE)

Connecting to the Compute API will look like this. If using the default service account, the ‘credentials’ argument is not required.

resource_object = discovery.build('compute', 'v1', credentials=creds)

All API calls require that the project ID (not the name) be provided as a parameter. I will set it like this:

PROJECT_ID = "myproject-1234"

With the connection to the API established, you can now run some commands. The resource object has a method for each resource type, and each typically provides a list() method to enumerate that type's items in the project. The execute() at the end is required to actually make the call.

_ = resource_object.firewalls().list(project=PROJECT_ID).execute()

It’s important to note the list().execute() returns a dictionary. The actual list of items can be found in key ‘items’. I’ll use the get() method to retrieve the values for the ‘items’ key, or use an empty list if ‘items’ doesn’t exist. Here’s an example

firewall_rules = _.get('items', [])
print(len(firewall_rules), "firewall rules in project", PROJECT_ID)
for firewall_rule in firewall_rules:
    print(" -", firewall_rule['name'])

The API reference guide has a complete list of everything that's available. Here are some examples:

firewalls() - List firewall rules
globalAddresses() - List all global addresses
healthChecks() - List load balancer health checks
subnetworks() - List subnets within a given region
vpnTunnels() - List configured VPN tunnels

Some calls will require the region name as a parameter. To get a list of all regions, this can be done:

_ = resource_object.regions().list(project=PROJECT_ID).execute()
regions = [region['name'] for region in _.get('items', [])]

Then iterate through each region. For example to list all subnets:

for region in regions:
    _ = resource_object.subnetworks().list(project=PROJECT_ID,region=region).execute()
    print("Reading subnets for region", region ,"...")
    subnets = _.get('items', [])
    for subnet in subnets:
        print(" -", subnet['name'], subnet['ipCidrRange'])

Getting web server variables and query parameters in different Python Frameworks

As I've explored different ways of doing web programming in Python via different frameworks, I kept finding the need to examine HTTP server variables, specifically the server hostname, path, and query string. The method for doing this varies quite a bit by framework.

Given the following URL: http://www.mydomain.com:8000/derp/test.py?name=Harry&occupation=Hit%20Man

I want to create the following variables with the following values:

  • server_host is ‘www.mydomain.com’
  • server_port is 8000
  • path is ‘/derp/test.py’
  • query_params is this dictionary: {‘name’: ‘Harry’, ‘occupation’: ‘Hit Man’}
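
For reference, the standard library's urllib.parse can produce all four values from the raw URL, independent of any framework; a quick sketch:

from urllib.parse import urlsplit, parse_qsl

url = "http://www.mydomain.com:8000/derp/test.py?name=Harry&occupation=Hit%20Man"
parts = urlsplit(url)

server_host = parts.hostname                 # 'www.mydomain.com'
server_port = parts.port                     # 8000
path = parts.path                            # '/derp/test.py'
query_params = dict(parse_qsl(parts.query))  # {'name': 'Harry', 'occupation': 'Hit Man'}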

Old School CGI

cgi.FieldStorage() is the easy way to do this, but it returns a FieldStorage object, which must be converted to a dictionary.

#!/usr/bin/env python3

if __name__ == "__main__":

    import os, cgi

    server_host = os.environ.get('HTTP_HOST', 'localhost')
    server_port = os.environ.get('SERVER_PORT', 80)
    path = os.environ.get('SCRIPT_URL', '/')
    query_params = {}
    _ = cgi.FieldStorage()
    for key in _:
        query_params[key] = str(_[key].value)

Note this will convert all values to strings; FieldStorage otherwise returns str for regular form fields and bytes for file uploads, so the explicit conversion keeps the types predictable.
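
Also worth noting: the cgi module was deprecated in Python 3.11 and removed in 3.13, so this approach only works on older interpreters.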

WSGI

Similar to CGI, but the environment variables are simply passed in as a dictionary in the first parameter, so there is no need to load the os module.

def application(environ, start_response):

    from urllib import parse

    server_host = environ.get('HTTP_HOST', 'localhost')
    server_port = environ.get('SERVER_PORT', 80)
    path = environ.get('SCRIPT_URL', '/')
    query_params = {}
    if '?' in environ.get('REQUEST_URI', '/'):
        query_params = dict(parse.parse_qsl(parse.urlsplit(environ['REQUEST_URI']).query))

Since the CGI Headers don’t exist, urllib.parse can be used to analyze the REQUEST_URI environment variable in order to form the dictionary.

Flask

Flask makes this very easy. The only real trick comes with path: the leading '/' gets removed, so it must be re-added.

from flask import Flask, request

app = Flask(__name__)

# Route all possible paths here
@app.route("/", defaults={"path": ""})
@app.route('/<string:path>')
@app.route("/<path:path>")

def index(path):
      
    [server_host, server_port] = request.host.split(':')
    path =  "/" + path
    query_params = request.args
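
One caveat: request.host only includes a port when the client sent one, so the unpacking above raises a ValueError behind a proxy listening on a default port. A safer variant (a sketch, same idea):

    server_host = request.host.split(':')[0]
    server_port = request.environ.get('SERVER_PORT', 80)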
 

FastAPI

This one’s slightly different, because the main variable to examine is actually a QueryParams object, which is a form of MultiDict.

from fastapi import FastAPI, Request

app = FastAPI()

# Route all possible paths here
@app.get("/")
@app.get("/{path:path}")
def root(path, req: Request):

    [server_host, server_port] = req.headers['host'].split(':')
    path = "/" + path
    query_params = dict(req.query_params)
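
The same host-splitting caveat applies here; Starlette's Request also exposes a parsed URL object that handles a missing port gracefully (sketch):

    server_host = req.url.hostname
    server_port = req.url.port or 80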

AWS Lambda

Lambda presents a dictionary called ‘event’ to the handler and it’s simply a matter of grabbing the right keys:

def lambda_handler(event, context):

    server_host = event['headers']['host']
    server_port = event['headers']['X-Forwarded-Port']
    path = event['path']
    query_params = event['queryStringParameters']

If multiValueHeaders are enabled, the relevant keys instead hold lists of values, even when there's only one item:

    server_host = event['multiValueHeaders']['host'][0]
    query_params = {}
    for key, values in event["multiValueQueryStringParameters"].items():
        query_params[key] = values[0]

Getting started with Flask and deploying Python apps to Google App Engine

Installing Flask on Linux or Mac

On Debian 10 or Ubuntu 20:

sudo pip3 install flask flask-cors

On Mac or FreeBSD:

sudo pip install flask flask-cors

Creating a basic flask app:

#!/usr/bin/env python3

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/", defaults = {'path': ""})
@app.route("/<string:path>")
@app.route("/<path:path>")

def index(path):
    req_info = {
        'host': request.host.split(':')[0],
        'path': "/" + path,
        'query_string': request.args,
        'remote_addr': request.environ.get('HTTP_X_REAL_IP', request.remote_addr),
        'user_agent': request.user_agent.string
    }
    return jsonify(req_info)

if __name__ == '__main__':
    app.run()

Run the app:

chmod u+x main.py
./main.py
Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Do a test curl against it:

$ curl -v "http://localhost:5000/oh/snap?x=1&x=2"

< HTTP/1.0 200 OK
< Content-Type: application/json
< Content-Length: 65
< Access-Control-Allow-Origin: *
< Server: Werkzeug/1.0.1 Python/3.7.8
< Date: Wed, 21 Apr 2021 17:07:58 GMT
<
{"host":"localhost","path":"/oh/snap","query_string":{"x":"1"},"remote_addr":"127.0.0.1","user_agent":"curl/7.72.0"}

Deploying to Google Cloud App Engine

Create a requirements.txt file:

echo "flask" > requirements.txt

Create an app.yaml file:

printf "runtime: python38\nenv: standard\n" > app.yaml 

Now deploy the app to Google using the gCloud command:

gcloud app deploy
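
Once the deploy finishes, the app can be opened straight from the CLI to verify:

gcloud app browse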

IPAddress vs NetAddr in Python3

I’d heard about netaddr a few weeks ago and had made a note to try it out. What I learned today is that a similar library called ipaddress is included with Python 3, and it offers most of netaddr’s functionality, just with different syntax.

It is very handy for subnetting. Here's some basic code that takes the 198.18.128.0/18 CIDR block and splits it into four /20s:

#!/usr/bin/env python3

import ipaddress

cidr = "198.18.128.0/18"
subnet_size = "/20"

[network_address, prefix_len] = cidr.split('/')
power = int(subnet_size[1:]) - int(prefix_len)
subnets = list(ipaddress.ip_network(cidr).subnets(power))

print("{} splits into {} {}s:".format(cidr, len(subnets), subnet_size))
for i, subnet in enumerate(subnets, start=1):
    print("  Subnet #{} = {}".format(i, subnet))

Here’s the output:

198.18.128.0/18 splits into 4 /20s:
  Subnet #1 = 198.18.128.0/20
  Subnet #2 = 198.18.144.0/20
  Subnet #3 = 198.18.160.0/20
  Subnet #4 = 198.18.176.0/20
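
The same module makes membership tests trivial. Continuing with the subnets list from the script above, finding which /20 holds a given address is just (a quick sketch):

addr = ipaddress.ip_address("198.18.150.25")
for subnet in subnets:
    if addr in subnet:
        print(addr, "is in", subnet)   # 198.18.150.25 is in 198.18.144.0/20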

Migrating from CGI to WSGI for Python Web Scripts on Apache

I finally began migrating some old scripts from PHP to Python late last year, and while I was happy to finally have my PHP days behind me, I noticed the script execution speed was disappointing. On average, a Python CGI script would run 20-80% slower than an equivalent PHP script. At first I chalked it up to slower libraries, but even basic scripts that didn't rely on a database or anything fancy still seemed to incur a performance hit.

Yesterday I happened to come across a mention of WSGI, which is essentially a Python-specific replacement for CGI. I realized the overhead of CGI probably explained why my Python scripts were slower than PHP, so I wanted to give WSGI a spin and see if it could help.

Like PHP, mod_wsgi is an Apache module that is not included in many pre-packaged Apache builds, so the first step is to install it.

On Debian/Ubuntu:

sudo apt-get install libapache2-mod-wsgi-py3

The install process should auto-activate the module.

cd /etc/apache2/mods-enabled/

ls -la wsgi*
lrwxrwxrwx 1 root root 27 Mar 23 22:13 wsgi.conf -> ../mods-available/wsgi.conf
lrwxrwxrwx 1 root root 27 Mar 23 22:13 wsgi.load -> ../mods-available/wsgi.load

On FreeBSD, the module does not get auto-activated and must be loaded via a config file:

sudo pkg install ap24-py37-mod_wsgi

# Create /usr/local/etc/apache24/Includes/wsgi.conf
# or similar, and add this line:
LoadModule wsgi_module libexec/apache24/mod_wsgi.so

Like CGI, the directory with the WSGI script needs special permissions. As a security best practice, it's a good idea to locate scripts outside of any DocumentRoot, so they can't accidentally get served as plain files.

<Directory "/var/www/scripts">
  Require all granted
</Directory>

As for the WSGI script itself, it's similar to an AWS Lambda handler in that it uses a pre-defined function. However, it returns a list of bytes rather than a dictionary. Here's a simple one that will just spit out the host, path, and query string as JSON:

def application(environ, start_response):

    import json, traceback

    try:
        request = {
            'host': environ.get('HTTP_HOST', 'localhost'),
            'path': environ.get('REQUEST_URI', '/'),
            'query_string': {}
        }
        if '?' in request['path']:
            request['path'], query_string = environ.get('REQUEST_URI', '/').split('?')
            for _ in query_string.split('&'):
                [key, value] = _.split('=')
                request['query_string'][key] = value

        output = json.dumps(request, sort_keys=True, indent=2)
        response_headers = [
            ('Content-type', 'application/json'),
            ('Content-Length', str(len(output))),
            ('X-Backend-Server', 'Apache + mod_wsgi')
        ]
        start_response('200 OK', response_headers)
        return [ output.encode('utf-8') ]
            
    except:
        response_headers = [ ('Content-type', 'text/plain') ]
        start_response('500 Internal Server Error', response_headers)
        error = traceback.format_exc()
        return [ str(error).encode('utf-8') ]

The last step is to route certain paths to the WSGI script. This is done in the Apache VirtualHost configuration:

WSGIPythonPath /var/www/scripts

<VirtualHost *:80>
  ServerName python.mydomain.com
  ServerAdmin nobody@mydomain.com
  DocumentRoot /home/www/html
  Header set Access-Control-Allow-Origin: "*"
  Header set Access-Control-Allow-Methods: "*"
  Header set Access-Control-Allow-Headers: "Origin, X-Requested-With, Content-Type, Accept, Authorization"
  WSGIScriptAlias /myapp /var/www/scripts/myapp.wsgi
</VirtualHost>
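
After updating the VirtualHost, validate the config and reload Apache so the changes take effect. On Debian/Ubuntu:

sudo apachectl configtest && sudo systemctl reload apache2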

Upon migrating a test URL from CGI to WSGI, the page load time dropped significantly. Per ThousandEyes, the improvement comes from a 50-90% reduction in "wait" and "receive" times.

I’d next want to look at more advanced Python web frameworks like Flask, Bottle, WheezyWeb and Tornado. Django is of course a popular option too, but I know from experience it won’t be the fastest. Flask isn’t the fastest either, but it is the framework for Google App Engine, which I plan to learn after mastering AWS Lambda.

Using the Built-in GeoIP Functionality of GCP HTTP/HTTPS Load Balancers

GCP HTTP/HTTPS Load Balancers offer great performance, and today I learned of a cool almost hidden feature: the ability to stamp custom headers with client GeoIP info. Here’s a Terraform config snippet:

resource "google_compute_backend_service" "MY_BACKEND_SERVICE" {
  provider                 = google-beta
  name                     = "my-backend-service"
  health_checks            = [ google_compute_health_check.MY_HEALTHCHECK.id ]
  backend {
    group                  = google_compute_instance_group.MY_INSTANCE_GROUP.self_link
    balancing_mode         = "UTILIZATION"
    max_utilization        = 1.0
  }
  custom_request_headers   = [ "X-Client-Geo-Location: {client_region},{client_city}" ]
  custom_response_headers  = [ "X-Cache-Hit: {cdn_cache_status}" ]
}

This will cause the Backend Service to stamp all HTTP requests with a custom header called “X-Client-Geo-Location” with the country abbreviation and city. It can then be parsed on the server to get this information for the client without having to rely on messy X-Forwarded-For parsing and GeoIP lookups.

Here’s a Python example that redirects the user to UK or Australia localized websites:

#!/usr/bin/env python3

import os

try:
    country = None  # stays None if the header is missing
    client_location = os.environ.get('HTTP_X_CLIENT_GEO_LOCATION')
    if client_location:
        [country, city] = client_location.split(',')
    websites = { 'UK': "www.foo.co.uk", 'AU': "www.foo.au" }
    if country in websites:
        local_website = websites[country]
    else:
        local_website = "www.foo.com"
    print("Status: 301\nLocation: https://{}\n".format(local_website))

except Exception as e:
    print("Status: 500\nContent-Type: text/plain\n\n{}".format(e))

Working with HTTP Requests & Responses in Python

http.client

http.client is very fast but low-level, and takes a few more lines of code than the others.

import http.client
import ssl

try:
    # Note: ssl._create_unverified_context() disables certificate verification
    conn = http.client.HTTPSConnection("www.hightail.com", port=443, timeout=5, context=ssl._create_unverified_context())
    conn.request("GET", "/en_US/theme_default/images/hightop_250px.png")
    resp = conn.getresponse()
    if resp.status in (301, 302):
        print("Status: {}\nLocation: {}\n".format(resp.status, resp.headers['Location']))
    else:
        print("Status: {}\nContent-Type: {}\n".format(resp.status, resp.headers['Content-Type']))

except Exception as e:
    print("Status: 500\nContent-Type: text/plain\n\n{}".format(e))

Requests

Requests is a third-party, high-level library. It has a simpler interface, but it is much slower than http.client and is not natively supported on AWS Lambda.

import requests

url = "http://www.yousendit.com"
try:
    resp = requests.get(url, params={}, timeout=5, allow_redirects=False)
    if resp.status_code in (301, 302):
        print("Status: {}\nLocation: {}\n".format(resp.status_code, resp.headers['Location']))
    else:
        print("Status: {}\nContent-Type: {}\n".format(resp.status_code, resp.headers['Content-Type']))

except Exception as e:
    print("Status: 500\nContent-Type: text/plain\n\n{}".format(e))

Migrating to MaxMind GeoIP2 for Python3

With Python 2 now EOL, one of my tasks was to replace old python2/geolite2 code with python3/geoip2. This does require a MaxMind subscription, either to make the calls via their web service or to download static database files, which fortunately was an option.

Installing Python3 GeoIP2 package

On Ubuntu 20:

  • apt install python3-pip
  • pip3 install geoip2

On FreeBSD 11.4:

  • pkg install python3
  • pkg install py37-pip
  • pip install geoip2

Verify successful install

% python3
Python 3.7.8 (default, Aug  8 2020, 01:18:05) 
[Clang 8.0.0 (tags/RELEASE_800/final 356365)] on freebsd11
Type "help", "copyright", "credits" or "license" for more information.
>>> import geoip2.database
>>> help(geoip2.database.Reader) 
Help on class Reader in module geoip2.database:

Sample Python Script

#!/usr/bin/env python3

import sys
import geoip2.database

ipv4_address = input("Enter an IPv4 address: ")

with geoip2.database.Reader('/var/db/GeoIP2-City.mmdb') as reader:
    try:
        response = reader.city(ipv4_address)
    except Exception:
        sys.exit("No info for address: " + ipv4_address)
    lat = response.location.latitude
    lng = response.location.longitude
    city = response.city.name
    print("lat: {}, lng: {}, city: {}".format(lat, lng, city))