I have a script doing real-time log analysis, where about 25 log files are stored in a Google Cloud Storage bucket. The files are always small (1-5 MB each) but the script was taking over 10 seconds to run, resulting in slow page load times and poor user experience. Performance analysis showed that most of the time was spent on the storage calls, with high overhead of requesting individual files.
I figured the best way to improve performance was to make the storage calls asynchronously, so that the files download concurrently instead of one after another. This requires a library capable of making such calls; after lots of Googling and trial and error, I found a Stack Overflow post that mentioned gcloud AIO Storage. It worked very well, and after implementing it I’m seeing a 125% speed improvement!
Here’s a rundown of the steps I took to get async working with GCS.
1) Install gcloud AIO Storage:
pip install gcloud-aio-storage
2) In the Python code, start with some imports:
import asyncio
from gcloud.aio.auth import Token
from gcloud.aio.storage import Storage
3) Create a function to read multiple files from the same bucket:
async def IngestLogs(bucket_name, file_names, key_file=None):
    SCOPES = ["https://www.googleapis.com/auth/cloud-platform.read-only"]
    token = Token(service_file=key_file, scopes=SCOPES)
    async with Storage(token=token) as client:
        tasks = (client.download(bucket_name, _) for _ in file_names)
        blobs = await asyncio.gather(*tasks)
    await token.close()
    return blobs
It’s important to note that ‘blobs’ will be a list of bytes objects, one per file, in the same order as the file names passed in.
4) Create some code to call the async function. The decode() method converts each blob to a string.
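To see why gathering the downloads helps, here is a minimal sketch that needs no GCS access. It assumes a hypothetical fake_download() coroutine standing in for client.download(); it just sleeps for a moment and returns bytes. The names fake_download, fetch_all and timed_fetch are made up for illustration and are not part of gcloud-aio-storage.

```python
import asyncio
import time


async def fake_download(name, delay=0.2):
    # Hypothetical stand-in for client.download(): wait, then return bytes.
    await asyncio.sleep(delay)
    return f"contents of {name}".encode("utf-8")


async def fetch_all(names):
    # gather() runs all the coroutines concurrently and returns their
    # results in the same order as the arguments it was given.
    return await asyncio.gather(*(fake_download(n) for n in names))


def timed_fetch(names):
    # Return the blobs plus the wall-clock time the whole batch took.
    start = time.perf_counter()
    blobs = asyncio.run(fetch_all(names))
    return blobs, time.perf_counter() - start


if __name__ == "__main__":
    blobs, elapsed = timed_fetch(["a.log", "b.log", "c.log", "d.log", "e.log"])
    print(f"{len(blobs)} blobs in {elapsed:.2f}s")
```

Five simulated downloads of 0.2 seconds each would take about a second if awaited one at a time; gathered, the batch should take roughly as long as the slowest single download.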
def main():
    bucket_name = "my-gcs-bucket"
    file_names = {
        'file1': "path1/file1.abc",
        'file2': "path2/file2.def",
        'file3': "path3/file3.ghi",
    }
    key = "myproject-123456-mykey.json"
    blobs = asyncio.run(IngestLogs(bucket_name, file_names.values(), key_file=key))
    for blob in blobs:
        # Print the first 75 characters of each blob
        print(blob.decode('utf-8')[0:75])

if __name__ == "__main__":
    main()
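If you want the actual first line of each file rather than a fixed number of characters, something like the sketch below might work. It assumes the blobs come back in the same order as the dict of file names (which holds when the dict's values are passed straight to asyncio.gather, as above); first_lines is a hypothetical helper, not part of any library.

```python
def first_lines(file_names, blobs, encoding="utf-8"):
    """Map each label in file_names to the first line of its blob."""
    result = {}
    # Iterating a dict yields its keys in insertion order, which matches
    # the order of file_names.values() used for the downloads.
    for label, blob in zip(file_names, blobs):
        # errors="replace" avoids an exception on stray non-UTF-8 bytes.
        text = blob.decode(encoding, errors="replace")
        lines = text.splitlines()
        result[label] = lines[0] if lines else ""
    return result


if __name__ == "__main__":
    names = {"file1": "path1/file1.abc", "file2": "path2/file2.def"}
    sample = [b"first entry\nsecond entry\n", b"only one line"]
    for label, line in first_lines(names, sample).items():
        print(label, "->", line)
```

Keeping the labels alongside the text makes it easier to tell which log a given line came from.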
I track the page load times via New Relic Synthetics, which showed a 300% performance improvement!
