The VMs were deployed via Terraform using instance templates, managed instance groups, and an internal TCP/UDP load balancer with a forwarding rule for port 3128. Debian 11 (Bullseye) was selected as the OS because it has a low memory footprint while still offering a nice pre-packaged version of Squid 4.
The first problem was that the older Stackdriver agent isn't compatible with Debian 11, so I had to install the newer Ops Agent. I chose to add these lines to my startup script, pulling the install script directly from a bucket so the VMs don't need Internet access:
gsutil cp gs://public-j5-org/add-google-cloud-ops-agent-repo.sh /tmp/
bash /tmp/add-google-cloud-ops-agent-repo.sh --also-install
After re-deploying the VMs, I SSH'd in and verified that the Ops Agent was installed and running:
sudo systemctl status google-cloud-ops-agent"*"
google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
Active: active (running) since Fri 2023-02-10 22:18:17 UTC; 18min ago
Process: 4317 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/>
Main PID: 4350 (otelopscol)
Tasks: 7 (limit: 1989)
Memory: 45.7M
CPU: 1.160s
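The same check can be scripted instead of read by eye. This is a minimal sketch, not part of the original setup. The function name `ops_agent_healthy` is made up for illustration. The first unit name comes from the output above; the second, `google-cloud-ops-agent-fluent-bit`, is my assumption for the logging sub-agent and may differ between agent versions.

```shell
#!/usr/bin/env bash
# Return 0 only if every listed Ops Agent unit is active.
# Unit names: the first is from the status output above; the second is an
# assumption about the logging sub-agent and may differ between agent versions.
ops_agent_healthy() {
  local unit
  for unit in \
    google-cloud-ops-agent-opentelemetry-collector \
    google-cloud-ops-agent-fluent-bit; do
    # "is-active --quiet" prints nothing and reports state via the exit code
    systemctl is-active --quiet "$unit" || {
      echo "$unit is not active" >&2
      return 1
    }
  done
}
```

Calling `ops_agent_healthy && echo "Ops Agent OK"` on the VM gives a quick yes/no answer. As this post goes on to show, a running agent does not guarantee that data is actually being delivered.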
After waiting a couple of minutes, I still didn't see any data in the graphs, so I downloaded and ran Google's diagnostic script:
gsutil cp gs://public-j5-org/diagnose-agents.sh /tmp/ && bash /tmp/diagnose-agents.sh
This was confusing because the script itself didn't show any errors. The actual log was dumped to disk in a sub-directory of /var/tmp/google-agents/, and the agent-info.txt file there did indicate a problem:
API Check - Result: FAIL, Error code: LogApiPermissionErr, Failure: Service account is missing the roles/logging.logWriter role., Solution: Add the roles/logging.logWriter role to the Google Cloud service account., Resource: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/authorization#create-service-account
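Since the script doesn't print failures to the terminal, a small helper can pull them out of the dumped files. This is a rough sketch under two assumptions: the `Result: FAIL` pattern is taken from the sample line above rather than from any documented format, and `show_failed_checks` is a name I made up.

```shell
#!/usr/bin/env bash
# Print every failed check recorded by the diagnostic script.
# The default directory is the one mentioned above; the "Result: FAIL"
# pattern is inferred from the sample output, not from a documented format.
show_failed_checks() {
  local dir="${1:-/var/tmp/google-agents}"
  # -r: recurse into sub-directories, -h: omit file names,
  # --include: only look at agent-info.txt files (GNU grep)
  grep -rh --include='agent-info.txt' 'Result: FAIL' "$dir"
}
```

Running `show_failed_checks` with no arguments searches the default location. It exits non-zero when no failed checks are found.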
And this made sense, because the Ops Agent needs these two IAM roles granted to the VM's service account in order to function:
- Monitoring > Monitoring Metric Writer.
- Logging > Logs Writer.
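To confirm which of the two roles a service account is missing before changing anything, the granted roles can be compared against the required list. This is a sketch, not something from the original deployment. `missing_roles`, `PROJECT_ID`, and `SA_EMAIL` are placeholder names, and the role IDs are the ones corresponding to the two roles listed above.

```shell
#!/usr/bin/env bash
# Role IDs for the two roles listed above.
REQUIRED_ROLES="roles/logging.logWriter roles/monitoring.metricWriter"

# Read granted roles (one per line) on stdin and print each required
# role that is not among them.
missing_roles() {
  local granted role
  granted="$(cat)"
  for role in $REQUIRED_ROLES; do
    # -x: whole-line match, -F: fixed string, -q: no output
    if ! printf '%s\n' "$granted" | grep -qxF "$role"; then
      printf '%s\n' "$role"
    fi
  done
}

# Example usage (PROJECT_ID and SA_EMAIL are placeholders):
# gcloud projects get-iam-policy "$PROJECT_ID" \
#   --flatten="bindings[].members" \
#   --filter="bindings.members:serviceAccount:$SA_EMAIL" \
#   --format="value(bindings.role)" | missing_roles
```

Empty output means both roles are already bound at the project level. Roles inherited from a folder or organization would not show up in this listing.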
Here’s a Terraform snippet that will do that:
# Add required IAM permissions for the Ops Agent
locals {
  roles = ["logging.logWriter", "monitoring.metricWriter"]
}
resource "google_project_iam_member" "default" {
  # Both branches of the conditional must have the same type,
  # so the empty case is an empty set rather than {}
  for_each = var.service_account_email != null ? toset(local.roles) : toset([])
  project  = var.project_id
  member   = "serviceAccount:${var.service_account_email}"
  role     = "roles/${each.value}"
}
Within a few minutes of adding these, data started showing up in the graphs.
