Monitor

Checking the health of a Microsegmentation Console

The Microsegmentation Console provides a /healthchecks endpoint that does not require authentication. An example query follows.
curl https://microsegmentation.acme.com/healthchecks\?quiet\=true | jq
Example response:
[ { "responseTime": "195µs", "status": "Operational", "type": "Cache" }, { "responseTime": "2ms", "status": "Operational", "type": "Database" }, { "responseTime": "346µs", "status": "Operational", "type": "MessagingSystem" }, { "responseTime": "0s", "status": "Operational", "type": "Service" }, { "responseTime": "2ms", "status": "Operational", "type": "TSDB" } ]
The endpoint also returns one of the following HTTP status codes.
  • 200 OK: the services and API gateway are operational
  • 218: a critical service is down
  • 503: the API gateway is down
You can use the endpoint as an external health check probe.
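For example, a minimal probe script (a sketch; the URL reuses the example Console address above) can map the documented status codes to an exit code that an external monitor understands:
#!/bin/sh
# Query the healthchecks endpoint and capture only the HTTP status code.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' 'https://microsegmentation.acme.com/healthchecks?quiet=true')
case "$STATUS" in
  200) echo "healthy"; exit 0 ;;
  218) echo "critical service down"; exit 1 ;;
  503) echo "API gateway down"; exit 1 ;;
  *)   echo "unexpected status: $STATUS"; exit 1 ;;
esac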
If you prefer, you can also use apoctl to discover the status of the Microsegmentation Console. For example, the following command returns status information in table format.
apoctl api list healthchecks -o table -c type,status
A healthy Microsegmentation Console responds with the following.
   status     |      type
--------------+------------------
  Operational | Cache
  Operational | Database
  Operational | MessagingSystem
  Operational | Service
  Operational | TSDB
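You can also script a quick check on top of this output. A sketch, assuming the two header lines of the table output shown above:
# Print any component whose status is not Operational
# (tail skips the two table header lines).
apoctl api list healthchecks -o table -c type,status | tail -n +3 | grep -v 'Operational'
Note that grep exits non-zero when every component is Operational, so invert the exit code if you wire this into an alerting script.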

Enabling additional monitoring tools

About the additional monitoring tools

After migrating to the multi-container Microsegmentation Console, you can use the following monitoring tools.
  • Prometheus and Grafana to scrape and display metrics about resource usage
  • Jaeger to help you trace and debug issues with the Microsegmentation Console API
These tools are not available for single container deployments.

Enabling additional monitoring tools

The first time you create your Voila environment, the installer asks whether you want to deploy the monitoring facilities:
[Optional] Do you want to configure the monitoring stack? (y/n) default is no: y
[Optional] What will be the public url of the metrics monitoring dashboard?
> Public Monitoring URL (example: https://monitoring.aporeto.com, leave it empty to not deploy this component) (default): https://monitoring.aporeto.company.tld
[Optional] What will be the public url of the metrics API proxy endpoint (used for prometheus federation)?
> Public metric URL (example: https://metrics.aporeto.com, leave it empty to not deploy this component) (default):
[Optional] Do you want to install the opentracing facility? (y/n) default is no: y
[Optional] Do you want to install the logging facility? (y/n) default is yes: y
You can modify the deployment of monitoring facilities after installation, as explained below.
To enable the monitoring stack:
set_value global.public.monitoring https://monitoring.microsegmentation.acme.com override
This will deploy Prometheus to scrape service data and serve it through Grafana dashboards. It also enables alerts (see below). From there you can add more facilities.
To enable the logging facility (recommended):
set_value enabled true|false loki override
This will deploy or remove Loki to scrape all service logs. This is useful for reporting crashes and troubleshooting issues.
To enable the tracing facility (optional):
set_value enabled true|false jaeger override
set_value enabled true|false elasticsearch override
This will deploy or remove the tracing facility, which allows you to inspect every Microsegmentation Console API request. In general, we recommend enabling this during development or to analyze slow requests. Note that it can generate several gigabytes of data every hour, causing the storage to fill up very quickly.
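For example, to remove the tracing facility once you are done debugging, flip both toggles off and apply the change (the doit command is described below):
set_value enabled false jaeger override
set_value enabled false elasticsearch override
doit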
To enable the metrics proxy (optional), which you can use to remotely query alerts and metrics and to perform Prometheus federation:
set_value global.public.metrics https://metrics.microsegmentation.acme.com override
This will create a service that you can use to pull metrics from outside (see below).
You can pick which facilities you want to deploy depending on your needs. Once you have enabled the tools that you need, issue the following command to apply the changes.
doit
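For example, a complete session that enables the monitoring stack and the logging facility and then applies the changes (the URL is illustrative):
set_value global.public.monitoring https://monitoring.microsegmentation.acme.com override
set_value enabled true loki override
doit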

Monitoring dashboards

Use the following command to obtain the monitoring dashboard URL.
get_value global.public.monitoring
To access the monitoring dashboards, you need an auditer certificate.
Use the gen-auditer tool from your Voila environment to generate a .p12 certificate file in certs/auditers. You must import this certificate into your workstation.
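A minimal sketch of the flow, assuming a macOS workstation and an illustrative .p12 file name (the actual name of the generated file will differ):
# Generate the auditer certificate from the Voila environment.
gen-auditer
# Import the .p12 into the login keychain (macOS shown; file name is illustrative).
security import certs/auditers/auditer.p12 -k ~/Library/Keychains/login.keychain-db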

Dashboard metrics

Several metrics are collected using Prometheus and shown as dashboards through Grafana. The main dashboards are described below.
This dashboard gives you an operational overview of the platform, the load on the nodes, the storage, and the current alerts and logs. This is your go-to dashboard when you want to check the health of your platform.
This dashboard gives you more details about the resource usage of nodes and services. You can use these details to pinpoint any compute resource contention.
This dashboard provides an overview of the state of the Microsegmentation Console and the general state of the components and compute resource usage. This is your second go-to dashboard when you want to check the health of your platform.
This dashboard provides a detailed view of all the microservices. You can use this mostly to debug issues and track leaks.
This dashboard provides a Kubernetes resource allocation view. You can use it to locate resource starvation or overuse.
This dashboard provides advanced information about MongoDB and sharding.
This dashboard is reachable via the Explore feature on Grafana, available from the compass icon on the left. Use the top bar to select the facility you wish to explore. If you enabled the tracing facility, you can select jaeger-aporeto to see the traces.
You can log in to Grafana as an admin if needed. The username is admin, and you can retrieve the password with get_value global.accounts.grafana.pass.
By default, there is no data persistence on the dashboards. If you want to make persistent changes, enable persistence by adding storage to Grafana with:
set_value storage.class <sc> grafana override
Where <sc> is the storage class of your Kubernetes cluster.
Then update the Grafana deployment with:
snap -u grafana --force
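For example, assuming your cluster exposes a storage class named standard (substitute your own):
# Attach persistent storage to Grafana, then redeploy it.
set_value storage.class standard grafana override
snap -u grafana --force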

Configuring alerts

Using the gathered metrics, some alerts are defined to check the health of the Microsegmentation Console and report issues.
Alert                               | Description                                            | What to do
------------------------------------+--------------------------------------------------------+---------------------------------------------------------------
Node autoscale daily report         | Daily report of node scaling                           | N/A
Service autoscale daily report      | Daily report of service scaling                        | N/A
Service autoscaled limit            | Alert when a service reaches its scaling limits        | Check resource usage
API response time increased         | When the response time of an API has increased         | Nothing if transient; otherwise check for resource contention
API response time degraded          | When the global response time is getting too high      | Nothing if transient; otherwise check for resource contention
Service restarted                   | When a service restarts                                | Check the service logs via Grafana > Explore > Loki
Error 5xx detected                  | When a service reports a 5xx error                     | Check the service logs via Grafana > Explore > Loki
Crash detected                      | When a service is crashing                             | Check the service logs via Grafana > Explore > Loki
MongoDB node not responding         | When a database node is not responding                 | Check the status of your Kubernetes cluster and pods
MongoDB cluster degraded            | When the database cluster is degraded                  | Check the status of your Kubernetes cluster and pods; use mgos status from Voila
MongoDB replication lag is too slow | When the database is struggling to replicate data      | Check for resource contention on the MongoDB nodes
Storage capacity is running low     | When the storage is running low                        | Expand the storage
Certificate about to expire         | When the public-facing certificate is about to expire  | Renew your certificate
Service is not running              | When a service is not starting                         | Check the state of the pod in Kubernetes (k describe pod <podname>)
Infra service stopped responding    | When an infra service is not responding                | Check the state of the pod in Kubernetes (k describe pod <podname>)
Backend service stopped responding  | When a backend service is not responding               | Check the state of the pod in Kubernetes (k describe pod <podname>)
High Node CPU Usage                 | When the CPU usage on a node is too high               | Might require scaling up your environment if it persists
High Node Memory Usage              | When the memory usage on nodes is too high             | Might require scaling up your environment if it persists
Through the /healthchecks API, you can get a summary of the currently firing alerts:
[ { "responseTime": "1.123ms", "status": "Operational", "type": "Cache" }, { "responseTime": "6ms", "status": "Operational", "type": "Database" }, { "status": "Operational", "type": "MessagingSystem" }, { "alerts": [ "1 critical active alert reported for database type." ], "name": "Monitoring", "status": "Degraded", "type": "General" }, { "status": "Operational", "type": "Service" }, { "responseTime": "7ms", "status": "Operational", "type": "TSDB" } ]
The lack of detail is intentional, as this endpoint is public.
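For example, to list only the entries that are not fully operational (reusing the example Console URL from earlier):
curl -s 'https://microsegmentation.acme.com/healthchecks' | jq '.[] | select(.status != "Operational")'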
Alerts can be sent to a Slack channel by configuring the following:
set_value global.integrations.slack.webhook "https://hooks.slack.com/services/XXX/YYY/ZZZ"
set_value global.integrations.slack.channel "#mychannel"
Then update the Prometheus deployment with:
snap -u prometheus-aporeto --force
If you want to define your own alerting provider, you can pass a custom AlertManager configuration as follows:
custom:
  alerts:
    # Your Alertmanager configuration goes here
    global:
      ....
  rules:
    # Your Prometheus rules go here
    groups:
      ....
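As a minimal sketch of what such a configuration might look like: the keys under alerts follow the standard Alertmanager configuration schema, and the receiver name and webhook URL are illustrative.
custom:
  alerts:
    global:
      resolve_timeout: 5m
    # Route every alert to a single webhook receiver (illustrative).
    route:
      receiver: ops-webhook
    receivers:
      - name: ops-webhook
        webhook_configs:
          - url: https://alerts.example.com/hook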

Configuring a metrics proxy

About the metrics proxy

The Microsegmentation Console uses Prometheus to gather statistics on all microservices and Kubernetes endpoints. The metrics proxy allows you to expose those metrics, for instance to perform Prometheus federation.

Generating a client certificate

You will need a client certificate to access the Prometheus federation endpoint. Generate one with the following command:
gen-colonoscope
Now configure the client that will scrape data from the Prometheus federation endpoint to use the generated client certificate and key (written to certs/colonoscopes). Once done, the alerts and federate endpoints will be available.

Pulling alerts

You can retrieve alerts from the following endpoint.
https://<fqdn>/alertmanager/api/v1/alerts
(where <fqdn> is what you configured as global.public.metrics).
This can be used to pull alerts, as documented on the Prometheus website.
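For example, reusing the client certificate generated with gen-colonoscope (the file names are illustrative and mirror the federation example below):
curl -k https://<fqdn>/alertmanager/api/v1/alerts \
  --cert ./certs/colonoscopes/example-cert.pem \
  --key ./certs/colonoscopes/example-key.pem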
Example of output:
"status": "success", "data": [ { "labels": { "alertname": "Backend service restarted", "color": "warning", "exported_pod": "canyon-75c9f966dc-g7rgj", "icon": ":gear:", "prometheus": "default/aporeto", "reason": "Completed", "recover": "false", "severity": "severe" }, "annotations": { "summary": "canyon-75c9f966dc-g7rgj restarted. Reason: Completed." }, "startsAt": "2020-01-22T20:59:10.521318559Z", "endsAt": "2020-01-22T21:02:10.521318559Z", "generatorURL": "http://prometheus-aporeto-0:9090/graph?g0.expr=%28%28sum+by%28exported_pod%29+%28kube_pod_container_status_restarts_total%29+-+sum+by%28exported_pod%29+%28kube_pod_container_status_restarts_total+offset+1m%29%29+%21%3D+0%29+-+on%28exported_pod%29+group_right%28%29+count+by%28exported_pod%2C+reason%29+%28kube_pod_container_status_last_terminated_reason+%3E+0%29&g0.tab=1", "status": { "state": "active", "silencedBy": [], "inhibitedBy": [] }, "receivers": [ "norecover" ], "fingerprint": "5a483f5586d6de87" } ] }

Federating Prometheus

You can use the following endpoint to federate Prometheus instances together.
https://<fqdn>/prometheus/federate
Example request:
curl -k https://<fqdn>/prometheus/federate \
  --cert ./certs/colonoscopes/example-cert.pem \
  --key ./certs/colonoscopes/example-key.pem \
  -G --data-urlencode 'match[]={type=~"aporeto|database"}'
This request will pull all current metrics.
Subset of output:
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="GET",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/oidcproviders",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 2 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="POST",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/appcredentials",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 31 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="POST",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/servicetoken",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 1689 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="PUT",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/appcredentials/:id",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 25 1579736744156
Example of Prometheus configuration used to scrape data from the Microsegmentation Prometheus instance:
scrape_configs:
  - job_name: "federate"
    scheme: https
    scrape_interval: 15s
    tls_config:
      ca_file: path-to-ca-cert.pem
      cert_file: path-to-client-cert.pem
      key_file: path-to-client-cert-key.pem
      insecure_skip_verify: false
    honor_labels: true
    metrics_path: "/prometheus/federate"
    params:
      "match[]":
        - '{type=~"aporeto|database"}'
    static_configs:
      - targets:
          - "<fqdn>"

Checking capacity

Among all the metrics reported, some capacity metrics are also available:
# HELP aporeto_enforcers_collection_duration_seconds The enforcer count collection duration in seconds.
# TYPE aporeto_enforcers_collection_duration_seconds gauge
aporeto_enforcers_collection_duration_seconds 0.003
# HELP aporeto_enforcers_total The enforcer count metric
# TYPE aporeto_enforcers_total gauge
aporeto_enforcers_total{unreachable="false"} 0
aporeto_enforcers_total{unreachable="true"} 0
# HELP aporeto_flowreports The flowreports metric for interval
# TYPE aporeto_flowreports gauge
aporeto_flowreports{action="accept",interval="15m0s"} 0
aporeto_flowreports{action="reject",interval="15m0s"} 0
# HELP aporeto_flowreports_collection_duration_seconds The flowreports collection duration in seconds.
# TYPE aporeto_flowreports_collection_duration_seconds gauge
aporeto_flowreports_collection_duration_seconds 0.004
# HELP aporeto_namespaces_collection_duration_seconds The namespace count collection duration in seconds.
# TYPE aporeto_namespaces_collection_duration_seconds gauge
aporeto_namespaces_collection_duration_seconds 0.005
# HELP aporeto_namespaces_total The namespaces count metric
# TYPE aporeto_namespaces_total gauge
aporeto_namespaces_total 3
# HELP aporeto_policies_collection_duration_seconds The policies count collection duration in seconds.
# TYPE aporeto_policies_collection_duration_seconds gauge
aporeto_policies_collection_duration_seconds 0.005
# HELP aporeto_policies_total The policies count metric
# TYPE aporeto_policies_total gauge
aporeto_policies_total 7
# HELP aporeto_processingunits_collection_duration_seconds The processing units count collection duration in seconds.
# TYPE aporeto_processingunits_collection_duration_seconds gauge
aporeto_processingunits_collection_duration_seconds 0.004
# HELP aporeto_processingunits_total The processing units count metric
# TYPE aporeto_processingunits_total gauge
aporeto_processingunits_total 0
Those metrics are used on the operational dashboard.
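For example, a sketch that pulls only these capacity metrics through the federation endpoint, using a name-based selector and the same illustrative client certificate as above:
curl -k https://<fqdn>/prometheus/federate \
  --cert ./certs/colonoscopes/example-cert.pem \
  --key ./certs/colonoscopes/example-key.pem \
  -G --data-urlencode 'match[]={__name__=~"aporeto_.+"}'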
