Observability and Instrumentation

This is a quick post on observability and instrumentation, two techniques you can use to monitor applications.

This post covers a few implementations. It does not cover everything out there, nor does it recommend one tool over another. Additionally, each language and tech stack may have its own recommended toolset.

Let’s start with a definition.

Simply put, observability is the ability to measure the internals of a system.

You rely on this whenever you monitor an application, for example by pinging a healthcheck endpoint to find out whether a system is up.
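As a minimal sketch of such a check in Ruby, using only the standard library: the `system_up?` helper and the URL are made up for illustration, and in practice a monitoring tool would usually run a check like this on a schedule.

```ruby
require "net/http"
require "uri"

# Returns true when the healthcheck URL answers with a 2xx status, and
# false on any other status, timeout or connection error.
# `system_up?` and the URL below are hypothetical, for illustration only.
def system_up?(url, timeout: 2)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

if system_up?("http://localhost:4000/health")
  puts "up"
else
  puts "down"
end
```

Treating every error as "down" keeps the check simple. A fuller version might distinguish a timeout from a refused connection.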

Being able to monitor your application is part of ensuring the reliability of the systems you build and maintain. It lets you be proactive as well as reactive, and it helps when troubleshooting issues.

For instance, if you had high load on a particular day, you would want to see when resource usage started to increase, know what “healthy” looks like, and identify when issues may arise.

Real Life Examples

I’ve been in tech since 2006, and I’ve seen some interesting production issues over the years.

One widely known example comes from the early days of Twitter: the site would experience performance problems as it grew in popularity. Users would see images similar to the below when there were performance issues.

When performance problems begin to occur in your application, you want to detect them before major problems occur and deal with them quickly.

Below is an example of traffic at peak load for a particular site. Being able to view the trend of an application’s performance provides insight and allows you to be proactive with monitoring.

Instrumentation

With instrumentation, you specify how you want to observe the internals of the application.

Here are a couple of examples:

  • The system posts data to a payment gateway. You want to know how many requests were successful and how many failed. Here you could use a counter to count the number of successes and the number of failures. You could then graph the counts and compare the numbers to highlight issues. You could also set up alerting for when a high number of failures occurs.
  • You create an API endpoint for external systems to fetch data. You want to know how long requests take and observe where the bottlenecks in your code are. You may want to wrap timers around calls to the database and calls to other external systems. You can then graph the time particular calls take to understand these bottlenecks.
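The two examples above can be sketched in plain Ruby as follows. This is only an in-memory illustration: the `Metrics` class, the metric names and the `post_payment` helper are all invented here, and a real setup would send these values to a tool such as Graphite or Prometheus.

```ruby
# A tiny in-memory metrics store, for illustration only.
class Metrics
  attr_reader :counters, :timings

  def initialize
    @counters = Hash.new(0)                    # metric name => count
    @timings  = Hash.new { |h, k| h[k] = [] }  # metric name => durations in seconds
  end

  def increment(name, by = 1)
    @counters[name] += by
  end

  # Runs the block, records how long it took (even if it raises),
  # and returns the block's result.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @timings[name] << (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
  end
end

# `gateway` is a hypothetical stand-in: any object responding to `call(payload)`.
def post_payment(metrics, gateway, payload)
  result = metrics.time("payment_gateway.request") { gateway.call(payload) }
  metrics.increment("payment_gateway.success")
  result
rescue StandardError
  metrics.increment("payment_gateway.failure")
  raise
end

metrics = Metrics.new
post_payment(metrics, ->(payload) { "accepted #{payload[:amount]}" }, { amount: 10 })
puts metrics.counters.inspect
```

The failure counter is incremented in the `rescue` branch and the error is re-raised, so instrumenting the call does not change how the caller sees failures.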

There are a lot of tools out there for observability and instrumentation. I’ll go through a few here (from real life examples and referring to documentation).

(1) Custom Instrumentation

This is used to capture metrics for applications. The metric types typically available include counters, timers and histograms.

Using Graphite

Below is an example in Ruby using Graphite.

Source: https://github.com/kontera-technologies/graphite-api

The example above increments two metrics, jobs_in_queue and num_errors, by 1.
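For comparison, here is a rough sketch that does not use the graphite-api gem. Instead it formats the same two metric names in Graphite's plaintext protocol (one `path value timestamp` line per metric), which Carbon conventionally accepts on TCP port 2003. The host name and helper names are assumptions. Note that the plaintext protocol sends absolute values, so "incrementing" has to be tracked by the client or by an aggregator in front of Graphite.

```ruby
require "socket"

# Formats one metric in Graphite's plaintext protocol: "path value timestamp\n".
def graphite_line(path, value, at = Time.now)
  "#{path} #{value} #{at.to_i}\n"
end

# Sends a hash of metric name => value to a Carbon listener.
# The host is hypothetical; 2003 is Carbon's conventional plaintext port.
def send_to_graphite(metrics, host: "graphite.example.com", port: 2003)
  payload = metrics.map { |path, value| graphite_line(path, value) }.join
  TCPSocket.open(host, port) { |socket| socket.write(payload) }
end

# Only the formatting is exercised here, so no Graphite server is needed.
print graphite_line("jobs_in_queue", 1)
print graphite_line("num_errors", 1)
```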

Using Prometheus

Here is an example in Elixir using Prometheus

Source: https://hexdocs.pm/prometheus_ex/Prometheus.Metric.Counter.html

Let’s say you wanted to count the number of times an API endpoint was called. You could implement the following:

MyServiceInstrumenter.inc("Locations API Endpoint")

This would count the number of times the locations API endpoint was requested in the application.
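To show the idea behind such a counter without depending on a particular client library, here is a rough Ruby sketch (not the Elixir library's API) of a labelled counter that renders itself in the Prometheus text exposition format. `LabelledCounter` and the metric name are invented for illustration. In practice you would use the official client library for your language.

```ruby
# A minimal labelled counter, for illustration only.
class LabelledCounter
  def initialize(name, help)
    @name = name
    @help = help
    @values = Hash.new(0) # label set (a Hash) => running total
  end

  def inc(labels = {})
    @values[labels] += 1
  end

  def get(labels = {})
    @values[labels]
  end

  # Renders the counter the way a Prometheus scrape endpoint exposes metrics.
  def to_text
    lines = ["# HELP #{@name} #{@help}", "# TYPE #{@name} counter"]
    @values.each do |labels, total|
      pairs = labels.map { |key, value| %(#{key}="#{value}") }.join(",")
      suffix = labels.empty? ? "" : "{#{pairs}}"
      lines << "#{@name}#{suffix} #{total}"
    end
    lines.join("\n") + "\n"
  end
end

requests = LabelledCounter.new("api_requests_total", "API requests by endpoint")
3.times { requests.inc(endpoint: "locations") }
puts requests.to_text
```

Using a label for the endpoint, rather than one metric per endpoint, lets a single counter cover every endpoint while still letting you filter by it later.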

Here is an example of how the data is captured (in buckets) with Prometheus
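To make the bucket idea concrete, here is a small Ruby sketch of a cumulative histogram in the style Prometheus uses, where each bucket counts the observations less than or equal to its upper bound. The `Histogram` class and the bucket bounds are assumptions chosen for illustration, not a client library's API.

```ruby
# Cumulative histogram: each bucket counts observations that are less than
# or equal to its upper bound, and "+Inf" counts every observation.
class Histogram
  attr_reader :count, :sum

  def initialize(bounds)
    @bounds = bounds.sort
    @buckets = Hash.new(0) # upper bound => cumulative count
    @count = 0
    @sum = 0.0
  end

  def observe(value)
    @bounds.each { |bound| @buckets[bound] += 1 if value <= bound }
    @count += 1
    @sum += value
  end

  # Upper bound => cumulative count, in ascending order of bound.
  def buckets
    @bounds.to_h { |bound| [bound, @buckets[bound]] }.merge("+Inf" => @count)
  end
end

latency = Histogram.new([0.1, 0.5, 1.0]) # bounds in seconds, chosen arbitrarily
[0.05, 0.3, 0.3, 0.8, 2.4].each { |seconds| latency.observe(seconds) }
p latency.buckets
```

With these five observations, the 0.1 bucket holds 1, the 0.5 bucket holds 3, the 1.0 bucket holds 4 and "+Inf" holds all 5. Because the buckets are cumulative, a slow request only increments the buckets at or above its duration.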

To query the data

Using custom instrumentation from APMs

APMs like New Relic and AppSignal also allow you to add custom instrumentation to your application.

Here is an example from New Relic in Ruby

Source: https://docs.newrelic.com/docs/agents/ruby-agent/api-guides/ruby-custom-metrics

More info: https://docs.newrelic.com/docs/agents/ruby-agent/api-guides/ruby-custom-instrumentation

(2) Graphing Metrics

Once you’ve captured your application’s metrics, you will want to visualise and report on them.

Having a visual representation of the data helps with monitoring and troubleshooting, because you can see what healthy behaviour looks like and spot abnormal behaviour.

Some tools also allow you to set alerts when certain thresholds are reached.
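As a rough illustration of what a threshold alert boils down to, here is a small Ruby sketch. The `failure_alert` helper and the 5% default threshold are arbitrary examples, not a recommendation. Real alerting tools evaluate rules like this against stored metrics over a time window.

```ruby
# Returns an alert message when the failure rate reaches the threshold,
# and nil otherwise (including when there is no traffic at all).
def failure_alert(successes, failures, threshold: 0.05)
  total = successes + failures
  return nil if total.zero?

  rate = failures.to_f / total
  return nil if rate < threshold

  format("failure rate %.1f%% is at or above the %.1f%% threshold",
         rate * 100, threshold * 100)
end

message = failure_alert(90, 10)
puts message if message
```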

Using Grafana

Here is an example of a graph in Grafana.

Metrics can be captured using Prometheus or other tools like Graphite or InfluxDB.

What’s happening here?

In the graph above, we are counting the number of messages being sent from one system to another for processing, over a period of time.

  • Processed: the number of messages processed by system A, ready to be sent to system B for processing.
  • Failed: the number of messages that failed to be sent to system B.
  • Sending: the number of messages sent to system B.

Can you see something interesting above?

The number of processed messages exceeds the number of sending messages, i.e. system A is processing more messages than it is sending to system B.

What does this tell us?

It tells us there is a problem: somewhere in the application, half of the processed messages are not being sent to system B.

With a graph like the one above, you can spot an issue that would otherwise be hard to notice.
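The same comparison can also be made in code once both counters are available. The sketch below is only illustrative: `unsent_ratio` is an invented helper, and the sample numbers are made up to mirror the gap described above.

```ruby
# Compares two counters sampled over the same period and returns the share
# of processed messages that never showed up as sent (0.0 when none are missing).
def unsent_ratio(processed, sent)
  return 0.0 if processed.zero?

  [(processed - sent).to_f / processed, 0.0].max
end

ratio = unsent_ratio(1_000, 500) # made-up sample values
puts "#{(ratio * 100).round}% of processed messages were not sent" if ratio > 0.1
```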

Here’s another example, using a timer to see how long requests take.

(3) Using an APM (Application Performance Monitoring)

An APM allows you to monitor the performance of your application, showing metrics on response times and error rates.

Using an APM requires a bit of setup on your application to capture metrics.

Using New Relic

For example, New Relic requires installing an agent on the server to capture metrics like response times and to collect traces.

New Relic allows you to drill into how much time was spent at various layers of the application (e.g. from the database query up to the calling code).

Here is an example dashboard you see in New Relic. The dashboard shows you the response time for the application, with transaction times and error rates.

Source: https://docs.newrelic.com/docs/apm/new-relic-apm/getting-started/introduction-new-relic-apm

You can drill into a transaction to understand more about where time was spent within it.

Here is an example of the time a request took to respond, with the breakdown of time spent at different layers of code/systems.
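To give a feel for what sits behind a breakdown like this, here is a toy Ruby sketch that records time per named layer and reports each layer's share. It is not New Relic's API: the `RequestTrace` class and the layer names are invented, and an APM agent collects this kind of data automatically.

```ruby
# Records how long each named layer of a request takes.
class RequestTrace
  attr_reader :segments

  def initialize
    @segments = Hash.new(0.0) # layer name => total seconds
  end

  # Runs the block and adds its duration to the named layer.
  def measure(layer)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @segments[layer] += Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  # Layers sorted slowest first, each with its percentage of the recorded time.
  def breakdown
    total = @segments.values.sum
    @segments.sort_by { |_layer, seconds| -seconds }.map do |layer, seconds|
      share = total.zero? ? 0.0 : (seconds / total * 100).round(1)
      [layer, share]
    end
  end
end

trace = RequestTrace.new
trace.measure("database") { sleep 0.02 }     # sleeps stand in for real work
trace.measure("external_api") { sleep 0.05 }
trace.measure("rendering") { sleep 0.01 }
trace.breakdown.each { |layer, share| puts "#{layer}: #{share}%" }
```

The layers here are measured one after another. Nesting one `measure` call inside another would count the inner time twice, which real tracing tools avoid by tracking parent and child spans.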

Using AppDynamics

Here is a similar view in AppDynamics

The tools and techniques above are just some examples. There are many more tools out there, each suited to different needs and types of application.

I hope this short post provided some insight into a few techniques for instrumenting and monitoring the applications you build.

Engineering Manager (Ruby, Java, Elixir) | Crafter | Traveller. Lives: London/Sydney. Passionate about growing opportunities for people in Tech.