OpenTelemetry with Python

I recently encountered OpenTelemetry (OTEL) and have built several Python projects that use it. The experience was so positive that I now want to add it to every service I've previously developed because, let's be honest, logs alone are not a comprehensive way of understanding your application - especially when things go wrong.

OpenTelemetry is an ambitious project that seeks to unify how we collect traces, metrics, and logs. Its documentation is great but often hard to navigate because the standard has so many components. When first getting into OpenTelemetry, there are several elements to be aware of:

OpenTelemetry Protocol (OTLP), a protocol for transmitting traces, metrics, and logs over gRPC or HTTP/1.1
OpenTelemetry's language-specific instrumentation and exporters
OpenTelemetry Collector, a component that receives traces, metrics, and logs and lets you delegate where they are sent

With Python being a dynamic language, it supports automatic instrumentation. This is the ability for OTEL to wrap existing code bases and magically inject traces, metrics, and logs with, in many cases, no changes to the code base. The linked tutorial from OpenTelemetry's main page is comprehensive, but I still found some issues which I hope this guide will help with.
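
To make the idea of wrapping existing code concrete, here is a minimal sketch using only the standard library. It is not how OpenTelemetry is implemented; `FakeHttpClient`, `traced`, `instrument`, and `recorded_spans` are hypothetical names invented for illustration. It only shows why a dynamic language makes it possible to add telemetry without touching application code.

```python
import functools
import time

# Hypothetical in-memory "exporter"; a real SDK would ship spans elsewhere.
recorded_spans = []


def traced(name):
    """Return a decorator that records a span-like dict around each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                # Runs on both success and failure.
                recorded_spans.append({
                    "name": name,
                    "duration_s": time.perf_counter() - start,
                    "error": error,
                })
        return wrapper
    return decorator


class FakeHttpClient:
    """Stand-in for a third-party library that we do not control."""

    def get(self, url):
        return f"response from {url}"


def instrument(cls, method_name):
    """Replace a method in place with a traced version of itself."""
    original = getattr(cls, method_name)
    wrapped = traced(f"{cls.__name__}.{method_name}")(original)
    setattr(cls, method_name, wrapped)


instrument(FakeHttpClient, "get")

# The application code below is unchanged, yet the call is now recorded.
client = FakeHttpClient()
body = client.get("http://example.com")
print(body, recorded_spans[0]["name"])
```

The instrumentation packages do conceptually similar patching of real libraries, with far more care around context propagation and error handling.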

I wanted to explore instrumenting a Python library that is not already instrumented. Many libraries already are, including sqlalchemy, requests, flask, fastapi, grpc, and many more. For this I built an example application, github.com/costrouc/plotly-dash-instrument. Plotly Dash uses Flask under the covers, and the sample application also uses requests and sqlalchemy to demonstrate additional modules being instrumented.

Automatic Instrumentation

To instrument an existing application, simply install the needed instrumentation libraries. A big issue I ran into with automatic instrumentation was an instrumentation package being incompatible with the specific library versions I had installed. In my case, this was a documented issue: I had Flask version 3 installed while the instrumentation required 2.x. The [instruments] extra was critical to ensure that a compatible Flask version was installed.

opentelemetry-distro
opentelemetry-instrumentation-logging
opentelemetry-instrumentation-sqlalchemy
opentelemetry-instrumentation-requests
opentelemetry-instrumentation-flask[instruments]
opentelemetry-exporter-otlp

Additionally, the application must be launched with opentelemetry-instrument. In my docker-compose file you will find:

version: '3'
services:
  dash:
    build: .
    command:
      - opentelemetry-instrument
      - python
      - '-m'
      - 'dash_otel'
    ports:
      - 8000:8000
    environment:
      - OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
      - OTEL_SERVICE_NAME=dash-otel
      - OTEL_TRACES_EXPORTER=otlp
      - OTEL_METRICS_EXPORTER=otlp
      - OTEL_LOGS_EXPORTER=otlp
      - OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://collector:4317
      - OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://collector:4317
      - OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://collector:4317
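
A silent failure mode with this style of configuration is selecting the otlp exporter for a signal but forgetting its endpoint, in which case the exporter falls back to a default on localhost. Below is a hedged sketch of a sanity check you could run at startup. The function `missing_otlp_endpoints` is hypothetical; the environment variable names are the ones used above, plus the general `OTEL_EXPORTER_OTLP_ENDPOINT` fallback.

```python
import os

SIGNALS = ("TRACES", "METRICS", "LOGS")


def missing_otlp_endpoints(environ):
    """Return the signals that export via OTLP but have no endpoint configured."""
    missing = []
    for signal in SIGNALS:
        exporter = environ.get(f"OTEL_{signal}_EXPORTER", "")
        # A per-signal endpoint takes priority; the general one is a fallback.
        endpoint = (
            environ.get(f"OTEL_EXPORTER_OTLP_{signal}_ENDPOINT")
            or environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
        )
        if exporter == "otlp" and not endpoint:
            missing.append(signal.lower())
    return missing


# Same settings as the compose file, with the logs endpoint left out on purpose.
compose_env = {
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://collector:4317",
    "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT": "http://collector:4317",
}
missing = missing_otlp_endpoints(compose_env)
print(missing)  # ['logs']

# In a running container you would check the real environment instead.
print(missing_otlp_endpoints(os.environ))
```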

It's essential to start your application with opentelemetry-instrument, which monkey-patches all the important objects, e.g. requests.get(...) and flask.Flask. I ran into an additional issue which I think is not well documented and occurs when an application does live reloads, e.g. Flask or FastAPI in development mode. I've seen several issues about this in the contrib repository, so it seems to be a common problem. Make sure that reloading is disabled in the framework you are using (for example use_reloader=False for Flask's development server, or reload=False for uvicorn). Without this change, instrumentation and exports will not work.

That's it. Notice that we did not need to modify any code in our code base. This is all that is needed to instrument a Python application, assuming the collectors and services that receive traces, logs, and metrics have already been provisioned.

Manually Adding Logs, Traces, and Metrics

For logging, opentelemetry-instrumentation-logging magically wraps the logging module, making the transition nearly seamless. Developers don't need to be aware of OTEL.

import logging

logger = logging.getLogger(__name__)

logger.info('this will be captured by otel')

Traces are a bit more complicated because, unlike logging, tracing is not built into Python. Adding your own spans on top of the automatic ones is nevertheless just as simple. The Dash example does this to wrap callbacks.

from opentelemetry import trace

tracer = trace.get_tracer('myapp')

# Every call to my_function is recorded as a span named "dosomething".
@tracer.start_as_current_span("dosomething")
def my_function():
    ...

In my opinion, metrics are the hardest to add because you first have to decide which metrics to collect. There are many types of metrics (e.g. counter, gauge, histogram, etc.), which provide a rich way of tracking a particular value over time. This part of the standard was developed in close collaboration with Prometheus and should look familiar to existing Prometheus users. The magic of opentelemetry-instrument is that it lets the application either expose metrics via a Prometheus-compatible /metrics endpoint or transmit them at a regular interval using OTLP.

from opentelemetry import metrics

meter = metrics.get_meter("dash-otel.meter")

num_callback = meter.create_counter(
    "dash-otel.num-callback",
    description="total number of callbacks run"
)

num_callback.add(1)

Collecting Logs, Traces, and Metrics

Now that we have an application which exports logs, traces, and metrics - how do we actually visualize and inspect this data? The OTEL specification intentionally does not address this. OTEL intends for the community to innovate on ways to process and analyze the data; OTEL itself is only a framework for producing and sending the logs, traces, and metrics. Many enterprise solutions already exist:

logs: splunk, fluentd, datadog, graylog, cloud provider (AWS, GCP, Azure, etc.), grafana loki
traces: jaeger, grafana tempo, honeycomb, datadog
metrics: prometheus, graphite, datadog, grafana

The providers listed above are far from exhaustive, and a majority are not open-source. I wanted to choose a solution that consisted of open-source components. In the docker-compose file of the demo repository I've linked together OpenTelemetry Collector, Jaeger, Grafana Loki, Prometheus, and Grafana. While that is a lot of applications to wire together, the result is a central place to view traces, logs, and metrics, allowing for powerful analyses. All components are also open-source! The only existing enterprise solutions I am aware of that can do this are splunk and datadog.