Loki: Distributed Tracing Zipkin/Prometheus Mashup with Tom Wilkie
April 28, 2017 / Sonja Schweigert
Weaveworks Director of Software Engineering Tom Wilkie was busy at CloudNativeCon + KubeCon Europe 2017. Between showing off Cubienetes and presenting browser metrics in Prometheus, he found time to discuss one of his personal projects in a talk titled “Loki: An OpenSource Zipkin / Prometheus Mashup, Written in Go.” In it, he discussed how he wrote his own distributed tracing system called Loki. Based on OpenZipkin, Loki uses Prometheus’ service discovery system to actively pull traces from the application, unlike Zipkin, where traces are usually pushed from the application to the tracing system.
What is OpenTracing?
OpenTracing is a set of APIs for instrumenting applications for distributed tracing. Distributed tracing is one of the most direct ways of looking at how a complex system is behaving. A single trace consists of spans: named, timed operations that each represent a contiguous segment of work within that trace. Spans also carry a causal link to their parent and child spans. This structure allows you to analyse a single request as it bounces between services in your architecture, and to easily identify the critical path, where time is lost processing requests.
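The span model above can be sketched in a few lines of Go. This is an illustrative data structure, not the OpenTracing API itself: the `Span` type and its fields are hypothetical, chosen only to show the named, timed, causally linked operations the paragraph describes.

```go
package main

import (
	"fmt"
	"time"
)

// Span models the concepts described above: a named, timed operation
// with a causal link to its parent span. Field names are illustrative.
type Span struct {
	Name     string
	Duration time.Duration
	Parent   *Span // nil for the root span of a trace
}

func main() {
	// A single trace: one root span with two child spans,
	// as a request bounces between services.
	root := &Span{Name: "GET /checkout", Duration: 120 * time.Millisecond}
	auth := &Span{Name: "auth.Verify", Parent: root, Duration: 20 * time.Millisecond}
	db := &Span{Name: "db.Query", Parent: root, Duration: 90 * time.Millisecond}

	// Walking the parent links shows where time went;
	// the slowest child dominates the critical path.
	for _, s := range []*Span{auth, db} {
		fmt.Printf("%s -> %s: %v\n", s.Parent.Name, s.Name, s.Duration)
	}
}
```

In a real tracer, each span would also carry trace and span IDs so the links survive crossing process boundaries.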
Why make your own?
Tom admitted he might be a little crazy to have written his own distributed tracing system. But he did have a good reason: he was debugging a latency issue found with Weave Cortex, our Prometheus as a service software that is part of Weave Cloud. Rather than using what was available, he decided it would benefit him to build his own. (It also seemed like a fun project).
What makes Loki different from Zipkin? Since it’s based heavily on Prometheus, it uses a pull system rather than a push system. This is exactly what makes Prometheus different from Graphite. To learn more about the pull method, download our white paper “Application Monitoring with Weave Cortex: Getting the Most out of Prometheus as a Service.”
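The pull-versus-push distinction can be made concrete with a small Go sketch. This is an assumption-laden illustration, not Loki's actual wire format: the `/spans` endpoint and its JSON payload are hypothetical, standing in for whatever the instrumented app exposes to the scraper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// The instrumented application: instead of pushing spans to a
	// collector (as in classic Zipkin), it exposes pending spans
	// over HTTP and waits to be scraped.
	app := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode([]string{"GET /checkout", "db.Query"})
	}))
	defer app.Close()

	// The collector: discovers the target (via service discovery in
	// Prometheus and Loki; hard-coded here) and pulls its spans.
	resp, err := http.Get(app.URL + "/spans")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var spans []string
	json.NewDecoder(resp.Body).Decode(&spans)
	fmt.Println("scraped", len(spans), "spans:", spans)
}
```

Because the collector initiates the request, it already knows which target it scraped, which is what lets Loki attach identity to spans server-side.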
In Loki, Tom created a client library that keeps pending spans in an in-memory ring buffer; if spans are not collected frequently enough, they are dropped. The service discovery and retrieval library in the Loki tracers recognizes the identity of the scraped endpoints and can annotate received spans with this information. With this, jobs don’t need to know their own identity (that’s Loki’s job), and the identity of jobs is consistent across your tracing (Loki) and monitoring (Prometheus) systems.
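The ring-buffer behaviour described above can be sketched as follows. This is a simplified model, not Loki's actual client library: a fixed-size buffer where new spans overwrite the oldest ones if the scraper doesn't drain it in time, which is exactly how spans come to be dropped.

```go
package main

import "fmt"

// ringBuffer holds pending spans between scrapes. When it fills up,
// the oldest spans are silently overwritten (i.e. dropped).
type ringBuffer struct {
	spans []string
	next  int  // index of the next write
	full  bool // whether the buffer has wrapped at least once
}

func newRingBuffer(size int) *ringBuffer {
	return &ringBuffer{spans: make([]string, size)}
}

// Add records a finished span, overwriting the oldest one if full.
func (b *ringBuffer) Add(span string) {
	b.spans[b.next] = span
	b.next = (b.next + 1) % len(b.spans)
	if b.next == 0 {
		b.full = true
	}
}

// Collect drains the buffer in arrival order. The scraper calls this
// on every pull; only spans added since the last pull (and not yet
// overwritten) survive.
func (b *ringBuffer) Collect() []string {
	var out []string
	start, n := 0, b.next
	if b.full {
		start, n = b.next, len(b.spans)
	}
	for i := 0; i < n; i++ {
		out = append(out, b.spans[(start+i)%len(b.spans)])
	}
	b.next, b.full = 0, false
	return out
}

func main() {
	buf := newRingBuffer(3)
	for _, s := range []string{"a", "b", "c", "d"} {
		buf.Add(s) // "a" is overwritten once "d" arrives
	}
	fmt.Println(buf.Collect()) // [b c d]
}
```

The trade-off is deliberate: bounded memory in the client, at the cost of losing spans when scrapes are too infrequent.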
Loki is completely open source and compatible with Zipkin’s API.
What Loki doesn’t do (yet)
While Weaveworks currently uses Loki in their development environment, there are still a few things to work on before it can be used in a production environment. Here are a few of Loki’s current weaknesses as of this presentation:
- The client library does not yet support multiple scrapers
- Loki’s query performance is poor
- Single-process architecture limits scalability
- Sometimes spans don’t get delivered, leading to incomplete traces
  - Tom hypothesizes this is because of pull-system “jitter”
There are still some things to do to make Loki more reliable. Here are just a few Tom highlighted in his talk:
- Make it support multiple tracers
- Support languages other than Go
- Allow local storage, perhaps with BoltDB
- Use cloud storage to make it distributed
To get the full picture of this Zipkin/Prometheus distributed tracing mashup (and see the demo in action), watch Tom’s presentation below:
Thank you for reading our blog. We build Weave Cloud, which is a hosted add-on to your clusters. It helps you iterate faster on microservices with continuous delivery, visualization & debugging, and Prometheus monitoring to improve observability.