Tracking Your Business

You've built (or are maintaining) a product with many backend services that span different machines. These services all work together to implement one or more business processes.

How are you tracking it?

In general, how can you provide visibility into a process that is:

  • A series of processing stages arranged in succession
  • Performing a specific business function over a data stream (i.e. a transaction)
  • Spanning several machines

Note: I'm using the terms 'transaction', 'workflow' and 'pipeline' interchangeably to mean the same thing - a series of actions bound together, leading to a final goal under the same business process.

Order Tracking

A simple, somewhat crude example of a cross-system transaction would be an order preparation system in an electronics factory.

During such a workflow, an order entering the processing pipeline goes through each stage defined by the manufacturing floor manager - "planning, provisioning, packing, shipping".

Background Jobs & User Purchases

Taking this a bit closer to the Web, we can easily see instances of such transactions, even if we are not always aware we've implemented them that way. A background job is a pipeline, or a transaction, of one process.

A user ordering an item from your online store is another example where multiple stages are typically involved - charging, invoicing, etc. (perhaps some of these are even done with the help of third parties such as PayPal or Stripe).

Tracking In Practice

So how can you track these at the infrastructure level? Namely, how would you:

  • Get better visibility into an entire process, which may start at machine A and service X and end a few machines and services later at machine B and service Z.
  • Measure the overall performance of such a process across all of the players in your architecture and at each step of the way.

Internal Tracking

You may have bumped into this before. Referring back to the previous real-life manufacturing example, an item gets a "ticket" (a lot of the time it's pink, isn't it?) slapped onto it when it first becomes an actual entity in the factory.

This ticket is then used to record each person who handled the item, and the time at which it was handled.

Looking back at a distributed system implementing a pipeline: if the data handed from process to process lets you tack on additional properties (that is, it is persisted after each step, and persisting it is cheap), then you may be in luck.

In such a scenario it is common to include tracking metadata within the object, and just stamp it with relevant trace information (such as time) per process, within the lifetime of that object and the length of the pipeline.

At the end of the entire business process, a given object will show you where it's been and when. This idea is easy to implement and gives you excellent forensic ability - you can investigate your pipeline's behavior at each step just by looking at the object itself.
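To make the "ticket" idea concrete, here is a minimal Python sketch. The `trace` field name, the `stamp` and `durations` helpers, and the dictionary-shaped order are assumptions made for illustration; they do not come from any particular library.

```python
import time


def stamp(item, stage, clock=time.time):
    """Append a (stage, timestamp) entry to the item's own trace metadata."""
    # "trace" is a hypothetical field name; use whatever your payload allows.
    item.setdefault("trace", []).append({"stage": stage, "at": clock()})
    return item


def durations(item):
    """Seconds elapsed between consecutive stamps, keyed by the later stage."""
    trace = item.get("trace", [])
    return {
        later["stage"]: later["at"] - earlier["at"]
        for earlier, later in zip(trace, trace[1:])
    }


order = {"id": 42}
for stage in ("planning", "provisioning", "packing", "shipping"):
    stamp(order, stage)  # in a real pipeline, each service stamps its own stage

print([entry["stage"] for entry in order["trace"]])
print(durations(order))
```

The object carries its own history, so no extra service is needed, but every timestamp comes from whichever machine happened to apply the stamp.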

If you dig deeper into this sort of solution, though, you'll find a couple of pitfalls, which appear once you realize that the tracking itself is now a proper distributed system with a single goal:

  • Since you're tracking time, time must be synchronized on all machines. This may seem easy at first glance, but it becomes harder when you're measuring at sub-second resolution.
  • Failure points. With additional moving parts in the process, the probability of failure goes up.
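The clock problem is easy to see with made-up numbers. In this small Python sketch, the `measured_duration` helper and the skew values are hypothetical; it only shows how a modest offset between two machines can swamp a sub-second measurement.

```python
def measured_duration(start_true, end_true, skew_a=0.0, skew_b=0.0):
    """Duration as computed from two machines' local clocks.

    The start is stamped on machine A and the end on machine B. Each skew is
    how far that machine's clock runs ahead of true time, in seconds.
    """
    return (end_true + skew_b) - (start_true + skew_a)


# A step that really took 200 ms, measured with machine B's clock 350 ms behind.
print(measured_duration(100.0, 100.2, skew_b=-0.35))
```

With these numbers the computed duration comes out negative, even though the step itself took a fifth of a second.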

External Tracking

You may also be aware of systems in factories, or even physical shops, where operators feed an item ID, their signature and a time stamp into a thin terminal in order to indicate they have processed the item at their station.

Keeping that in mind, the solution I want to discuss here involves an external service, to which the pipeline simply announces each step as the item being processed moves along.

If you're originally coming from the enterprise, you've already identified this as something similar to BAM (Business Activity Monitoring).

And if you don't like enterprisey solutions, you may have heard of this concept being taken to a much lower, infrastructural level - Google's Dapper and, more recently, Twitter's Zipkin. These systems offer extremely detailed information about linear and tree-based transactions, and show you an immense trace-level breakdown of a process within your code.

Choosing a Solution

Although I could use an off-the-shelf enterprise BAM product, I really didn't want the kind of world of pain you get when integrating an enterprise product with an agile, lightweight, startup-like infrastructure.

And I didn't feel that using a system like Zipkin for higher-level, much less granular business processes was right either.

I wanted a simple system. Simple to maintain, simple to work with. As simple as an HTTP call.

Since I'd had such a system figured out in the back of my head for a while, I built it over the last weekend - the result is Roundtrip.

Using Roundtrip

A given distributed system may generate a ton of business workflows and transactions over many or few machines, and the point is that a transaction or a workflow starts at a certain machine, goes to one or more, and then ends up at some other (or same) machine.

We need a way to keep track of when a transaction starts and when it ends. A bonus would be the ability to track stages in the transaction that happen before it ends - let's call those checkpoints.

That is, basically, what Roundtrip is. Roundtrip will store the tracking data about your currently running transactions: start, end, and any number of checkpoints, and will provide metrics as a bonus.

When a transaction ends, it is removed from Roundtrip - this allows Roundtrip to stay bounded in storage size and to perform well.
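The lifecycle described above can be modeled in a few lines of Python. This is a toy in-memory stand-in written for illustration, not Roundtrip's actual code; the `TripStore` class and its method names are assumptions.

```python
import time
import uuid


class TripStore:
    """Toy in-memory model of start / checkpoint / end (not Roundtrip's code)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._trips = {}  # only trips that are still running live here

    def start(self, route):
        trip_id = uuid.uuid4().hex
        self._trips[trip_id] = {
            "id": trip_id,
            "route": route,
            "started_at": self._clock(),
            "checkpoints": [],
        }
        return trip_id

    def checkpoint(self, trip_id, name):
        self._trips[trip_id]["checkpoints"].append((name, self._clock()))

    def end(self, trip_id):
        # Removing the trip when it ends is what keeps storage bounded.
        trip = self._trips.pop(trip_id)
        trip["ended_at"] = self._clock()
        return trip

    def pending(self):
        return list(self._trips)


store = TripStore()
trip_id = store.start("invoicing")
store.checkpoint(trip_id, "generated.pdf")
finished = store.end(trip_id)
print(finished["route"], len(store.pending()))
```

Because a finished trip is popped out of the store, the store's size tracks the number of in-flight transactions rather than the total number ever seen.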

Mix and Match

Roundtrip supports pluggable backends (currently using Redis), metric aggregators (currently using StatsD), and APIs (currently using HTTP and command line). The plan is to support at least UDP and 0mq as APIs for extremely performant systems, although the HTTP API is already pretty good.

The idea is to let you mix and match storage and metrics systems as you see fit.
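One way to picture the mix-and-match idea is a small interface that the tracker depends on, with interchangeable implementations behind it. The sketch below is in Python and purely illustrative: `Metrics`, `RecordingMetrics` and `Tracker` are hypothetical names, and nothing here reflects Roundtrip's real plugin interface.

```python
from typing import Dict, List, Protocol, Tuple


class Metrics(Protocol):
    """Anything that can record a timing; a StatsD-backed class would fit."""

    def timing(self, name: str, millis: float) -> None: ...


class RecordingMetrics:
    """In-memory implementation, handy for tests."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, float]] = []

    def timing(self, name: str, millis: float) -> None:
        self.sent.append((name, millis))


class Tracker:
    """Depends only on the Metrics interface, not on a concrete aggregator."""

    def __init__(self, metrics: Metrics) -> None:
        self._metrics = metrics
        self._started: Dict[str, float] = {}

    def start(self, trip_id: str, now: float) -> None:
        self._started[trip_id] = now

    def end(self, trip_id: str, route: str, now: float) -> None:
        started = self._started.pop(trip_id)
        self._metrics.timing(f"{route}.total", (now - started) * 1000.0)


metrics = RecordingMetrics()
tracker = Tracker(metrics)
tracker.start("abc", now=1.0)
tracker.end("abc", route="invoicing", now=1.5)
print(metrics.sent)
```

Swapping the aggregator then means passing a different object to `Tracker`, without touching the tracking logic.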

Transaction lifecycle

Here's a quick walkthrough of using Roundtrip:

Note: in the near future, there will be language-specific drivers, so that you can just plug in the right one (be it a Ruby gem, a Python egg, or a Java jar).

For now, you'll have to make simple RESTful calls to interact with Roundtrip, using the HTTP library of your choice (HTTParty for Ruby, mikeal/request for node.js, etc.).

I'm using curl here just for the sake of experimentation.

Create a new trip.

This is where your transaction starts in your code. Issue the following HTTP call; you get back a trip ID, which you carry through the workflow / transaction so that you can place checkpoints on it and eventually end it.

curl -XPOST -d route=invoicing http://localhost:9292/trips

{"id":"cf1999e8bfbd37963b1f92c527a8748e","route":"invoicing","started_at":"2012-11-30T18:23:23.814014+02:00"}

Add as many checkpoints as you like.

Make sure to provide checkpoint as a form parameter in the request body. Yes, we're using PATCH, and yes, it's the semantically correct RESTful use of it :).

curl -XPATCH -dcheckpoint=generated.pdf http://localhost:9292/trips/cf1999e8bfbd37963b1f92c527a8748e

{"ok":true}

curl -XPATCH -dcheckpoint=emailed.customer http://localhost:9292/trips/cf1999e8bfbd37963b1f92c527a8748e

End your transaction.

You get back a bag of data which represents the trip your transaction made. With this you can update whatever system you have (be it metrics, analytics, health, etc.).

curl -XDELETE http://localhost:9292/trips/cf1999e8bfbd37963b1f92c527a8748e
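Until language-specific drivers exist, the three curl calls above could be wrapped in a few helper functions. The following Python sketch uses only the standard library; the helper names are made up for illustration, and it assumes the server answers every call with a JSON body, which the examples show for the POST and PATCH calls but not for the DELETE.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9292"  # same address as the curl examples


def _form(fields):
    """Encode fields the way `curl -d` does (form-urlencoded)."""
    return urllib.parse.urlencode(fields).encode("ascii")


def start_request(route):
    return urllib.request.Request(
        f"{BASE_URL}/trips", data=_form({"route": route}), method="POST")


def checkpoint_request(trip_id, checkpoint):
    return urllib.request.Request(
        f"{BASE_URL}/trips/{trip_id}",
        data=_form({"checkpoint": checkpoint}), method="PATCH")


def end_request(trip_id):
    return urllib.request.Request(f"{BASE_URL}/trips/{trip_id}", method="DELETE")


def send(request, timeout=5):
    """Perform the call and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def track(route, stages):
    """Run (checkpoint_name, callable) stages inside a single trip."""
    trip = send(start_request(route))
    for name, work in stages:
        work()
        send(checkpoint_request(trip["id"], name))
    return send(end_request(trip["id"]))
```

A call such as `track("invoicing", [("generated.pdf", make_pdf), ("emailed.customer", send_email)])` would then cover the whole lifecycle, where `make_pdf` and `send_email` stand in for your own functions.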

Inquire about pending/stale transactions.

Use this to list out transactions that don't end fast enough, by specifying older_than_secs. You can also get this in RSS format by using .rss instead of .json.

curl "http://localhost:9292/invoicing/trips.json?older_than_secs=300"
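If you poll for stale transactions from a monitoring script, building the URL programmatically avoids quoting mistakes. A small Python sketch follows; `stale_trips_url` is a hypothetical helper, and the URL layout is copied from the curl example above.

```python
import urllib.parse


def stale_trips_url(route, older_than_secs, fmt="json",
                    base_url="http://localhost:9292"):
    """Build the URL listing trips on `route` older than the given age."""
    query = urllib.parse.urlencode({"older_than_secs": int(older_than_secs)})
    safe_route = urllib.parse.quote(route, safe="")
    return f"{base_url}/{safe_route}/trips.{fmt}?{query}"


print(stale_trips_url("invoicing", 300))
print(stale_trips_url("invoicing", 300, fmt="rss"))
```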

Inquire about transaction performance and metrics.

Refer to your metrics system. By default, it is Graphite. You can inquire about the time each checkpoint took to complete, measured from the start of the transaction, and the time the entire transaction took from start to end.
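To give a feel for what such metrics look like on the wire, here is a Python sketch that turns one trip into StatsD timer samples (`bucket:value|ms`) and sends them over UDP. The bucket naming scheme and the helper names are assumptions; the names Roundtrip actually emits may differ.

```python
import socket


def statsd_timing(bucket, millis):
    """Format one StatsD timer sample, e.g. 'a.b:12|ms'."""
    return f"{bucket}:{int(round(millis))}|ms"


def trip_timings(route, started_at, checkpoints, ended_at):
    """Yield StatsD lines for each checkpoint and the total, from trip start.

    Times are in seconds; the bucket names are hypothetical.
    """
    for name, at in checkpoints:
        yield statsd_timing(f"{route}.{name}", (at - started_at) * 1000)
    yield statsd_timing(f"{route}.total", (ended_at - started_at) * 1000)


def send_udp(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget each line to a StatsD daemon (8125 is its usual port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


lines = list(trip_timings(
    "invoicing", 10.0,
    [("generated.pdf", 10.5), ("emailed.customer", 12.0)],
    12.25,
))
print(lines)
```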

Go Forth and Explore

You're welcome to go ahead and check Roundtrip out. Feel free to ask questions, submit issues and send pull requests on GitHub.

PS: Thanks Robert Kiraly (OldCoder) for the technical review!