250bpm

Economics of Messaging Software

It is quite hard to explain the difference between traditional business messaging (products like IBM's WebSphere MQ, APIs like JMS and protocols like AMQP or MQTT) and distributed messaging (as implemented by ØMQ). Both are ways for applications to talk to each other easily. However, once you start describing the differences, the discussion breaks down into lots of messy technical details and the big picture disappears.

To understand the big picture, I believe, one has to understand the economics of the problem area.

Traditional messaging software was conceived in the 1980s with the financial industry as its target market. The assumptions made back then were as follows:

  1. Transactions are extremely costly (say a $1M bank transfer) and cannot be lost whatever happens.
  2. There's an expert administration team available on a 24/7 basis to ensure the system is working properly.
  3. The network topology we care about is relatively small and under full control of the user.
  4. Development of the application is long and costly. Any changes to it are slow and costly as well.

In short, the users back then were willing to spend an incredible amount of money, to bear a heavy development and administration burden, and to give up performance and flexibility, all to get a single feature: almost perfect reliability.

There are a few industries that still have this kind of requirement (e.g. banking), and those are served well by the existing range of enterprise messaging solutions.

However, the world has changed since the 1980s. The typical requirements today are:

  1. Transaction cost is negligible (think of, for example, tweets: losing one tweet is not a big issue).
  2. The administration team is small and part-time at best (small firms) and non-existent at worst (sensors).
  3. Infrastructure can be huge (millions of clients using a web app) and is not necessarily controlled in full by a single user.
  4. Applications are created within days and design can change extremely quickly to address the business requirements.

The feeling that something was wrong with messaging systems started to become common some 10 years ago.

Interestingly, instead of realising that bank-style reliability requirements are just blown out of proportion for basically any other purpose, the problem was framed simply as "messaging is too expensive".

This line of thought brought us the new wave of centralised messaging systems that popped up in 2000-2010. The focus here is on lowering licensing, development and administration costs while still preserving enterprise-level reliability. The new products are generally open source with more or less permissive licenses and thus no associated licensing cost. APIs get somewhat simpler and the learning curve gets flatter. Solutions mostly work out of the box, with no complex installation process.

Also, the high cost was attributed to the effective duopoly held by IBM and Tibco, and multiple initiatives emerged to standardise messaging and thus seed a free market in messaging solutions. First, there was an attempt to standardise the API (JMS, in 2001), and later multiple attempts to standardise the wire protocol (AMQP, STOMP, in ~2005).

Unfortunately, the above solutions focus on treating the symptoms, not the cause. Instead of addressing the new business requirements, they address the old requirements in a cheaper way.

Let's think about it a bit…

Firstly, 99% of modern distributed applications don't need enterprise-level reliability. Most of the message traffic today is composed of transient and easily disposable content. Even where the content has financial value, that value is so low that the cost of a lost message is negligible compared to the cost of developing and maintaining a fully reliable enterprise-grade messaging system.

What's even worse, enterprise-level reliability is the kind of feature that affects basically every other feature in a negative way. It makes these other features hard to implement and difficult to use, it may result in mutilated and counter-intuitive semantics or, in the worst case, it can make new desirable features impossible to implement.

Thus, we are stuck with solutions that address a problem we don't care about ("six-nines" availability) and don't address the actual problems we are facing: unlimited scalability, multi-tenant environments, topological flexibility, development cost close to zero and no maintenance cost whatsoever.

Let me give you a few examples:

To ensure guaranteed delivery, we need a mandatory component (a broker) in the middle of the network that will store messages in a database, etc. Now, if we want to pass messages between two fully automated applications, we cannot do that directly; we have to pass them through the broker. We need a special box to run the broker. It has to be installed. It has to be administered. The cost goes up beyond what a small company can afford. Fail! They are going to use raw TCP connections instead.
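To make the "raw TCP instead of a broker" fallback concrete, here is a minimal sketch of two endpoints exchanging messages directly over a loopback TCP connection, using only the Python standard library. The 4-byte length prefix and the helper names (send_msg, recv_msg, run_demo) are assumptions made for this illustration; this is not ØMQ's wire format or API. Note what is missing: nothing is stored, nothing is retried, and there is no box in the middle to install or administer.

```python
import socket
import struct
import threading


def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Frame each message with a 4-byte big-endian length prefix
    # (an assumption of this sketch, not any standard protocol).
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    # TCP is a byte stream, so keep reading until n bytes have arrived.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def run_demo() -> list:
    # The receiving application listens on an ephemeral loopback port.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    received = []

    def receiver() -> None:
        conn, _ = listener.accept()
        with conn:
            for _ in range(2):
                received.append(recv_msg(conn))

    thread = threading.Thread(target=receiver)
    thread.start()

    # The sending application connects directly: no broker in between.
    with socket.create_connection(("127.0.0.1", port)) as sender:
        send_msg(sender, b"hello")
        send_msg(sender, b"world")

    thread.join()
    listener.close()
    return received


if __name__ == "__main__":
    print(run_demo())
```

If the receiver is down, the connect call simply fails and the message is never sent. That is exactly the trade-off described above: no delivery guarantee, but also no extra component to pay for.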

Another example: when distributing the same feed of messages to multiple clients, what should we do when one of the clients stops receiving the messages? They have to be stored somewhere in the meantime. The memory and the disk space on the storage box will eventually run out. What should we do then? In an enterprise environment the administrators monitor memory and disk usage and will fix the problem before it causes any harm. In a fully automated system we'll get into trouble. What's even worse, in enterprise environments the problem doesn't happen often (the number of receiver applications is strictly limited, and they are well maintained and not likely to get stuck or fail), whereas in the modern Internet environment the receivers can be highly unreliable or even malicious. Moreover, there can be thousands of them. The component storing the messages is going to be out of memory and/or disk space all of the time! Fail!
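One way to picture the alternative is to cap how much is stored per receiver and drop whatever does not fit. The sketch below does that with one bounded in-memory queue per subscriber. The class name BoundedFanout and its high_water_mark parameter are hypothetical, invented for this illustration; it is not a description of how ØMQ or any broker is actually implemented.

```python
from collections import deque


class BoundedFanout:
    """Fan a feed out to subscribers, keeping at most
    `high_water_mark` undelivered messages per subscriber."""

    def __init__(self, high_water_mark: int) -> None:
        self.high_water_mark = high_water_mark
        self.queues = {}   # subscriber name -> pending messages
        self.dropped = {}  # subscriber name -> count of dropped messages

    def subscribe(self, name: str) -> None:
        self.queues[name] = deque()
        self.dropped[name] = 0

    def publish(self, message) -> None:
        for name, queue in self.queues.items():
            if len(queue) >= self.high_water_mark:
                # This subscriber is not keeping up: drop the new message
                # for it instead of letting storage grow without bound.
                self.dropped[name] += 1
            else:
                queue.append(message)

    def receive(self, name: str):
        # Return the oldest pending message, or None if there is none.
        queue = self.queues[name]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    fanout = BoundedFanout(high_water_mark=3)
    fanout.subscribe("fast")
    fanout.subscribe("slow")
    for i in range(10):
        fanout.publish(i)
        fanout.receive("fast")  # the fast client reads every message
    # The slow client never reads: it holds 3 messages and lost 7.
    print(len(fanout.queues["slow"]), fanout.dropped["slow"])
```

Memory use is bounded by the number of subscribers times the limit, no matter how many receivers get stuck or misbehave, and a slow receiver only hurts itself. The price is that a slow receiver loses messages, which is acceptable precisely when the cost of a lost message is negligible.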

Finally, imagine the situation when the network is down. With an enterprise-grade messaging system, the messages are stored in a DLQ (dead-letter queue), where an assigned staff member can check them and execute the lost transactions via telephone, send them via FedEx or do whatever is appropriate. Now, in a modern lightweight Internet firm you have no person assigned to check the DLQ. What now? Are we going to simply throw away the messages in the DLQ? If so, why bother having a reliable messaging system at all? Fail!

In summary, I believe the current state of the business messaging ecosystem is caused by ignoring real-world requirements and focusing on requirements we've inherited from the corporate world of the 1980s. This blog post won't try to dive into the details of how ØMQ attempts to address the new, emerging requirements of the modern Internet ecosystem. Those details can be found elsewhere. I hope, though, that it gives a 30,000-foot perspective on the differences between traditional MQ solutions and ØMQ.

Happy Lunar New Year everybody!

January 23rd, 2012