Digging Deeper: key modules of a scalable trading platform

To see what’s going on under the hood of a trading platform our development team worked on, we sat down with Marko Kruljac.

Marko was one of the key people involved in the project, so we got him to tell us more about what makes this advanced crypto platform tick.

OK, when we first talked about this project, you told me that the entire platform has 6 main components.

MARKO: Yeah, we developed the platform to have 5 main modules, all built around the central one – Apache Kafka, which serves as a sort of core information highway for the entire platform. Those five main modules are:

  1. Kolektorka
  2. Deduplikator
  3. Kasirka
  4. Streamerka
  5. Arhiverka

In addition to those 5, we also have 8 supporting modules.

Let’s focus on those 5 main components and Kafka.

MARKO: Right, right. Now, how can I describe the architecture of the product so you’ll understand it? OK, imagine a large conveyor belt. And there are boxes being placed on the belt – and as they travel across they are processed and analyzed, and when they come to the end, you get a finished product.

The first element in this entire process, the one that places the boxes on the conveyor belt, is Kolektorka. This module collects information about executed trades and about the state of the order book. These two pieces of information are crucial for the entire platform because we use them to calculate all elements that need to be presented to users. Currently, there are between 120 and 140 Kolektorka modules active inside the platform and they are collecting information from more than 100 crypto exchanges.

So how do they work?

MARKO: Basically, we started off with 1 Kolektorka and gave it a list of exchanges it needs to visit and told it what data it needs to ask those exchanges for. When it gets the needed data, it puts it on the conveyor belt (which is Kafka, but we’ll talk about that later).

And that’s generally what this Kolektorka module does. I mean, the entire architecture of the system is built around small, independent, stupid modules doing most of the work, because that’s easy to create, maintain and scale. And Kolektorka is just that – small, distributed, isolated and stupid.

When you say stupid…

MARKO: Stupid means it does only one thing, but it does it extremely well. I say stupid, but maybe the better term would be extremely specialised. You see, there is no special logic behind a module like that. I mean, everything Kolektorka does can be explained in one sentence – go to the exchange, gather the data and send it to Kafka.

It has a defined list of exchanges it needs to visit and exact crypto pairs it needs to gather the data on. We mostly want to collect data on all available pairs from all available exchanges, because the more the better. The more data you have, the more competitive edge you have – that’s why we want it all.

So yeah, Kolektorka is pretty straightforward – it gets a task to go to, for example, Bitstamp and collect data on the BTC/USD pair. It knows how to get to Bitstamp and how to ask for the data it needs. You see, exchanges won’t give you data for all pairs at once, so you have to get them one by one. If you need the data about trades, you’ll get the info on all trades executed since the last time you asked. If you need the data about the state of the order book, it gets a bit more complicated because the order book can be extremely large, but in general, you just need the top 10 to 20 percent for analysis.
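
To make that concrete, here’s a minimal sketch of what a REST-polling collector in Kolektorka’s spirit could look like in Python. The Kafka client (kafka-python), the topic name and the endpoint are our assumptions for illustration, not details taken from the actual module:

```python
import json
import time

import requests
from kafka import KafkaProducer  # assumes the kafka-python client

# Illustrative endpoint and topic name -- the real module is exchange-specific.
TRADES_URL = "https://www.bitstamp.net/api/v2/transactions/{pair}/"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def collect_trades(exchange: str, pair: str) -> None:
    """Ask one exchange for recent trades on one pair and put them on the belt."""
    resp = requests.get(TRADES_URL.format(pair=pair), timeout=10)
    resp.raise_for_status()
    for trade in resp.json():
        producer.send("trades", {"exchange": exchange, "pair": pair, "trade": trade})

while True:
    collect_trades("bitstamp", "btcusd")
    time.sleep(3)  # respect the exchange's rate limit (2 to 5 seconds between calls)
```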

OK, so now you got the info. The thing is, this is all extremely slow because you have hundreds of exchanges and thousands of crypto pairs. If you have just 1 Kolektorka, there’s a way to get data on specific pairs from multiple exchanges at the same time because they don’t communicate with each other, but there’s no way to get data on all pairs from a single exchange because every exchange limits the number of times you can ping their API. They put the limit in place to protect themselves from DDoS attacks, and it basically means you have to wait anywhere between 2 and 5 seconds before requesting data again.

As I said, that’s extremely slow because we need data in real time. In the initial phases, we solved that problem by brute force. We didn’t have 1 Kolektorka, but 100 with different IP addresses. And one will ask for BTC/USD from exchange A, the other for BTC/EUR from exchange A and so on and so on until we cover all the pairs.

And when you have 100 Kolektorka modules…

MARKO: Exactly, you get 100 pieces of data in an instant. Instead of using 1 Kolektorka and waiting 200 seconds to get all the pieces of data you need, with 100 Kolektorka modules you get it all in 1 second.

As I said, when we started, this was the way to solve the problem with exchanges limiting the number of requests we could send, but it all changed some 6 months after we started.

What happened?

MARKO: Well, more and more exchanges started supporting socket technology and they basically said – don’t ping us every couple of seconds, just connect to our exchange using sockets and we’ll send you the data when any change occurs.

When we started, we had a Kolektorka module constantly sending requests to the exchange – give me the data, give me the data, give me the data. And most of the time it got the same answer – nothing changed, nothing changed, nothing changed. That wasted resources for both us and the exchanges. Now, with sockets, you don’t have to constantly send requests and ping exchanges, because they send you the info you need at the exact moment something happens.
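
For a sense of what the socket approach looks like in code, here’s a rough Python sketch using the websockets package and a kafka-python producer; the exchange URLs and subscription messages are invented, since every exchange defines its own:

```python
import asyncio
import json

import websockets  # assumes the 'websockets' package
from kafka import KafkaProducer  # assumes the kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

async def stream_trades(ws_url: str, subscribe_msg: dict, exchange: str) -> None:
    """Hold one socket open and forward every pushed event straight to Kafka."""
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps(subscribe_msg))  # every exchange defines its own format
        async for raw in ws:
            producer.send("trades", {"exchange": exchange, "event": json.loads(raw)})

async def main() -> None:
    # One module can keep several sockets open at once; URLs and messages are invented.
    await asyncio.gather(
        stream_trades("wss://ws.exchange-a.example", {"event": "subscribe", "channel": "live_trades_btcusd"}, "exchange-a"),
        stream_trades("wss://ws.exchange-b.example", {"op": "subscribe", "args": ["trades:BTC-USD"]}, "exchange-b"),
    )

asyncio.run(main())
```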

I’m sensing a ‘but’ here…

MARKO: But… it wasn’t so simple. You see, we had to reprogram the modules to support that. The initial ones were all REST Kolektorka modules and they were super simple to develop because you had a complete library you could use – it was plug and play, and it worked perfectly. When we wanted to turn them into Socket Kolektorka modules, we had to code them from scratch for every single exchange.

Let me guess, every exchange has something special and different, which makes the process extremely difficult?

MARKO: Exactly! None of them work like you’d expect. They are all special in their own nasty way, and that made our job so much harder. I mean, there were situations where there was absolutely no documentation on how to connect to them using sockets and even less on getting them to send you the data you need. But we did it – developed it from scratch for all the exchanges that supported it.

So what does it mean for the system?

MARKO: Well, you no longer need to have 100 Kolektorka modules to connect to 100 exchanges because you’re not limited and don’t have to wait. You can have 20 modules because each one can easily connect to 5 exchanges and get info from them. And all it has to do is receive the data and send it to Kafka. This was the ideal solution because we get the data in an instant and we don’t waste resources on constant pinging.

OK, but then why do we still have 100 Kolektorka modules in the system?

MARKO: Backup. We have them for backup, and because not all exchanges support sockets. And the thing is that we want to collect data from all exchanges.

Got it. Are we done with Kolektorka?

MARKO: Well, we just covered the trades. Kolektorka also collects data on order books, and that’s a whole other beast. Size-wise. See, one trade is about 15 kilobytes, while an order book has hundreds of orders on the left side and hundreds on the right. And you have to collect them all, for all crypto pairs and from all exchanges. That’s more than 15,000 unique markets.

And that’s all Kolektorka’s job?

MARKO: Exactly. It needs to continuously collect order books, but that’s not 15 kilobytes, it’s more like 200. And that can fill up the available space pretty fast. But the thing with order books is that, although they’re large, they change rarely. For example, if you view it at T=0 and then T=0+0.5 seconds – only a couple of top rows have changed. Everything else remained the same.

This is where we did an awesome little thing to optimize everything – instead of sending the entire order book to Kafka every single time, Kolektorka sends it just the first time and identifies it as an original. And then just sends the data that’s different from that original – basically patching the order book every time it receives new data. And the best thing is that each of these patches is 5 kilobytes.
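
A minimal sketch of how such a patch could be computed; the snapshot format here is our assumption, not the platform’s actual one:

```python
def orderbook_patch(previous: dict, current: dict) -> dict:
    """Return only the price levels that changed since the previous snapshot.

    Both snapshots are assumed to map price -> size on each side, e.g.
    {"bids": {"42000.0": 1.2, ...}, "asks": {...}}. A size of 0 marks a
    level that disappeared, so the original can be patched on the other end.
    """
    patch = {}
    for side in ("bids", "asks"):
        changed = {}
        for price, size in current[side].items():
            if previous[side].get(price) != size:
                changed[price] = size        # new or updated level
        for price in previous[side].keys() - current[side].keys():
            changed[price] = 0               # level was removed
        patch[side] = changed
    return patch
```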

By doing that we compressed the data, saved terabytes of space every week and made everything work even smoother. And that’s that when it comes to Kolektorka.

Nice, onto the next one!

MARKO: OK, the next element on our conveyor belt is the Deduplikator.
You see, Kolektorka modules are redundant – there are 4 to 5 of them that collect the same things at the same time. Now I know what you’re gonna ask – why do you need 5 copies of the same data? Well, what if one Kolektorka dies, what if its IP gets banned, what if there’s a power outage – whatever it may be, you still have other sources active. The data must go on. Get it?

As great as that is, it also leaves us with a ton of duplicate data. That’s where Deduplikator comes into play. Just as I said about Kolektorka, it’s also a stupid module. Its task is to be a gatekeeper – to only let the new data pass through, while blocking duplicates.

OK, how does it work?

MARKO: Well, it’s like this. If you want to build your own Google that could browse only one website – that’s super simple. But if you want to develop one that could browse the entire internet, well that’s a multi-billion dollar project.
Creating a Deduplikator module that would remove duplicates from one exchange – piece of cake. Developing one that would deduplicate terabytes of data…

Mission impossible…

MARKO: Let’s just say that you need a really, really smart solution to do something like that. And we managed to find that solution in Redis.

It’s a database whose main feature is that it’s extremely fast because it doesn’t save data to disk, but in RAM. And it’s a crucial element of deduplication. The Deduplikator module reads the info about a trade and lets it pass, and then goes to Redis and says something like: “Hey there Redis, I just let this trade go through under this key (the key consists of the exchange name, exact crypto pair and the ID of the trade). If the trade with this same key comes again, remind me that I’ve already let it through because I can’t remember that stuff.”

As I said, Deduplikator is also a stupid module – it does only one thing extremely well and can’t do anything else. Deduplikator can let data go through and inform Redis about the key, but it can’t know what exactly it let through. And that works perfectly.

Now, there’s a little caveat – there’s 50 gigabytes of data on Redis, which means you need 100 gigabytes of RAM to make it work. And RAM can be expensive if you plan on storing all that data. That’s why we decided to optimize the whole thing and said: OK Redis, you need to periodically delete all data that’s older than 1 week if we want this to be stable. And like I already said a couple of times – this also works perfectly.

Here’s what we’re thinking – if it’s been more than a week, nobody’s going to collect that data again, and there will be nothing to deduplicate. It’s old news. I mean, we could collect data from the last three years, but if you know you’ll never get duplicate data, you don’t need it.
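
Here’s a hedged sketch of that gatekeeper logic, assuming the redis-py client; the key follows Marko’s description (exchange name, crypto pair, trade ID) and the one-week expiry he mentions:

```python
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)
ONE_WEEK = 7 * 24 * 60 * 60  # anything older than a week is dropped automatically

def is_new_trade(exchange: str, pair: str, trade_id: str) -> bool:
    """Return True only the first time a given trade key is seen.

    SET with NX and EX is atomic: the key is stored only if it doesn't exist
    yet, and the one-week expiry is attached in the same call.
    """
    key = f"{exchange}:{pair}:{trade_id}"
    return r.set(key, 1, nx=True, ex=ONE_WEEK) is True

# The first call lets the trade through, the second one blocks the duplicate.
print(is_new_trade("bitstamp", "btcusd", "12345"))  # True  -> pass it along
print(is_new_trade("bitstamp", "btcusd", "12345"))  # False -> duplicate, drop it
```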

But that’s just the trades…

MARKO: Right, that was the easy part. Deduplicating order books was a hell of a job. Deduplicating trades is simple because you have a key – exchange, pair, trade ID. If it’s new, let it pass. If it matches, throw it out.

But order books, oh boy. You have 100 Kolektorka modules that see the same or just slightly different order book, and they all pass that information at the same time. And now your job is to figure out the timeline and rebuild the order book. You need to figure out what’s the truth and what that order book really looks like – and only one version of it is correct.

Now wait a minute, how come Kolektorka modules don’t have the exact time when a piece of data was collected?

MARKO: They do, but it’s the job of a Deduplikator module to take them all together and put them in the correct order. You see, sometimes delivery can be a little late, and when you receive hundreds of pieces of data every second, being a little late can mess up the system.

OK, so how did you solve it?

MARKO: Well, Redis did most of the heavy lifting. But the thing is that you can never have it fully correct. There’s always that fraction of a percent that might be wrong and you have to live with it. But it’s easy to live with it because that fraction of a percent…

It’s practically irrelevant in a couple of seconds because of the sheer amount of data that the system collects.

MARKO: Exactly, spot on. Basically, the Deduplikator module gets all those snapshots of different parts of the order book, takes them all together and recreates one uniform piece of information that it sends through. It’s extremely complicated, but it works. You got rid of duplicates, you solved the problem with the disk and you solved the problem of having to do the same thing over and over and over again. And this leads us to our next big module.

Kasirka?

MARKO: Now that we have all that data, we have to do something smart with it. And we are, although the module that does it is again very stupid, or extremely specialized as we said before.

Kasirka is a module that does technical trade analysis – that’s a set of mathematical functions and formulas that, when coupled with trade information, give you metrics and additional data that can help you create trading strategies. One of the most basic examples of this would be a simple moving average – the average price over a rolling time window, plotted through time. That was just an example; Kasirka does a wide variety of simple and much more complex analyses.

The great thing about developing this was that we didn’t have to do it from scratch – there’s a full library we could just connect and implement – it’s called TA-Lib, a technical analysis library. It’s written in C, which means it’s extremely fast, and it’s basically a black box. We can, for example, say that we need an RSI (Relative Strength Index) indicator – we input the data, TA-Lib does its magic, we get the numbers we need and send them through our conveyor belt. This library makes it as simple as that.
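
For illustration, this is roughly what a call into the TA-Lib Python wrapper looks like; the price series is made up:

```python
import numpy as np
import talib  # the TA-Lib Python wrapper

# Made-up closing prices for one crypto pair, oldest first.
close = np.array([42000.0, 42150.0, 41980.0, 42300.0, 42275.0,
                  42410.0, 42390.0, 42560.0, 42505.0, 42700.0,
                  42650.0, 42820.0, 42790.0, 42950.0, 43010.0])

rsi = talib.RSI(close, timeperiod=14)  # Relative Strength Index
sma = talib.SMA(close, timeperiod=5)   # simple moving average

# The latest values are what would get sent on down the conveyor belt.
print(rsi[-1], sma[-1])
```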

Now let’s get into a little bit more detail. Kasirka is not one module, it’s basically three – there’s a Candle Maker Kasirka, a Technical Analysis Kasirka and an Index Kasirka. The first one on our conveyor belt is the Candle Maker Kasirka – it just receives trade data and aggregates it into 10 different time intervals, starting with 1 minute. It aggregates, it makes its calculations and then it sends the data to Kafka. That’s all it does, and that’s extremely important because its output is the input that Technical Analysis Kasirka and Index Kasirka need to perform their tasks.
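
A minimal sketch of that kind of trade-to-candle aggregation; the field names and the one-minute default are illustrative, not the platform’s actual schema:

```python
from collections import defaultdict

def build_candles(trades: list[dict], interval_seconds: int = 60) -> list[dict]:
    """Aggregate raw trades into OHLCV candles for one time interval.

    Each trade is assumed to carry 'timestamp' (unix seconds), 'price' and
    'amount' -- the field names are illustrative.
    """
    buckets = defaultdict(list)
    for trade in sorted(trades, key=lambda t: t["timestamp"]):
        start = trade["timestamp"] - trade["timestamp"] % interval_seconds
        buckets[start].append(trade)

    candles = []
    for start, bucket in sorted(buckets.items()):
        prices = [t["price"] for t in bucket]
        candles.append({
            "start": start,
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": sum(t["amount"] for t in bucket),
        })
    return candles
```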

The great thing about the Kasirka module is that it’s fully scalable. Let’s say you have 100 pairs you need to analyze – if you have 1 Kasirka, it will analyze all of the pairs. If you have 2 Kasirka modules, they’ll organize themselves through Kafka and each one will analyze 50 pairs. If you have 100 of them, each will analyze 1 pair. There are no limits, it’s infinitely scalable.

How many Kasirka modules are currently in the system?

MARKO: There are 50 modules and the whole thing works extremely smoothly.

OK, what’s next?

MARKO: Well, I think it’s finally time we tackle that Kafka element we’ve been talking about from the beginning. Kafka is basically the core of that conveyor belt. It’s a distributed messaging system originally developed at LinkedIn, and it’s the only component that we didn’t develop from scratch – we just took the entire solution, configured it, installed it, connected everything, turned it on and it worked like a charm.

The main thing Kafka enabled us to do is scale the entire system. Whenever you need to upgrade the system, you simply add new nodes (they call them Brokers) that can receive and send new messages. You just add a couple of pieces of new hardware and you’re good – that was benefit #1. The other benefit was that Kafka also enabled us to infinitely scale all our modules – all because of Kafka’s simple technique called partitioning.

Imagine our conveyor belt divided into three parts, with little barriers between them. Now you don’t have one single person working on the entire belt, but three different people, and each one of them has to take care of the boxes only on their part of the belt. That means you can send three boxes at the same time, and all three will be analyzed at the same time. And this partitioning option isn’t limited to three – it enables you to divide the conveyor belt and increase the number of modules as much as you need. It was the ideal solution for the system.
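
In Kafka terms, that kind of work sharing is what consumer groups over a partitioned topic provide. A small sketch, assuming the kafka-python client and a hypothetical topic name:

```python
from kafka import KafkaConsumer  # assumes the kafka-python client

# Every Kasirka process joins the same consumer group. Kafka then splits the
# topic's partitions among however many group members are running: one process
# reads everything, two each read half, a hundred each read their own slice.
consumer = KafkaConsumer(
    "candles",                          # hypothetical topic name
    group_id="kasirka",                 # shared group -> automatic work sharing
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)

for message in consumer:
    # message.partition tells us which section of the "conveyor belt" this came from
    print(f"partition {message.partition}: {message.value}")
```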

It was basically a plug & play solution?

MARKO: I’d say plug & pray. You see, none of us had worked with Kafka before, so we kind of winged it. We rented a bunch of machines on Amazon Web Services, installed Kafka and tried different things until it worked. You see, although the technology has been around for some time, Kafka only recently gained traction. This means there are limited resources online, so working with it is basically a trial & error process. And after a lot of that trial & error, we ended up with the right solution, which we implemented in the final platform.

And that’s basically it, that’s Kafka. I mean, I could talk for hours about it, we could make a whole series about that single component…

But it’s time to cover the final two modules.

MARKO: Yes, Arhiverka and Streamerka. Let’s start with Arhiverka.
What you need to know about our conveyor belt is that it’s flat and that it has an end. And when a box comes to the end, it just hangs on there for some time, until other boxes push it off the ledge. Now, if those boxes aren’t opened, analyzed and stored before they fall off the belt, the data they’re carrying is lost. So if you need some data from three months ago, well, tough luck.

But that’s why we have Arhiverka. This module basically archives all of the boxes – it takes them from the belt, makes an exact copy and saves it in its database. And you need that because you need access to historical data.

So you have the streaming part, which gives you current, real-time data, and the historical part, which shows you how the markets changed through time.
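
Conceptually, Arhiverka behaves like a Kafka consumer that copies everything it reads into long-term storage. A toy sketch, with SQLite standing in for whatever database the platform actually uses and a hypothetical topic name:

```python
import json
import sqlite3  # stand-in for the real archive database

from kafka import KafkaConsumer  # assumes the kafka-python client

db = sqlite3.connect("archive.db")
db.execute("CREATE TABLE IF NOT EXISTS trades (payload TEXT)")

consumer = KafkaConsumer(
    "trades",                           # hypothetical topic name
    group_id="arhiverka",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Take each "box" off the end of the belt and keep an exact copy in cold storage.
for message in consumer:
    db.execute("INSERT INTO trades (payload) VALUES (?)", (json.dumps(message.value),))
    db.commit()
```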

So, for how long is it stored?

MARKO: How long? Indefinitely. It’s cold storage.

That’s a whole lot of data…

MARKO: Yeah, that’s a whole lot of data. Last I checked, there were 700 million database entries. And that’s something. Give me a quick second to crunch the numbers… That’s some 13 gigabytes per minute. I eyeballed it, it’s a napkin calculation. I could be wrong, but I think that’s more or less in the right ballpark.

So yeah, it’s a whole lot of data. Enormous, just enormous amount of data being collected every day. And when you know that each piece of data is just a couple of kilobytes, the entire thing is even more amazing. While we were testing the platform during the development, we filled 5 terabytes in about 2 weeks.

Damn, that’s… I don’t even know what to say. While I wrap my head around this, tell me more about the last module.

MARKO: Ah yes, Streamerka. Amazing module.
Streamerka is actually data streaming as a service, and its task is to be the interface for customers. This is the module that enables them to connect to the system and to receive data about crypto pairs from exchanges.

By using Streamerka, users can subscribe to get information about trades, order books, technical indicators, candles, indexes. Users can even subscribe to get news and sentiment – they can see whether the crypto community has positive or negative feelings about certain cryptocurrencies. For example, users may want to get information about all pairs available on Bitstamp, or maybe they just want to get info about BTC/USD, but from all exchanges. Due to the amount of raw data the system collects from more than 100 exchanges, the possibilities for personalized information and analyses are endless.

The entire system is, of course, built on socket technology, so users don’t have to constantly ping the system. They just need to subscribe to the information they want to receive and the system delivers it to them in real time. They just need to have an active subscription.
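
A rough sketch of what such a subscription interface could look like, using the Python websockets package; the channel names, message format and handler signature (recent versions of the library) are our assumptions for illustration:

```python
import asyncio
import json
from collections import defaultdict

import websockets  # assumes a recent version of the 'websockets' package

subscriptions = defaultdict(set)  # channel name -> connected client sockets

async def handle_client(ws) -> None:
    """Let a client subscribe to channels such as 'trades:bitstamp:btcusd'."""
    try:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("action") == "subscribe":
                subscriptions[msg["channel"]].add(ws)
    finally:
        for subscribers in subscriptions.values():
            subscribers.discard(ws)

async def broadcast(channel: str, payload: dict) -> None:
    """Push one update, in real time, to everyone subscribed to that channel."""
    for ws in list(subscriptions[channel]):
        await ws.send(json.dumps(payload))

async def main() -> None:
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # keep serving until the process is stopped

asyncio.run(main())
```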

And I believe that’s it. We covered them all – Kafka as the main central highway, the conveyor belt of our story, and 5 key modules that collect, analyze and distribute data.

 

That’s all for this edition of Digging Deeper. If you would like us to work on your startup, contact us and let’s discuss your idea!