Big data acquisition — top 5 frameworks and technologies

When it comes to business, more data means more opportunities, but companies that want to utilize data first need to get a hold of it — that process is called data gathering or data acquisition.

Depending on your business strategy, gathering, processing and visualizing data can help your company extract value and financial benefits from it. Our new ebook will help you understand how each of these aspects works when implemented on its own, as well as when they’re linked together. Download it for free!

In the last couple of years, the amount of data that’s being produced on a daily basis has increased exponentially — from the beginning of recorded time until 2003, humans created 5 exabytes of data, and in 2011 the same amount was created every two days. Today, these amounts are created in minutes. That is why data gathering is extremely important in today’s world.

Data gathering can be defined as the process of acquiring, filtering and cleaning data before it is put into storage, where data analysis can then be carried out.

At its core, data gathering means acquiring data from distributed information sources and storing it in data warehouses that are scalable and can handle extremely large volumes of complex data.

To achieve this goal, businesses need three main components:

  1. Protocols that enable the gathering of information from distributed data sources
  2. Frameworks through which the data is collected by using different protocols
  3. Technologies that allow constant storage of data acquired through the frameworks

Frameworks and tech for data gathering

Data gathering frameworks and technologies are very specific when it comes to their functionalities and ideal usage, so it’s important to define your overall goals before you lock in on any of them.

When it comes to data gathering, some of the most widely used frameworks and technologies are:

#01 Storm

Storm is an open-source framework for robust distributed real-time computation on streams of data. It supports a wide range of programming languages and storage facilities, and one of its main advantages is that it can be utilized in many data gathering scenarios, including stream processing and distributed RPC for computing intensive functions on the fly.
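Storm applications are structured as topologies of spouts (stream sources) and bolts (processing steps). As a rough, plain-Python illustration of that stream-processing model (this is a toy sketch, not Storm's actual API; all class names here are ours), consider a rolling word count over a stream of sentences:

```python
from collections import Counter

class SentenceSpout:
    """Emits a stream of sentences, standing in for a Storm spout."""
    def __init__(self, sentences):
        self.sentences = sentences

    def stream(self):
        yield from self.sentences

class WordCountBolt:
    """Keeps a running word count, standing in for a Storm bolt."""
    def __init__(self):
        self.counts = Counter()

    def process(self, sentence):
        for word in sentence.lower().split():
            self.counts[word] += 1
        return dict(self.counts)

# Wire the spout to the bolt, processing each tuple as it arrives
# rather than waiting for a complete batch.
spout = SentenceSpout(["storm processes streams", "streams of data"])
bolt = WordCountBolt()
for sentence in spout.stream():
    latest = bolt.process(sentence)

print(latest["streams"])  # 2
```

The point of the model is that results like `latest` are updated continuously as data flows through, which is what makes it suitable for on-the-fly computation.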

It’s used by a number of big systems, with some of the largest ones being Groupon, The Weather Channel and Twitter.

#02 Simple Scalable Streaming System

Simple Scalable Streaming System, or S4, is a distributed, general-purpose platform for developing applications that process streams of data, originally launched by Yahoo! Inc. It is designed to work on commodity hardware, avoiding I/O bottlenecks by relying on an all-in-memory approach.

S4 provides a simple programming interface for processing data streams in a decentralized, symmetric, and pluggable architecture.
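In S4's model, events are routed to processing elements (PEs) keyed by an event attribute, and each PE keeps state only for its own key. A minimal plain-Python sketch of that keyed routing (class and attribute names are ours, not S4's API):

```python
class ProcessingElement:
    """Holds per-key state, loosely modeled on an S4 PE."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, event):
        self.count += 1

class Stream:
    """Routes each event to the PE for its key, creating PEs on demand."""
    def __init__(self):
        self.pes = {}

    def dispatch(self, event):
        key = event["user"]  # hypothetical keying attribute
        pe = self.pes.setdefault(key, ProcessingElement(key))
        pe.on_event(event)

stream = Stream()
for e in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    stream.dispatch(e)

print(stream.pes["a"].count)  # 2
```

Because each key maps to exactly one PE, the work can be spread symmetrically across commodity machines without a central coordinator, which is the design S4's decentralized architecture aims for.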

#03 Kafka

Kafka is a distributed publish-subscribe messaging system designed to support persistent messaging with high throughput. It aims to unify offline and online processing with its ability to partition real-time consumption over a cluster of machines, and it is built to minimize network overhead and to exploit efficient sequential disk operations.
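The partitioning idea can be sketched in plain Python (an in-memory toy, not the Kafka client API): messages are hashed by key to a partition, so all events for the same key land on the same partition and stay in order, while different keys spread across the cluster.

```python
import zlib

# Hypothetical in-memory stand-in for a Kafka topic with 3 partitions.
NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def produce(key, value):
    """Messages with the same key always land in the same partition."""
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append((key, value))
    return p

# All events for one user stay ordered within a single partition,
# so a consumer reading that partition sees them in produce order.
p1 = produce("user-42", "click")
p2 = produce("user-42", "purchase")
assert p1 == p2
```

In real Kafka, each partition is an append-only log that consumers read sequentially, which is what lets consumption scale out over a cluster of machines.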

It was originally developed at LinkedIn to track the huge volume of activity events generated by the website.

#04 Flume

Flume is a service whose purpose is to provide a distributed, reliable and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Its architecture is based on streaming data flows; it is simple and flexible, but also robust and fault tolerant, with tunable reliability and many failover and recovery mechanisms.

Flume was designed with these four key goals in mind — reliability, scalability, manageability and extensibility.
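A Flume agent is configured as a chain of sources, channels and sinks in a properties file. As an illustration, a hypothetical single-agent configuration (the agent and component names a1, r1, c1 and k1 are ours) that tails a log file into HDFS through an in-memory channel might look roughly like this:

```properties
# Hypothetical Flume agent: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1
```

The channel is where the tunable reliability lives: a memory channel is fast but loses events on a crash, while a file-backed channel trades throughput for durability.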

#05 Hadoop

Hadoop is an open-source project that focuses on developing a framework for reliable, scalable, and distributed computing on big data using clusters of commodity hardware.
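The model Hadoop distributes across a cluster is MapReduce: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A single-process plain-Python sketch of a word count in that style (function names are ours, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(values) for word, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["big data big clusters", "data"])))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

On a real cluster, Hadoop runs the map and reduce functions in parallel on the machines holding the data, so the same program scales from one node to thousands.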

It’s used and supported by a large number of big organizations like Facebook, AOL, Baidu, IBM, Imageshack, and Yahoo.

Interested in reading more about big data?

We have a couple more blogs covering big data, such as:

And if this is something you would like to dive deeper into, download our free ebook “Value of data — the business side of data gathering, processing and visualization”!