Making Big Data Agile

Summary:

Big data is defined as web-scale data in quantities ranging to several terabytes (TB) or petabytes (PB). Big data is inherently difficult to manage because of its sheer size and largely free-form structure, which is best summarized by the three Vs: volume, velocity, and variety.

Because of its sheer size and lack of structure, big data is not well aligned with the agile recommendations of simplicity, efficiency, and collaboration. Let's explore how to make big data agile.

Collecting Only Relevant Data

As data gets bigger, filtering at the data source becomes essential to keep the data size within limits. As an example, if database log data is to be collected, which data attributes are most important? Some log data could be repetitive; should the repetitive data be collected, or is collecting the first instance of such data sufficient? If configuration settings are being output, should they be collected as well, or simply referenced from the configuration files? As an example of filtering, the Fluentd framework for data collection provides filter plugins that select events from an event stream based on the values of specified fields; selected fields can also be masked or deleted. Apache Flume, another framework for data aggregation, provides interceptors to filter events based on configured regular expressions, and the filtering can be include- or exclude-based.
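As a rough sketch of this kind of source-side filtering in plain Python (not the Fluentd or Flume APIs), the example below applies hypothetical include and exclude patterns and drops consecutive repeats before anything is forwarded downstream. The field values and patterns are illustrative assumptions.

```python
import re

# Hypothetical include/exclude patterns, analogous to include- or
# exclude-based filtering: keep warnings and errors, drop health checks.
INCLUDE = re.compile(r"\b(ERROR|WARN)\b")
EXCLUDE = re.compile(r"health-check")

def filter_at_source(log_lines):
    """Yield only the relevant log lines, collapsing consecutive repeats."""
    previous = None
    for line in log_lines:
        if EXCLUDE.search(line) or not INCLUDE.search(line):
            continue            # irrelevant: never leaves the source
        if line == previous:
            continue            # repetitive: keep only the first instance
        previous = line
        yield line

# Example: only the first ERROR line would be forwarded downstream.
sample = [
    "2024-05-01 INFO  health-check ok",
    "2024-05-01 ERROR disk quota exceeded",
    "2024-05-01 ERROR disk quota exceeded",
]
print(list(filter_at_source(sample)))
```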

Processing and Storing Only Relevant Data

Not all collected data may be useful for processing or storage. Data preparation is performed before data is processed, analyzed, and stored, and processing can be performed in-memory to extract the relevant data that is to be presented to a user. Which data is to be stored depends on the type of application. As an example, Yahoo Mail has no option but to store all email messages, but Yahoo does limit the storage size to 1 TB per Yahoo Mail account. As another example, is it important for an online store to archive all user session data, such as when a user logged in and logged out and what a user browsed or searched for, or is session data of little significance? As big data continues to grow exponentially, it has become increasingly important to process and store only the most relevant data. The data storage tools or databases used are also an important factor; increasingly, NoSQL databases are replacing relational databases because web data is mostly schema-free and unstructured. Data may also need to be split into manageable chunks or transformed for processing and storage.
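As a rough illustration of preparing data before storage, the sketch below keeps only selected attributes of each session record and splits the result into fixed-size chunks. The field names and chunk size are arbitrary assumptions, not tied to any particular store.

```python
RELEVANT_FIELDS = ("user_id", "login_time", "search_terms")  # assumed attributes
CHUNK_SIZE = 1000                                            # arbitrary chunk size

def prepare(records):
    """Project each raw session record down to the relevant fields."""
    for record in records:
        yield {field: record[field] for field in RELEVANT_FIELDS if field in record}

def chunked(items, size=CHUNK_SIZE):
    """Split an iterable of prepared records into manageable chunks."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Each chunk could then be written to a NoSQL store or a file in one batch.
raw = [{"user_id": 1, "login_time": "09:00", "mouse_moves": 5821, "search_terms": "shoes"}]
for batch in chunked(prepare(raw)):
    print(batch)  # stand-in for a bulk write
```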

Using Efficient Data Exchange and Storage Formats

The data exchange and storage formats to use are another factor to consider. The most suitable storage format depends on the type of data. Several formats are available, including Apache Avro, Apache Parquet, JSON, protocol buffers, and XML. Some of the choices to make are:

  • Whether the data is text or binary

  • Whether data is stored row-wise or in columnar form

  • Whether stored data is compressed

  • Whether complex data structures are supported

  • Whether a schema is used

As an example, both XML and JSON are text-based formats that give data some structure and support an associated schema. XML uses paired tags, while JSON uses key/value pairs; JSON is more compact because it carries less metadata and no tags. Apache Avro stores binary data in rich data structures, is fast and compact, and makes use of schemas. Apache Parquet is a columnar storage format with support for efficient compression and encoding schemes. The data exchange format to use depends on whether the data is text or binary and on the available network bandwidth. Most web services initially used XML as the data exchange format but then shifted to JSON because it is more compact and faster to process. The protocol buffers format is likewise preferred over XML for data exchange because it is compact, fast to serialize and parse, and requires less network bandwidth.
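To make the text-versus-binary trade-off concrete, the sketch below encodes the same record as XML-style tagged text, as JSON, and as a schema-driven binary packing, using only the Python standard library. It is only an illustration of relative sizes, not the actual Avro, Parquet, or protocol buffers APIs; the record fields are assumed.

```python
import json
import struct

record = {"user_id": 42, "page": "/home", "duration_ms": 1250}

# Text formats: XML repeats paired tags, JSON repeats keys.
as_xml = "<event><user_id>42</user_id><page>/home</page><duration_ms>1250</duration_ms></event>"
as_json = json.dumps(record)

# Binary packing with an implicit schema (field order and types agreed up front),
# in the spirit of Avro or protocol buffers: no field names are stored at all.
page = record["page"].encode("utf-8")
as_binary = struct.pack(f"<IH{len(page)}sI", record["user_id"], len(page), page, record["duration_ms"])

print(len(as_xml), len(as_json), len(as_binary))  # the binary form is the most compact
```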

Using Lower Retention Period Thresholds for Historical Data

A lot of data is significant only for a short duration, and a retention period is usually set for historical or other archived data. As an example, Yahoo Mail deletes Spam folder messages after 30 days, and the Google Chrome browser stores browsing history for 90 days by default. Using lower thresholds for retaining historical and non-essential data reduces the quantity of data that needs to be stored.
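A minimal sketch of enforcing a retention threshold is shown below, assuming each record carries a timestamp field; the 30-day value simply mirrors the spam-folder example above and would be tuned per data category.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # assumed threshold for low-value or historical data

def prune_expired(records, now=None):
    """Drop records older than the retention period; keep the rest."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return [r for r in records if r["timestamp"] >= cutoff]

messages = [
    {"id": 1, "timestamp": datetime.now(timezone.utc) - timedelta(days=45)},  # expired
    {"id": 2, "timestamp": datetime.now(timezone.utc) - timedelta(days=5)},   # retained
]
print([m["id"] for m in prune_expired(messages)])  # -> [2]
```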

Using a Different Storage Media for Archiving

Data to be archived does not have to be stored on the same node or machine on which data computation and analysis are performed. As an example, Apache Hadoop supports storage types beyond the default DISK, such as SSD for frequently accessed data and a dedicated ARCHIVE storage type for data archiving, which offers high storage density but little compute power. Nodes optimized for computation are not well suited to long-term data storage, because the two have different hardware characteristics. Also, data that is frequently used or queried should be kept more accessible than data that is maintained only for archival purposes.
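The idea of keeping hot and archived data on different storage media can be sketched as below, where files that have not been accessed recently are moved from a fast volume to a hypothetical dense archive mount. The paths and the 90-day threshold are illustrative assumptions (the directories are presumed to exist), not an HDFS API.

```python
import os
import shutil
import time

HOT_DIR = "/data/hot"          # assumed fast storage used by compute nodes
ARCHIVE_DIR = "/mnt/archive"   # assumed dense, low-compute archival storage
ARCHIVE_AFTER_DAYS = 90        # illustrative threshold

def archive_cold_files():
    """Move files not accessed within the threshold onto the archive media."""
    cutoff = time.time() - ARCHIVE_AFTER_DAYS * 24 * 3600
    for name in os.listdir(HOT_DIR):
        path = os.path.join(HOT_DIR, name)
        if os.path.isfile(path) and os.path.getatime(path) < cutoff:
            shutil.move(path, os.path.join(ARCHIVE_DIR, name))
```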

Streaming Real-Time Data

With the volume and velocity of new data growing, it has become imperative to stream data in real time, and several frameworks for streaming data are available. Apache Kafka is a commonly used distributed streaming framework with a publish/subscribe model. Kafka is similar to a messaging system in that messages, or records, are stored in topics that may be polled by topic subscribers, and it is used for developing real-time data pipelines and applications. Apache Flink is a distributed processing framework for data streams that is suitable for developing both streaming and batch applications. Event-driven streaming applications perform some data processing based on an event, such as a new batch of data being sent, and Flink is also suitable for data analytics on continuously streaming data. Apache Samza is a framework for processing real-time data from diverse data sources. A useful feature in Samza is stateful stream processing, in which the data state is saved locally. An example of stateful stream processing is counting the number of page views per user per hour; aggregation with sum or count functions requires keeping state.
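The stateful aggregation mentioned above, counting page views per user per hour, can be sketched in plain Python as follows. The in-memory dictionary stands in for the local state store that a framework such as Samza or Flink would manage, and the event fields and simulated stream are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

# State kept across events: (user, hour) -> running page-view count.
page_views = defaultdict(int)

def on_event(event):
    """Update the running count for one streamed page-view event."""
    hour = datetime.fromisoformat(event["timestamp"]).strftime("%Y-%m-%d %H:00")
    page_views[(event["user"], hour)] += 1

# A few simulated events standing in for records polled from a topic.
stream = [
    {"user": "alice", "timestamp": "2024-05-01T10:05:00", "page": "/home"},
    {"user": "alice", "timestamp": "2024-05-01T10:40:00", "page": "/cart"},
    {"user": "bob",   "timestamp": "2024-05-01T11:10:00", "page": "/home"},
]
for event in stream:
    on_event(event)

print(dict(page_views))  # {('alice', '2024-05-01 10:00'): 2, ('bob', '2024-05-01 11:00'): 1}
```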

Support for Containerization

When Apache Hadoop and other big data frameworks first came into use, containerization platforms such as Docker and Kubernetes were not available. Because lightweight containers make efficient use of the underlying OS kernel, big data processing frameworks have since started to support containerization by providing Docker images and related optimizations.

Big data has grown even bigger than when it first began to be processed and stored. To keep big data manageable and agile, it has become important to collect, store, and process only the relevant data, not all the data that is generated.

 
