Big Data Engineering Services

Mobile apps, IoT devices, connected vehicles, cloud servers and other sources generate massive amounts of data. Processing and enriching billions of records requires specialized engineering skills that few organizations have. This is where Moonshadow can help: we can develop entire data processing pipelines, or speed up the one process that is your bottleneck. Data pipelines developed by our engineers have processed and enriched trillions of records.

When we re-engineer data pipelines or storage formats, we often achieve size reductions of over 90% for structured data without adding hardware. A ten-fold reduction in storage size not only cuts storage costs by 90%; it also means the data can be retrieved and transmitted ten times faster.

These are substantial claims. The following sections will give you an idea of how we achieve these results.

Size Matters

When we start a project, we first analyze the current data storage formats and data streams. We have developed extremely efficient, patented data compression technology that losslessly compresses columnar data by over 90%. In some cases, such as time-series data, the more data we have, the higher the compression rate: where 10-second interval data may compress by 90%, 1-second data may compress by 95%. A 10x increase in data may result in only a 2x increase in stored size.

We use a variety of techniques to reach these compression levels. Instead of storing data byte by byte, we work at the bit level, removing the 'empty' space that pads most data storage formats. Our MMZIP compression tool analyzes the first million (or ten million) rows to automatically design the best compression approach. If the format or the composition of the records changes, MMZIP automatically re-evaluates and adapts its compression strategy.
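
MMZIP's internals are proprietary, but the core bit-level idea can be sketched in a few lines of Python. The pack/unpack helpers below are hypothetical illustrations, not MMZIP code: they store each value of a column in exactly the number of bits that column needs, rather than a fixed number of bytes.

```python
# Illustrative only: stores non-negative integers using `bits` bits each,
# so a column whose values fit in 7 bits takes 7 bits per value, not 8+.

def pack(values, bits):
    """Pack integers into bytes at `bits` bits per value."""
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc = (acc << bits) | v            # append the value's bits
        nbits += bits
        while nbits >= 8:                  # flush whole bytes
            nbits -= 8
            buf.append((acc >> nbits) & 0xFF)
    if nbits:                              # pad the final partial byte
        buf.append((acc << (8 - nbits)) & 0xFF)
    return bytes(buf)

def unpack(data, bits, count):
    """Reverse of pack: read `count` values of `bits` bits each."""
    acc, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < bits:
            acc = (acc << 8) | next(it)
            nbits += 8
        nbits -= bits
        out.append((acc >> nbits) & ((1 << bits) - 1))
    return out

# Six sensor readings in 0..127 need 7 bits each: 42 bits, 6 bytes total,
# versus 24+ bytes as 32-bit integers.
readings = [17, 99, 3, 127, 0, 64]
assert unpack(pack(readings, 7), 7, len(readings)) == readings
```

A production codec would add per-column bit widths, delta encoding for time series, and dictionary encoding for repeated strings on top of this packing layer.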

Some clients receive data at a higher precision than their devices can actually measure. Latitude/longitude data is often supplied with one-centimeter (0.4 inch) precision, whereas the GPS devices generating it are only accurate to about 5 meters (16 feet). By discarding this spurious precision we achieve additional size reductions.
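
A minimal sketch of this trimming, with hypothetical helper names: a coordinate at 7 decimal places resolves roughly 1.1 cm, far beyond a ~5 m GPS fix, so rounding to 5 decimal places (roughly 1.1 m) keeps all of the real accuracy in fewer digits, and storing the result as a scaled integer packs better than a floating-point value.

```python
# Hypothetical illustration of precision trimming for lat/lon data.

def trim_coordinate(degrees, decimals=5):
    """Round a lat/lon value and return it as a scaled integer."""
    return round(degrees * 10 ** decimals)

def restore_coordinate(scaled, decimals=5):
    """Convert a scaled integer back to degrees."""
    return scaled / 10 ** decimals

lat = 40.7127837                    # centimeter-precision input
scaled = trim_coordinate(lat)       # 4071278, a small integer
assert restore_coordinate(scaled) == 40.71278
```

The scaled integers are also ideal input for the bit-level packing described above, since nearby coordinates produce small deltas.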

Less is More

Smaller data = faster transmission = lower cost. Data that is reduced 90% in size is transmitted 10x faster than the uncompressed data: a file that took an hour to transmit now arrives in six minutes. The same holds at every stage. Reading the data from disk is 10x faster, and saving it to disk on the target hardware is 10x faster.

Not only are the transmission times lower, so are the costs, as cloud providers typically charge per TB of transferred data. The same is true for storage costs, which means organizations can use servers with smaller, much less expensive disks.

Fast Compression/Decompression

The disadvantage of using compression is the time it takes to compress, and later decompress, the data. If compression and decompression take a long time, the time savings disappear; if they require expensive servers, the cost savings disappear as well.

We designed our flagship compression tool, MMZIP, to be fast and to run on inexpensive servers. Throughput depends on the composition of the data, but MMZIP typically processes several million records per second per server, and it achieves this on small, inexpensive machines regardless of the size of the dataset being compressed.

Because MMZIP is fast and runs on small, inexpensive servers, the time and cost savings from compression far outweigh the time and expense of performing it.

Streaming Compression

MMZIP, like the custom data processing pipelines we design, uses a streaming protocol. Streaming compression lets us compress, or decompress, a one-terabyte file on a server with just 32GB of RAM: we read the file in batches of one (or ten, or fifty) million records, compress and output each batch, then fetch the next.
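
The batching pattern can be sketched as follows, using zlib and newline-delimited records as stand-ins for MMZIP's proprietary format. Memory use is bounded by the batch size rather than the file size, which is what lets a 32GB server work through a terabyte-scale file.

```python
# Illustrative batched streaming compression. Each batch is compressed and
# written as a length-prefixed block, then freed before the next batch loads.
import struct
import zlib

BATCH_SIZE = 1_000_000  # records per batch; tune to available memory

def compress_stream(src, dst, batch_size=BATCH_SIZE):
    """Compress records from iterator `src` into blocks written to `dst`."""
    batch = []
    for record in src:                  # src yields one record at a time
        batch.append(record)
        if len(batch) == batch_size:
            _flush(batch, dst)
            batch = []                  # previous batch is now released
    if batch:
        _flush(batch, dst)

def _flush(batch, dst):
    block = zlib.compress(b"\n".join(batch))
    dst.write(struct.pack("<I", len(block)))  # length prefix for the reader
    dst.write(block)
```

Decompression mirrors this: read one length prefix, read and decompress that block, emit its records, repeat.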

Streaming also means that compression, transmission and decompression can happen simultaneously, which can dramatically reduce the time it takes to send large files from one server to another. As soon as the first batch of records is compressed, its transmission starts while MMZIP compresses the second batch. Once the first batch is received, decompression starts while the next batch is being transmitted, and so on.
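
This overlap is a classic three-stage pipeline. As a sketch, the version below uses zlib for the codec and an in-process queue in place of the network hop; bounded queues keep memory use fixed while all three stages run concurrently.

```python
# Illustrative pipeline: compress -> "transmit" -> decompress, overlapped.
import queue
import threading
import zlib

DONE = None  # sentinel marking the end of the stream

def pipeline(batches):
    """Run the three stages concurrently and return the decompressed batches."""
    to_send, to_unpack, results = queue.Queue(2), queue.Queue(2), []

    def compress():
        for batch in batches:
            to_send.put(zlib.compress(batch))   # batch N compresses while
        to_send.put(DONE)                       # batch N-1 is in flight

    def transmit():                             # stands in for the network link
        while (block := to_send.get()) is not DONE:
            to_unpack.put(block)
        to_unpack.put(DONE)

    def decompress():
        while (block := to_unpack.get()) is not DONE:
            results.append(zlib.decompress(block))

    threads = [threading.Thread(target=f) for f in (compress, transmit, decompress)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the stages overlapped, total time approaches that of the slowest stage alone rather than the sum of all three.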

Data Filtering

Very large datasets need to be retained, but each project usually needs only a tiny portion of the data. We may be storing statistics on 10,000 servers for the last three years, yet a traffic analysis may cover just a few hours on a small selection of those servers. When analyzing traffic patterns we need access to a state's connected vehicle data for a full year, but a particular study may only examine traffic across a small freeway section during peak hours.

The first step in using big data for analytics is selecting the tiny portion of the data that is relevant for the study. Once a smaller project dataset has been generated, existing analysis tools can usually handle it.

Traditional compression tools require a dataset to be completely decompressed before the data can be filtered into a smaller project dataset. This means the entire uncompressed dataset must be written to disk and read back by the filtering software. Depending on the data pipeline and the filtering tools, this may require massive amounts of disk space and memory, reintroducing the need for large, expensive servers for long periods, and cloud providers charge by the minute.

MMZIP avoids these costs in time and money by filtering the data during decompression. MMZIP reads in a batch of millions of records, filters them in memory against the given criteria, outputs only the matching records, and reads in the next batch. The full dataset never needs to be written to disk, and the data selection can run on a small, inexpensive server. MMZIP can filter around one million records per second per server.
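
The filter-during-decompression pattern can be sketched as a generator, again using the hypothetical length-prefixed zlib blocks from the streaming sketch as a stand-in for MMZIP's container format. Each batch is decompressed in memory, filtered, and discarded before the next is read, so the uncompressed dataset never touches the disk.

```python
# Illustrative filter-during-decompression over length-prefixed zlib blocks.
import struct
import zlib

def filtered_records(stream, keep):
    """Yield only the decompressed records for which keep(record) is true."""
    while header := stream.read(4):
        (length,) = struct.unpack("<I", header)
        batch = zlib.decompress(stream.read(length)).split(b"\n")
        yield from (record for record in batch if keep(record))
```

Because it is a generator, downstream code can consume matching records one at a time while peak memory stays at one batch.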