Or how to compress terabytes of data in 32GB of RAM

A waypoint dataset for an area can quickly exceed ten billion records. An uncompressed dataset with ten billion waypoints takes three terabytes of space or more, depending on the data in the columns. Servers with terabytes of RAM are expensive: Amazon AWS charges $13 per hour for a 2TB server, which comes to over $9,000 per month. Servers with 32GB of RAM start around $0.35 per hour, or roughly $250 per month, so there is a strong cost incentive to compress and decompress large datasets on servers with little memory.
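As a rough check on those figures, a back-of-the-envelope calculation (assuming roughly 730 hours in a month; the hourly prices are the ones quoted above) looks like this:

```python
# Back-of-the-envelope monthly cost comparison.
# Hourly prices are the illustrative figures from the text; 730 is an
# approximate number of hours in a month.
HOURS_PER_MONTH = 730

large_server = 13.00   # $/hour for a ~2 TB RAM instance
small_server = 0.35    # $/hour for a ~32 GB RAM instance

print(f"2 TB RAM server:  ${large_server * HOURS_PER_MONTH:,.0f} per month")  # ~ $9,490
print(f"32 GB RAM server: ${small_server * HOURS_PER_MONTH:,.0f} per month")  # ~ $256
```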

Moonshadow’s MMZIP solves this problem by compressing large datasets with a streaming protocol. MMZIP streams waypoints in, compresses them, and streams them out, with an internal buffer that never holds more than a few million waypoints. In this way a dataset spanning terabytes never needs to be in memory all at once, so it can be compressed and decompressed on small servers. You can specify how many records MMZIP reads in at a time, so data engineers can tune this for the servers and data they have. MMZIP usually reduces the size of waypoint data by over 90%, losslessly. Generic compression technologies such as GZIP and BZIP2 achieve lower compression ratios on waypoint data: their files can be twice as large and take twice as long to transfer.
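MMZIP’s internals are not shown here, but the bounded-buffer idea can be sketched with a generic stream compressor standing in for MMZIP. The chunk size below plays the role of the “records read in at a time” setting; the file names are placeholders:

```python
import gzip
import sys

# A minimal sketch of bounded-buffer streaming compression, with gzip standing
# in for MMZIP. Only one chunk of the input is ever held in memory, so the
# dataset as a whole never has to fit in RAM.
CHUNK_BYTES = 64 * 1024 * 1024  # tune to the server, like MMZIP's batch size

def stream_compress(src, dst_path):
    """Read src in fixed-size chunks and append compressed output to dst_path."""
    with gzip.open(dst_path, "wb") as out:
        while True:
            chunk = src.read(CHUNK_BYTES)
            if not chunk:
                break
            out.write(chunk)

if __name__ == "__main__":
    # e.g.  python stream_compress.py < waypoints.csv
    stream_compress(sys.stdin.buffer, "waypoints.csv.gz")
```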

Streaming compression with MMZIP provides significant cost savings, but there is another benefit that may be even more important for certain use cases: simultaneous compression, transfer and decompression. MMZIP was created to speed up data transfers from one cloud server to another and to load data into memory faster. Once the compression server has written records to standard out, transmission of those records can begin immediately. The decompression server at the other end can start decompressing as soon as the first records arrive. This means compression, transfer and decompression all take place at the same time for different parts of the dataset, which speeds up the entire transfer and can reduce latency for real-time applications.
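This overlap of the three stages is the same pattern a Unix pipeline gives you. The sketch below uses gzip, ssh and gunzip as generic stand-ins for MMZIP’s compressor and decompressor; the host name and file paths are placeholders:

```python
import subprocess

# A sketch of overlapping compression, transfer and decompression by chaining
# processes through pipes. gzip/gunzip stand in for MMZIP's tools, and
# "remote-host" plus the file paths are placeholders.
compress = subprocess.Popen(
    ["gzip", "-c", "waypoints.csv"],                       # compress to stdout
    stdout=subprocess.PIPE,
)
transfer_and_decompress = subprocess.Popen(
    ["ssh", "remote-host", "gunzip -c > waypoints.csv"],   # decompress as bytes arrive
    stdin=compress.stdout,
)
compress.stdout.close()            # let gzip see a broken pipe if ssh exits early
transfer_and_decompress.wait()
compress.wait()
```

While the first chunks are still being compressed locally, earlier chunks are already crossing the network and being decompressed on the remote server.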

The cost and time savings don’t stop there. Once the data has been transferred it usually needs to be decompressed, enriched and filtered. With GZIP or BZIP2 you typically have to decompress all the data before you can enrich it and select the records for a project. Project datasets often use less than 0.5% of the records, but selecting them requires decompressing the entire dataset first. This means you still need terabytes of disk space to store the data, and either terabytes of RAM or a filtering process split into stages. MMZIP performs the enrichment and filtering inside the decompression engine and outputs only the selected data, so a 32GB RAM decompression server is sufficient for decompressing, enriching and filtering datasets with tens of billions of records and terabytes of data. Because decompression is faster than compression or transmission, project datasets can be ready within minutes of the last records arriving.
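MMZIP’s enrichment and filtering hooks live inside its decompression engine, but the memory benefit can be illustrated with a generic streaming filter. In this sketch the column names and the bounding-box condition are invented for the example:

```python
import csv
import gzip

# A sketch of filtering during decompression: records are decoded, tested and
# written one at a time, so only the selected fraction of the dataset is ever
# materialised. Column names ("lat", "lon") and the bounding box are invented.
def filter_waypoints(compressed_path, out_path, bbox):
    min_lat, min_lon, max_lat, max_lon = bbox
    with gzip.open(compressed_path, "rt", newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:                       # one record in memory at a time
            lat, lon = float(row["lat"]), float(row["lon"])
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                writer.writerow(row)             # only selected records are kept

# e.g.  filter_waypoints("waypoints.csv.gz", "project.csv", (51.3, -0.5, 51.7, 0.3))
```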

MMZIP is licensed as a set of Linux tools that data pipeline engineers can install on their own servers and integrate into their data processing pipelines.