MMZIP Reduces Connected Vehicle Data Processing Time and Costs by Over 90%
Moonshadow’s MMZIP compresses, enriches and filters data from connected vehicles, fleet software, mobile apps and IoT devices. MMZIP is provided as a set of executables that data processing engineers can install on their own servers and integrate into their own data processing pipeline. This blog post describes a common workflow that data engineers perform on connected vehicle (CV) data and estimates the savings in time and cost that can be realized using MMZIP.
We looked at the following use case:
- Compress one billion waypoint records
- Download the data
- Decompress the data
- Assign a road segment to each waypoint
- Assign five regions to each waypoint
- Filter the data to create a workable project dataset
- Create a project dataset with just the selected records
This is a very typical series of steps that data engineers need to perform before they can use connected vehicle waypoint data for a project. Most data providers deliver the data in GZIP format, so we compared MMZIP to GZIP. Both compression technologies are lossless. We also estimated the time and costs when no compression technology is used; this is listed below as NoZIP.
GZIP is a general-purpose compression technology that was developed for text-based information stored in rows. MMZIP is a compression technology created specifically for time-series columnar data, such as the data collected from connected vehicles, mobile apps and IoT devices. In our tests we used a movement dataset from Wejo. The table below shows the differences in time between the three solutions, normalized to one billion records.
These times will vary because they depend on many factors, such as the number of columns, the fields, the content, the cleanliness of the data, the hardware used, internet connection speeds and much more. The overall picture, however, is clear: for each billion records processed, MMZIP saves almost 12 hours of server time compared to GZIP. This comes to an 86% reduction in processing time.
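The relationship between a percentage reduction and a speed-up factor is easy to misread, so here is a minimal sketch of the arithmetic. It uses only the 86% reduction and the roughly 12 hours saved that are quoted above. The function names are illustrative, and the implied per-solution times are derived from those two rounded figures rather than measured.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.86 = 86%) into a speed-up factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def implied_times(hours_saved: float, reduction: float) -> tuple[float, float]:
    """Return (baseline_hours, improved_hours) implied by a saving and a reduction."""
    baseline = hours_saved / reduction
    return baseline, baseline - hours_saved


speedup = speedup_from_reduction(0.86)
gzip_hours, mmzip_hours = implied_times(12.0, 0.86)
print(f"speed-up: {speedup:.1f}x")
print(f"implied GZIP time: {gzip_hours:.1f} h, implied MMZIP time: {mmzip_hours:.1f} h")
```

An 86% reduction means the MMZIP run takes 14% of the GZIP time, which is a speed-up of about seven times.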
The above table shows that MMZIP compression is roughly three times better for connected vehicle data than GZIP. There are several reasons why MMZIP files are so much smaller. GZIP was developed primarily for text, and it actually achieves higher compression ratios than MMZIP for row-based text data. MMZIP is optimized for columnar compression of time-series data. Instead of storing the actual values, MMZIP stores the differences (deltas) between consecutive updates and compresses those. The deltas span a much smaller range than the actual values, so they compress at a much higher ratio. The Wejo data is relatively clean and well normalized. What’s more, it has a consistent update interval of three seconds, and the higher the update frequency, the higher the compression ratio MMZIP achieves. GZIP compressed the Wejo data by 70%, whereas MMZIP achieved compression ratios of 90% or more, which leaves files about one third the size of the GZIP files. The smaller MMZIP files transfer faster, which produces the first time saving.
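To illustrate why deltas compress better than raw values, here is a minimal sketch of delta encoding on a single timestamp column. It is not MMZIP's actual format, which this post does not describe: the function names and the synthetic data are assumptions, and zlib (DEFLATE, the same algorithm GZIP uses) stands in for the final compression stage.

```python
import struct
import zlib


def delta_encode(values: list[int]) -> list[int]:
    """Keep the first value, then store each value's difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the original values with a running sum (lossless)."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def packed_size(values: list[int]) -> int:
    """Size in bytes after packing as 64-bit integers and compressing with zlib."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw, 9))


# Synthetic column: one Unix timestamp every three seconds (illustrative data only).
timestamps = [1_700_000_000 + 3 * i for i in range(10_000)]
deltas = delta_encode(timestamps)

assert delta_decode(deltas) == timestamps  # round trip is exact
print("compressed size, raw values:", packed_size(timestamps))
print("compressed size, deltas:    ", packed_size(deltas))
```

With a fixed update interval almost every delta is the same small number, so the delta column is highly repetitive and compresses to far fewer bytes than the raw column. Real coordinates and speeds vary more than this synthetic timestamp column, so the gain on real data will be smaller.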
Apart from transfer times, a second factor leads to significant time savings. In our use case, MMZIP performs tasks 3 through 7 above simultaneously in memory. A GZIP solution must first decompress the data and then perform the enrichments. The data is then usually loaded into a database and indexed before the database can be queried to create a project dataset. MMZIP performs all of these tasks without ever saving these large datasets to disk or loading them into a database, which saves a great deal of processing time. MMZIP includes optimized routines that assign road segments and areas at very high speed: it performs the six assignments (one road segment and five regions per waypoint) at over 500,000 waypoints per second.
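The idea of enriching and filtering in a single in-memory pass, with no intermediate file or database, can be sketched with Python generators. This is only an illustration of the pattern, not MMZIP's implementation: the `Waypoint` type, the bounding-box regions and the function names are all hypothetical, and road-segment matching is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Waypoint:
    vehicle_id: str
    lat: float
    lon: float
    region: Optional[str] = None


# Hypothetical regions as (name, min_lat, min_lon, max_lat, max_lon) bounding boxes.
REGIONS = [
    ("downtown", 44.04, -123.11, 44.06, -123.08),
    ("campus", 44.03, -123.08, 44.05, -123.06),
]


def assign_region(points: Iterable[Waypoint]) -> Iterator[Waypoint]:
    """Enrichment step: tag each waypoint with the first region that contains it."""
    for p in points:
        name = next(
            (n for n, lat0, lon0, lat1, lon1 in REGIONS
             if lat0 <= p.lat <= lat1 and lon0 <= p.lon <= lon1),
            None,
        )
        yield Waypoint(p.vehicle_id, p.lat, p.lon, name)


def keep_regions(points: Iterable[Waypoint], wanted: set[str]) -> Iterator[Waypoint]:
    """Filter step: keep only waypoints that fall in one of the wanted regions."""
    return (p for p in points if p.region in wanted)


source = iter([
    Waypoint("a", 44.05, -123.09),
    Waypoint("b", 45.00, -122.00),
    Waypoint("c", 44.04, -123.07),
])
# Each waypoint flows through enrichment and filtering before the next one is read.
project = list(keep_regions(assign_region(source), {"downtown"}))
print(project)
```

Because the steps are chained lazily, only the records that survive the filter are ever materialized. At the quoted rate of over 500,000 waypoints per second, one billion waypoints would take at most about 2,000 seconds, or roughly 33 minutes, for the assignment step.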
MMZIP not only saves time but also lowers processing costs. It performs all these tasks while streaming the data one million records at a time. The one billion records therefore never need to be in memory at the same time, and MMZIP can run on a modest server with 32 GB of RAM. A GZIP solution, and especially a NoZIP solution, requires servers with more RAM and larger disks, which are much more expensive. The following table shows the difference in cost between the three solutions using AWS standard EC2 pricing.
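Processing a stream in fixed-size chunks is what keeps memory use bounded regardless of the total record count. The sketch below shows the general pattern in Python; the helper names and the stand-in filter are assumptions for illustration and say nothing about how MMZIP implements its streaming internally.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_chunks(records: Iterable[T], chunk_size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most chunk_size records from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(records)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def count_matching(records: Iterable[int], chunk_size: int) -> int:
    """Process a stream chunk by chunk; only one chunk is held in memory at a time."""
    kept = 0
    for chunk in in_chunks(records, chunk_size):
        kept += sum(1 for r in chunk if r % 10 == 0)  # stand-in for real filtering
    return kept


# Scaled-down stand-in: 25 records in chunks of 10 instead of a billion in chunks of a million.
sizes = [len(c) for c in in_chunks(range(25), 10)]
print(sizes)
print(count_matching(range(1_000), 100))
```

Peak memory depends on the chunk size and the record width, not on the size of the full dataset, which is why a fixed amount of RAM is enough for an arbitrarily long stream.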
This table shows that MMZIP saves over 90% in processing costs compared to GZIP. The server needed for the GZIP solution is more expensive than the MMZIP server and is needed for much longer. In addition, the GZIP solution has to store the data uncompressed in a database in order to select the records for a project dataset, so it requires much more disk space and its storage costs are much higher.
Apart from server costs, AWS also charges for transferring data out of AWS. The following table compares the download costs using the AWS standard rate for transfer volumes over 50 TB per month. This cost must be added if the data is processed outside of AWS.
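The cost comparison combines three components: server time, disk storage and data transfer out. A simple additive model like the following sketch can be used to reproduce that kind of estimate with your own figures. Every number below is a placeholder chosen for illustration and is neither an AWS price nor a result from our tests, and the function and field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobCost:
    compute: float
    storage: float
    egress: float

    @property
    def total(self) -> float:
        return self.compute + self.storage + self.egress


def estimate_cost(hours: float, hourly_rate: float,
                  storage_gb: float, storage_rate_gb: float,
                  egress_gb: float, egress_rate_gb: float) -> JobCost:
    """Simple additive model: server time + disk storage + data transfer out."""
    return JobCost(
        compute=hours * hourly_rate,
        storage=storage_gb * storage_rate_gb,
        egress=egress_gb * egress_rate_gb,
    )


# PLACEHOLDER inputs for illustration only; these are not AWS prices or measured values.
baseline = estimate_cost(hours=10, hourly_rate=2.0, storage_gb=500,
                         storage_rate_gb=0.1, egress_gb=100, egress_rate_gb=0.05)
smaller = estimate_cost(hours=3, hourly_rate=1.0, storage_gb=50,
                        storage_rate_gb=0.1, egress_gb=30, egress_rate_gb=0.05)
saving = 1 - smaller.total / baseline.total
print(f"baseline {baseline.total:.2f}, smaller {smaller.total:.2f}, saving {saving:.0%}")
```

The model makes the compounding effect visible: a job that needs a cheaper server, for fewer hours, with less disk and less data to transfer is cheaper in every term at once.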
This example is a fairly standard process that connected vehicle data engineers perform every day on billions of records. It demonstrates that MMZIP can realize significant cost savings while making data processing roughly seven times faster. CV data engineers need to perform many other tasks where MMZIP can realize similar savings in processing time and costs. Please email info@moonshadow.com or call us at 541.343.4281 for more information about MMZIP.