When using connected vehicle data for transportation analytics, one of the biggest challenges is the sheer size of the data. A typical dataset covering one month can contain five billion waypoint records, and working with datasets of this size is difficult and time-consuming. The good news is that any individual transportation analytics project uses only a small percentage of the data. To analyze freeway use in a county, for instance, we want a dataset that contains only the waypoints in that county, and only those generated while the vehicle was driving on a freeway, on-ramp or off-ramp. This may amount to only 0.5% of the data, or 25 million records. Datasets of this size are much easier to use for analytics. Selecting 25 million records out of a repository with five billion records, however, can take many days of server time.

Moonshadow developed MMZIP to address this problem, and we recently used it for a project in Alabama with Wejo data. This Alabama dataset contains just over five billion waypoints per month. MMZIP, as the name suggests, is a compression tool. The original Wejo data was 770GB in size. MMZIP compressed this data to 33GB, or about 4% of its original size. Long-term storage costs for this data are therefore reduced by about 96%, and reading the data from disk is roughly 25x faster. Creating the compressed MMZIP repository took six hours on a modest 8GB RAM server, but this only needs to be done once.
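
The percentages above are rounded. As a quick check, the short Python calculation below recomputes them from the two sizes quoted in the text (770GB and 33GB). The variable names are illustrative only. From these rounded sizes the compressed data comes to about 4.3% of the original, which means roughly 23 times less data to read; the 25x figure follows from rounding the ratio to 4%.

```python
# Sizes quoted in the article (rounded, in GB).
original_gb = 770
compressed_gb = 33

ratio = compressed_gb / original_gb            # fraction of the original size
savings = 1 - ratio                            # fraction of storage saved
read_reduction = original_gb / compressed_gb   # times less data to read from disk

print(f"compressed size: {ratio:.1%} of original")
print(f"storage saved:   {savings:.1%}")
print(f"data to read:    {read_reduction:.1f}x less")
```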

MMZIP shows how much memory is used by each column to provide data engineers with crucial information to reduce size.
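
To illustrate why a per-column size breakdown is useful, here is a minimal Python sketch that tallies the raw text bytes stored in each column of a CSV file. It is not MMZIP's report or its method: the function name `column_byte_sizes` and the sample column names are assumptions made for this example, and it measures uncompressed text rather than compressed size.

```python
import csv
import io
from collections import Counter

def column_byte_sizes(src):
    """Tally the UTF-8 bytes of raw text held in each CSV column."""
    reader = csv.reader(src)
    header = next(reader)                       # first row holds column names
    sizes = Counter({name: 0 for name in header})
    for row in reader:                          # one row in memory at a time
        for name, value in zip(header, row):
            sizes[name] += len(value.encode("utf-8"))
    return sizes

# Tiny hypothetical sample; the column names are illustrative only.
sample = io.StringIO(
    "vehicle_id,timestamp,lat,lon\n"
    "a1,2022-01-01T00:00:00Z,34.7304,-86.5861\n"
    "a1,2022-01-01T00:00:03Z,34.7305,-86.5859\n"
)
for name, size in column_byte_sizes(sample).most_common():
    print(f"{name:12s} {size:6d} bytes")
```

In this toy sample the timestamp column dominates, which is the kind of signal that tells a data engineer where a more compact encoding would pay off most.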

MMZIP was then used to select all waypoints in Madison County that were generated while a vehicle was driving on a highway or freeway. The Wejo data did not include county or road attributes, but MMZIP can assign these attributes to the waypoint data and filter by them at the same time. It took MMZIP 2 hours and 22 minutes to create a dataset with all waypoints on highways and freeways in Madison County. Holding the full dataset in memory would require a server with over 800GB of RAM, which is very expensive, but MMZIP uses a streaming protocol for enrichment, filtering and decompression, so the 770GB of data never needs to be in memory at the same time. The server requirements are therefore very modest: MMZIP never used more than 3GB of RAM to generate the ‘Madison Highways’ dataset. MMZIP selected 29 million records, or slightly more than 0.5% of the data, and exported them as a CSV. This project showed that MMZIP can read, enrich, filter and output about two billion waypoint records per hour per server. Storing large data repositories in MMZIP format not only reduces storage costs but also makes creating project datasets quick and inexpensive.
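
The general idea of streaming enrichment and filtering can be sketched in a few lines of Python. This is not MMZIP's implementation or API, and it leaves out decompression entirely. Everything named here is an assumption for illustration: the `lat`/`lon` column names, the `AREA` bounding box standing in for a real county boundary, and the caller-supplied `road_class_of` function standing in for real map matching. The point is only that each record is read, enriched, tested and written before the next one is loaded, so memory use stays flat regardless of input size.

```python
import csv
import io

# Hypothetical bounding box standing in for a real county boundary lookup.
AREA = {"min_lat": 34.5, "max_lat": 35.1, "min_lon": -86.8, "max_lon": -86.3}
# Hypothetical road classes to keep.
KEEP_ROAD_CLASSES = {"freeway", "highway"}

def enrich(rows, road_class_of):
    """Add 'in_area' and 'road_class' to each row, one row at a time."""
    for row in rows:
        lat, lon = float(row["lat"]), float(row["lon"])
        row["in_area"] = (AREA["min_lat"] <= lat <= AREA["max_lat"]
                          and AREA["min_lon"] <= lon <= AREA["max_lon"])
        row["road_class"] = road_class_of(lat, lon)
        yield row

def stream_select(src, dst, road_class_of):
    """Read, enrich, filter and write rows without holding the file in memory."""
    reader = csv.DictReader(src)
    fields = list(reader.fieldnames) + ["road_class"]
    # extrasaction="ignore" drops the helper 'in_area' flag from the output.
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for row in enrich(reader, road_class_of):
        if row["in_area"] and row["road_class"] in KEEP_ROAD_CLASSES:
            writer.writerow(row)
            kept += 1
    return kept

def toy_lookup(lat, lon):
    """Toy stand-in for map matching; a real lookup would use road geometry."""
    return "freeway" if lon < -86.55 else "local"

src = io.StringIO(
    "vehicle_id,lat,lon\n"
    "a1,34.73,-86.59\n"   # inside the box, toy lookup says freeway
    "b2,33.52,-86.80\n"   # outside the box
    "c3,34.80,-86.50\n"   # inside the box, toy lookup says local
)
dst = io.StringIO()
kept = stream_select(src, dst, toy_lookup)
print(kept)
print(dst.getvalue())
```

In practice `src` and `dst` would be open file handles rather than in-memory strings, and the generator chain would be the same.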

MMZIP can compress any type of columnar ‘spreadsheet’ data. The tool is optimized for time-series geospatial data generated by mobile apps, connected vehicles or IoT devices. MMZIP is available as a Linux command-line tool and as a C library, both of which you can run on your own servers in the cloud or on-prem. Data engineers can embed MMZIP in their data processing and data distribution pipelines. Moonshadow can customize MMZIP to support specific customer file formats to further increase compression ratios and the speed of decompression, filtering or data enrichment. Please contact Moonshadow Mobile for more information on MMZIP.