Both the California Consumer Privacy Act (CCPA) and Europe’s General Data Protection Regulation (GDPR) include a “right to be forgotten.” Consumers in California and Europe can ask organizations to delete all data about them, and businesses are required to comply with each request. This is challenging enough for databases built around people, where the primary key is a personID and all data about a person is linked to it. It can be difficult to locate all backups and remove all data related to a person from them. Many databases are also updated automatically from external sources, so organizations need to put filters in place to prevent consumers who have opted out from being added again by those updates.
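The ingest filter mentioned above can be sketched in a few lines. This is a minimal illustration, not a description of any particular product: the field name `person_id`, the function `filter_opted_out` and the sample records are all hypothetical, and a real suppression list would more likely hold hashed identifiers than raw ones.

```python
from typing import Iterable, Iterator


def filter_opted_out(updates: Iterable[dict], suppressed_ids: set[str]) -> Iterator[dict]:
    """Yield only the updates whose person is not on the suppression list."""
    for record in updates:
        # Hypothetical schema: every incoming record carries a "person_id" field.
        if record["person_id"] not in suppressed_ids:
            yield record


# Hypothetical data: p-102 has opted out and must not be re-added.
suppressed = {"p-102"}
incoming = [
    {"person_id": "p-101", "email": "a@example.com"},
    {"person_id": "p-102", "email": "b@example.com"},
]

kept = list(filter_opted_out(incoming, suppressed))
print([r["person_id"] for r in kept])  # ['p-101']
```

Running the filter on every automated update, rather than only once at deletion time, is what keeps a deleted consumer from reappearing with the next external data feed.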
This problem, however, is much more difficult when working with the enormous databases collected through mobile apps or connected vehicles, which are updated every second. This data usually does not contain an ID or a name for the consumer, because the primary key is often the ID of the mobile device, the connected vehicle or the trip. It is often possible, however, to derive the identity of a consumer by combining sleep or stop locations with other databases that do contain people’s home addresses. The CCPA specifically names geolocation data as “Personal Information” that can be used to identify people.
The first problem that makes this challenging for mobility data is its sheer size. A data aggregator like Wejo, for instance, receives tens of billions of movement updates from OEMs every day. This comes to more than one million updates per second during peak hours. For real-time traffic solutions, all of those records need to be checked in real time.
Some OEMs change the primary key of this data at least once per day, but some companies change it as frequently as every 15 minutes. Changing the primary key frequently makes it much harder to link a trip to a person, and therefore also much harder to find and delete the data related to that person.
Connected vehicle (CV) data is usually stored in files organized by time. These datasets are too large to store and index in a conventional database environment, so the data is not easily searchable. In a nationwide CV database, every file may contain data for any place in the country. By their very nature, vehicles and mobile devices also move around, so checking the data for someone’s home area is not sufficient: companies are obligated to remove the data about a Los Angeles resident even when that person is visiting San Francisco, or Maine. This means you have to open and check every single file to see whether it contains data about a specific person.
Removing a person from all data for the last year means loading trillions of records from disk into RAM and checking each one of them. This alone can be an expensive process. Under CCPA businesses have 15 days to respond, and requests from consumers come in constantly, so organizations need to run their data purge routines at least every other week.
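To make the cost of a full scan concrete, here is a back-of-the-envelope sketch. The record count and the per-server scan rate are illustrative assumptions chosen for the arithmetic, not measurements, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def server_days(total_records: float, records_per_second: float) -> float:
    """Days a single server needs to read and check every record once."""
    return total_records / records_per_second / SECONDS_PER_DAY


def servers_needed(total_records: float, records_per_second: float, window_days: float) -> int:
    """Smallest number of parallel servers that completes a full scan within the window."""
    return math.ceil(server_days(total_records, records_per_second) / window_days)


# Assumed inputs: 3 trillion records, 1 million records per second per server.
total = 3e12
rate = 1e6

print(round(server_days(total, rate), 1))  # 34.7 server-days for one full pass
print(servers_needed(total, rate, 14))     # 3 servers for a two-week purge cycle
print(servers_needed(total, rate, 7))      # 5 servers for a weekly purge cycle
```

Under these assumptions a single server cannot finish one pass inside a two-week cycle, which is why the purge has to be spread across parallel machines or the per-record cost has to come down.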
MMZIP addresses this challenge in several ways. First, it reduces the size of waypoint data by over 95%. This not only cuts storage costs by 95% but also means the data loads from disk into RAM about 20 times faster. That matters because, with data this large, just loading the data can consume weeks of server time.
MMZIP uses a streaming protocol: it reads in millions of records, processes them, writes the results to disk and then reads in the next set of records. A one-terabyte file can therefore be processed on an inexpensive server with just 32 GB of RAM, and it is not cost-prohibitive to set up a series of parallel servers that remove data continuously.
Unlike other compression tools, MMZIP does not require a separate decompression step before it can operate on the data. MMZIP reads in the compressed data, decompresses it internally, finds the Origins and Destinations that need to be excluded, determines the TripIDs that need to be removed, filters out all waypoints for these trips and writes a new compressed MMZIP file to disk. All of this is done without ever writing the uncompressed data to disk, which saves enormously on disk I/O and processing time.
MMZIP is fast: it performs filtering and enrichment at around one million records per second per server, which comes to over 80 billion records per day per server. Even so, a one-year national connected vehicle dataset can consist of trillions of records and can still take weeks of server time to filter on a single server. With four to eight parallel servers, MMZIP can remove opted-out consumers from a national dataset with trillions of records every week. Without MMZIP this process would take much longer and require much more expensive servers.
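The find, match and filter steps described above can be sketched in plain Python. This is an illustration of the general two-pass streaming technique only; it is not MMZIP’s API, file format or implementation. The record layout, function names, the 100 m radius and the sample coordinates are all assumptions, and in-memory lists stand in for the compressed files a real pipeline would stream from disk.

```python
import math
from itertools import islice
from typing import Iterable, Iterator


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def chunks(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size batches so only one batch is held in memory at a time."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch


def trips_to_remove(trips, excluded, radius_m, chunk_size=1_000_000) -> set[str]:
    """Pass 1: collect IDs of trips that start or end near an excluded location."""
    doomed: set[str] = set()
    for batch in chunks(trips, chunk_size):
        for t in batch:
            for lat, lon in excluded:
                if (haversine_m(*t["origin"], lat, lon) <= radius_m
                        or haversine_m(*t["destination"], lat, lon) <= radius_m):
                    doomed.add(t["trip_id"])
                    break
    return doomed


def filter_waypoints(waypoints, doomed, chunk_size=1_000_000) -> Iterator[list[dict]]:
    """Pass 2: stream waypoints batch by batch, dropping those of removed trips."""
    for batch in chunks(waypoints, chunk_size):
        yield [w for w in batch if w["trip_id"] not in doomed]


# Hypothetical sample data: t1 starts a few metres from the excluded location.
trips = [
    {"trip_id": "t1", "origin": (34.0522, -118.2437), "destination": (34.1000, -118.3000)},
    {"trip_id": "t2", "origin": (37.7749, -122.4194), "destination": (37.8044, -122.2712)},
]
waypoints = [
    {"trip_id": "t1", "lat": 34.06, "lon": -118.25},
    {"trip_id": "t2", "lat": 37.78, "lon": -122.40},
    {"trip_id": "t1", "lat": 34.08, "lon": -118.28},
    {"trip_id": "t2", "lat": 37.80, "lon": -122.30},
]
excluded = [(34.0523, -118.2436)]

doomed = trips_to_remove(trips, excluded, radius_m=100.0)
kept = [w for batch in filter_waypoints(waypoints, doomed, chunk_size=2) for w in batch]
print(sorted({w["trip_id"] for w in kept}))  # ['t2']
```

Because each pass touches one batch at a time, memory use is bounded by the batch size and the set of removed trip IDs rather than by the size of the dataset, which is the property that lets a large file be processed on a modest server.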
MMZIP is available as a command-line tool and as a C library that engineers can fully integrate into their own waypoint data processing pipelines on their own servers. For more information, email wander@moonshadow.com.