If you manage ‘Big Data’ then you’ve probably run into this problem: your Big Data is getting too big. Before you know it you are moving around files that take 20 or 30 hours to transfer and many hours to compress, decompress or even to back up. Six months after installing a new batch of backup servers they are maxed out and you have to add more. As a result you are constantly pressing users to reduce the amount of data you keep accessible for them. At Moonshadow we host and serve more than one billion records for our customers and we used to run into this problem constantly. A few years ago we came up with a simple idea to solve this problem: if we could shrink the footprint of each record we could reduce the size of the data. And if we could do this in the native format of the datastore then we would not have to deal with compression and decompression.
Over the last five years we have been developing our patent-pending proprietary datastore “Ephemeris” and recently reached an interesting milestone: we can store big data in less than one byte per value for databases that include a mixture of boolean, numeric and text fields. As an example, one population database we work with is the Registered Voters Database for California from our customer L2 Political. This database includes 16 million records and 649 columns storing more than 10 billion values. This database includes 211 text fields, a field type that usually uses four bytes per character.
This database takes up approximately 92GB in Microsoft SQL Server. Ephemeris stores these 10 billion values in just 8.9GB which comes to less than 1 byte per value. We were able to shrink the footprint by more than 90%. As a result we can fit 10 times as much data on the same server, our backup servers can store 10 times the number of records and when we move data around file transfer times are 10 times faster. In another example we have a database with 160 million records and 170 fields or a total of 27 billion values. Ephemeris takes just 16GB to store this data which comes to 0.63 bytes per value. This is smaller than the space that is taken up by the smallest datatype in a traditional SQL Server database. Using Ephemeris some of our customers reduced the number of servers required from several hundred to less than 90, resulting in savings of tens of thousands of dollars per month.
California Voters Database | MS SQL Server | Ephemeris |
---|---|---|
Footprint | 91.87 | 8.90 |
Records | 16,127,412 | 16,127,412 |
Fields | 649 | 649 |
Total # Values (records * fields) | 10,466,690,388 | 10,466,690,388 |
Average bytes per value | 9.42 | 0.91 |
Table 1. Ephemeris reduced the voters database for California by 90% (Data from L2 Political)
Composition of California Voters Database | Number of Fields | Percentage |
---|---|---|
Boolean | 384 | 59% |
Int | 48 | 7% |
Datetime | 4 | 1% |
Float | 2 | 0% |
Text | 211 | 33% |
Total number of fields | 649 | 100% |
Table 2. Field types in the Registered Voters Database for California from L2 Political
Ephemeris’ savings will vary depending on the type of data and the way it is composed. For population databases we are seeing savings of up to 90% but there will be databases where Ephemeris will not reduce the data footprint. To evaluate if Ephemeris can reduce the footprint of your data it is important to understand some of the basic principles of the Ephemeris technology.
Most traditional databases round the space they reserve per field up to the next whole byte (or, often, the next power of two in bytes). Ephemeris goes down to the bit level. If we can store a field value in five bits we use just those five bits and we use the remaining three bits for something else. In this way we pack the data very tightly and as a result Ephemeris can store much more data in the same amount of space without doing any compression. The table below compares the amount of bits used for different datatypes between Ephemeris and Microsft SQL Server.
MS SQL Server DataType | SQL Server (bytes) | SQL Server (bits) | Ephemeris (bits) | Average Savings(%) |
---|---|---|---|---|
boolean | 1 | 8 | 1 | 88% |
smallint | 2 | 16 | 2-16 | 44% |
integer | 4 | 32 | 2-32 | 38% |
bigint | 8 | 64 | 2-64 | 44% |
timestamp | 8 | 64 | 32 | 50% |
date | 3 | 24 | 16 | 33% |
real | 4 | 32 | 32 | 0% |
double precision real | 8 | 64 | 64 | 0% |
text | 4 + 1 per character | 32 + 8 per character | 2-32 | >90% |
point | 22 | 176 | 64 | 64% |
lseg | 38 | 304 | 16-64 | 89% |
box | 69 | 552 | 128 | 77% |
Table 3. Comparison: space used by Microsoft SQL Server and Ephemeris for different datatypes
Ephemeris is not only very prudent in how it uses space for the data values, Ephemeris also uses less data to store metadata such as Tuple overhead and TOAST Pointers. Table 2 below compares the amount of bits used for different types of metadata between Ephemeris and PostgreSQL.
PostgreSQL Tuple Overhead | PgSQL Server (bytes) | PgSQL Server (bits) | Ephemeris (bits) | Average Savings(%) |
---|---|---|---|---|
Tuple Header | 24 | 192 | 2-4 | 88% |
Tuple Pointer | 4 | 32 | 0 | 100% |
Tuple Padding | 0-7 | 0-56 | 0-7 | 0-70% |
PostgreSQL Toast | PgSQL Server(bytes) | PgSQL Server(bits) | Ephemeris (bits) | Average Savings(%) |
TOAST Pointer | 17 | 136 | 2-32 | 88% |
Table 4. Comparison of space used by PostgreSQL Server and Ephemeris for Tuple and TOAST data
In addition to packing fields very tightly Ephemeris uses other techniques to significantly reduce the data footprint. Ephemeris doesn’t store individual cell values per record for some of the fields, instead it stores cell value combinations. In a database table each field has a maximum number of possible values. In theory the possible number of combinations of cell values that occur in a database table is the smallest of (a) the number of Rows or (b) the multiplication of the number of possible values in column 1, column 2, column 3 . . . column N. In practice we find that there is a very high correlation of field values between a subset of the columns. In an address database for the US, for example, each ZIP-code will only occur in a small number of cities, one or two counties and in just one state. Instead of storing the ZIP, city, county and state as individual data for each record Ephemeris only stores the occurring combination of values for a record. We call this technique ‘Attribute Clumping’. Therefore the Ephemeris technology gives the best footprint reduction for databases where there is a high correlation between attribute values. It turns out that a lot of databases have high cell value correlations. The savings of Attribute Clumping are in addition to the savings for individual data types.
There are many other techniques besides attribute clumping and bit packing that we have developed. When we look at a new dataset we analyze its structure, the data and the relationships between the data values. If we believe we can realize significant savings we then develop an Ephemeris format that is small and can be traversed at very high speeds by the Ephemeris engine. For some of our customers we have reduced the number of servers they need by hundreds while also increasing the data access speeds. By using Ephemeris they reduced their monthly server costs dramatically.