Z-Order Optimization for Generic Multi-Dimensional Predicates
Data-skipping logic is an integral part of advanced table formats that store huge datasets. Read this blog to understand how Z-order optimization helps these formats skip data effectively when predicate patterns are not known in advance.
Data skipping at the source is very important for executing queries against large tables based on advanced table formats, such as Delta and Iceberg. Generally, data in these table formats is physically organized into multiple folders (each also identified as a partition), where each folder contains a set of data files in a specific format, such as Parquet.
Data skipping in these table formats generally works in two ways:
First, at the folder/partition level: in this scheme, one or more folders are identified from the query predicate (a filter expression on one or more columns), and the query execution engine reads only those folders, thereby skipping the data in all the other folders.
However, as is widely understood, folder-level skipping works well only if it is based on one or more low-cardinality columns. This ensures that the data files are not too small; partitioning on high-cardinality columns would produce too many small files, exploding the metadata for the data files and incurring significant maintenance and operational costs.
Second, at the file level: in this scheme, after the folders are identified, specific files within them are selected based on the query predicate, and the query execution engine reads only those data files, skipping all the others.
Skipping at the file level is a bit tricky and is usually achieved by maintaining statistics (such as per-column min/max values) in the file-level metadata. However, this works well only when a particular range of data (in accordance with certain column values) stays in a few specific data files instead of spreading across a large number of them. A minimal sketch of this pruning logic is shown below.
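To make this concrete, here is a minimal, hypothetical sketch of min/max based file pruning. The names and structures are illustrative only, not any specific table format's API:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str        # data file location
    min_val: int     # minimum of the predicate column in this file
    max_val: int     # maximum of the predicate column in this file

def prune_files(files, lo, hi):
    """Keep only files whose [min_val, max_val] range can overlap [lo, hi];
    all other files are skipped without being read."""
    return [f for f in files if f.max_val >= lo and f.min_val <= hi]

files = [
    FileStats("part-0.parquet", 0, 24),
    FileStats("part-1.parquet", 25, 49),
    FileStats("part-2.parquet", 50, 74),
    FileStats("part-3.parquet", 75, 99),
]

# Predicate 30 <= col <= 45 overlaps only part-1.parquet; the rest are skipped.
print([f.path for f in prune_files(files, 30, 45)])  # ['part-1.parquet']
```

Notice that pruning is effective here only because each value range is confined to a single file; if the ranges overlapped heavily across files, few files could be skipped.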
We can, of course, use range partitioning to restrict a range of values to a specific set of data files; however, that works only when the predicate is restricted to a single column.
Therefore, if we have predicates on arbitrary combinations of multiple columns, range partitioning alone would not produce the desired outcome, and this is where the importance of Z-ordering lies.
Z-ordering captures locality in a multi-dimensional space by means of a single dimension only, similar to a Geohash. By applying Z-order principles, a new column capturing multi-dimensional locality is computed from all the columns that could form part of any predicate formulation. After this computation, range partitioning is used on the computed column to distribute the data among multiple data files.
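As a rough illustration of how such a column can be computed, the sketch below builds a Z-order (Morton) value for two integer columns by interleaving their bits. Real implementations typically also normalize column values into a fixed-width integer range first:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y: bit i of x goes to position 2*i + 1
    and bit i of y to position 2*i, so points that are close in (x, y)
    tend to receive close z-values."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

# Nearby points in (x, y) map to nearby z-values:
print(z_value(2, 3))  # 13
print(z_value(3, 3))  # 15
```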
Distributing data in such a way ensures that any range of data, with respect to one or more of the desired columns, is limited to specific data files only, thereby increasing the efficiency of data skipping for any predicate formed on values from a subset of the columns that can appear in a predicate.
To illustrate, below is the data layout across four data files when a dataset consisting of two columns, ‘x’ and ‘y’, is range partitioned on ‘(x, y)’. Here, although range partitioning is done on ‘(x, y)’, locality is captured only on ‘x’; therefore, if ‘y’ is used in the predicate, all four data files need to be searched through.
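As a stand-in for the layout figure, consider a toy dataset of 16 points with ‘x’ and ‘y’ each ranging over 0..3, sorted on ‘(x, y)’ and split into four files. Each file ends up holding one full stripe of ‘x’ values, so a filter on ‘y’ alone overlaps every file:

```python
points = [(x, y) for x in range(4) for y in range(4)]
points.sort()                                  # lexicographic sort on (x, y)
files = [points[i:i + 4] for i in range(0, 16, 4)]
for i, f in enumerate(files):
    print(f"file-{i}: {f}")
# file-0: [(0, 0), (0, 1), (0, 2), (0, 3)]   <- each file is one x-stripe
# file-1: [(1, 0), (1, 1), (1, 2), (1, 3)]
# ...
# A predicate like y == 2 matches a row in every file, so min/max stats
# on y cannot skip any of the four files.
```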
On the other hand, if Z-ordering is used to capture locality with respect to both ‘x’ and ‘y’, the resultant distribution of data across the four files would ideally look quite different: each file holds a compact region of the (x, y) space rather than a stripe of ‘x’ values.
In the Z-order-based distribution of data among the data files, one can easily see that a predicate works effectively on ‘x’ alone, on ‘y’ alone, or on both ‘x’ and ‘y’, as the sketch below demonstrates.
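Continuing the same toy example, the sketch below splits the 16 points into four files by their z-value instead. Each file now covers a 2×2 quadrant of the (x, y) plane, so file-level stats can prune files for predicates on ‘x’ alone, on ‘y’ alone, or on both:

```python
def z_value(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y into a single Morton value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

points = [(x, y) for x in range(4) for y in range(4)]
points.sort(key=lambda p: z_value(*p))         # sort on the z-value instead
files = [points[i:i + 4] for i in range(0, 16, 4)]

def files_scanned(pred):
    """Count files holding at least one matching row; with per-file
    min/max stats, these are the files that cannot be skipped here."""
    return sum(1 for f in files if any(pred(x, y) for x, y in f))

print(files_scanned(lambda x, y: y == 2))             # 2 of 4 files
print(files_scanned(lambda x, y: x == 1))             # 2 of 4 files
print(files_scanned(lambda x, y: x == 1 and y == 2))  # 1 of 4 files
```

With the lexicographic layout from the earlier sketch, the same ‘y == 2’ predicate would touch all four files; under Z-ordering it touches only two, and a predicate on both columns touches just one.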
I hope you can now appreciate why Z-order optimization is so important for large table formats: it enables them to effectively skip large sets of irrelevant data when a multi-dimensional predicate is provided while reading a large table. In case of any doubts, queries, or feedback on this story, you can reach out to me on LinkedIn.