v0.5.0
Happy New Year, Qbeasters!! 🤶 🤶 🤶 A lot has happened in 2023, and we'd like to thank you for your support with this release.
What we have accomplished in the past months:
- String Histogram Indexing. Partitioning on lexicographically ordered (string) columns is a big pain in storage. But thanks to a few lines of pre-processing and a lot of backend engineering, we can group similar text values in the same file, so fine-grained data skipping on String IDs becomes efficient.
You need to:
1. Compute Histogram
import org.apache.spark.sql.delta.skipping.MultiDimClusteringFunctions
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, min}

def getStringHistogramStr(df: DataFrame, columnName: String, numBins: Int): String = {
  val binStarts = "__bin_starts"
  // Assign each distinct value to one of numBins range partitions
  val stringPartitionColumn =
    MultiDimClusteringFunctions.range_partition_id(col(columnName), numBins)

  df
    .select(columnName)
    .distinct()
    .na.drop
    .groupBy(stringPartitionColumn)
    // The smallest value in each partition is the start of that bin
    .agg(min(columnName).alias(binStarts))
    .select(binStarts)
    .orderBy(binStarts)
    .collect()
    .map { r =>
      val s = r.getAs[String](0)
      s"'$s'"
    }
    // Render the bin starts as a list literal: ['a','b',...]
    .mkString("[", ",", "]")
}

val histogram = getStringHistogramStr(df, "test_col_name", 100)
2. Configure the Qbeast table as follows:
val columnStats = s"""{"test_col_name_histogram":$histogram}"""

df
  .write
  .format("qbeast")
  .option("columnsToIndex", "test_col_name:histogram")
  .option("columnStats", columnStats)
  .save(targetPath)
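To give an intuition for why sorted bin starts help group similar strings, here is a minimal, Spark-free sketch of mapping a value to a histogram bin. `StringHistogramSketch` and `binIndex` are hypothetical names used only for illustration; they are not part of the Qbeast API, and the sketch assumes plain JVM string ordering, which may differ from what the indexer does internally.

```scala
object StringHistogramSketch {

  // Hypothetical helper, not part of the Qbeast API.
  // `binStarts` must be sorted in ascending order, like the output of step 1.
  // Returns the index of the last bin whose start is <= value;
  // values smaller than the first bin start fall into bin 0.
  def binIndex(binStarts: Seq[String], value: String): Int =
    math.max(0, binStarts.lastIndexWhere(start => start <= value))

  def main(args: Array[String]): Unit = {
    // Made-up bin starts for illustration only
    val binStarts = Seq("apple", "grape", "melon", "peach")
    Seq("avocado", "banana", "kiwi", "zucchini").foreach { v =>
      println(s"$v -> bin ${binIndex(binStarts, v)}")
    }
  }
}
```

Values that sort close to each other land in the same or neighbouring bins, which is what lets a filter on a String ID touch only a few files.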
- Upgrade to Spark 3.4.1 and Delta 2.4.0. Read what is new in: #211
- Add Delta File Skipping. #235 To enable full compatibility with other formats, we allow non-indexed files to be written in the same table. These files should be skipped efficiently, even if no multidimensional indexing is applied, so we need to rely on the underlying file format for this task. In this case, we add the Delta File Skipping feature.
- Fixed #246. If we create an existing Qbeast Table in Spark Catalog, there's no need to add any options such as columnsToIndex or cubeSize.
- Fixed #213. Now we can append data correctly with an INSERT INTO query.
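The two fixes above could be exercised roughly as follows. This is a sketch under assumptions: the table name `student`, the source view `visitor_students`, and the `tablePath` argument are placeholders, and the `SparkSession` is assumed to be already configured with the Qbeast session extension and catalog.

```scala
import org.apache.spark.sql.SparkSession

object CatalogSketch {

  // Sketch only: `student`, `visitor_students` and `tablePath` are placeholders.
  def registerAndAppend(spark: SparkSession, tablePath: String): Unit = {
    // Register a Qbeast table that already exists at tablePath;
    // no columnsToIndex or cubeSize options are passed here.
    spark.sql(s"CREATE TABLE student USING qbeast LOCATION '$tablePath'")

    // Append rows to the registered table with INSERT INTO
    spark.sql("INSERT INTO student SELECT * FROM visitor_students")
  }
}
```

The exact `CREATE TABLE` form that #246 covers may differ; check the linked issue before relying on this shape.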
Contributors
@Jiaweihu08 @alexeiakimov @cdelfosse @osopardo1
Full Changelog: https://github.com/Qbeast-io/qbeast-spark/compare/v0.4.0...v0.5.0