v0.5.0

Happy New Year, Qbeasters!! 🤶 🤶 🤶 A lot happened in 2023, and we'd like to thank you for your support with this release.

What we have accomplished in the past months:

String Histogram Indexing. Partitioning lexicographically ordered (string) columns is a big pain in storage. With a few lines of pre-processing and a lot of backend engineering, we can now group similar text values in the same file, which makes fine-grained data skipping on String IDs efficient.
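To make the grouping idea concrete, here is a minimal sketch in plain Scala (no Spark) of how a sorted list of bin starts can assign a string to a bin. The object `HistogramBinSketch`, the function `binIndex`, and the sample values are hypothetical and only illustrate the concept; this is not how qbeast-spark implements it internally, and Scala's default string ordering may differ from the one Spark uses.

```scala
// Illustration only: none of these names come from qbeast-spark.
object HistogramBinSketch {

  // Returns the index of the last bin whose start is <= value.
  // Values smaller than every bin start fall into bin 0.
  def binIndex(binStarts: Seq[String], value: String): Int =
    math.max(0, binStarts.lastIndexWhere(_ <= value))

  def main(args: Array[String]): Unit = {
    // Sorted bin starts, similar in shape to a (very small) string histogram
    val binStarts = Seq("a", "g", "n", "t")
    val ids = Seq("apple", "gamma", "grape", "zebra")

    // Values that share a bin are "similar" and could be stored together
    ids.groupBy(id => binIndex(binStarts, id)).toSeq.sortBy(_._1).foreach {
      case (bin, values) => println(s"bin $bin: ${values.mkString(", ")}")
    }
  }
}
```

With values grouped this way, a lookup for a single ID only has to consider the bin that covers it, which is the intuition behind skipping files on string columns.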

To use String Histogram Indexing, you need to:

1. Compute the histogram:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.delta.skipping.MultiDimClusteringFunctions
import org.apache.spark.sql.functions.{col, min}

// Builds the histogram as a string of the form ['a','b',...]:
// the sorted lower bounds (bin starts) of up to numBins range partitions.
def getStringHistogramStr(df: DataFrame, columnName: String, numBins: Int): String = {
  val binStarts = "__bin_starts"
  // Assigns each value of the column to one of numBins range partitions
  val stringPartitionColumn =
    MultiDimClusteringFunctions.range_partition_id(col(columnName), numBins)

  df
    .select(columnName)
    .distinct()
    .na.drop()
    .groupBy(stringPartitionColumn)
    .agg(min(columnName).alias(binStarts)) // smallest value in each bin
    .select(binStarts)
    .orderBy(binStarts)
    .collect()
    .map { r =>
      val s = r.getAs[String](0)
      s"'$s'" // values are quoted as-is, without escaping
    }
    .mkString("[", ",", "]")
}

val histogram = getStringHistogramStr(df, "test_col_name", 100)

2. Configure the Qbeast table as follows:

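// The stats key is the column name followed by the "_histogram" suffix
val columnStats = s"""{"test_col_name_histogram":$histogram}"""

df
  .write
  .format("qbeast")
  .option("columnsToIndex", "test_col_name:histogram") // column:histogram selects the histogram type
  .option("columnStats", columnStats)
  .save(targetPath)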
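Once the table is written, a query that filters on the indexed string column is the kind of workload that should benefit from the data skipping described above. The following is a minimal sketch, assuming a Spark session already set up for Qbeast; `tablePath` and the filter value `some_id` are hypothetical placeholders, and `tablePath` should point to the location the table was saved to.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumes the session is already configured with the Qbeast dependencies
val spark = SparkSession.builder().getOrCreate()

// Hypothetical placeholder: the path the Qbeast table was saved to
val tablePath = "/tmp/qbeast/test_table"

// Load the table through the qbeast format
val qbeastDf = spark.read.format("qbeast").load(tablePath)

// Point lookup on the histogram-indexed String column
qbeastDf
  .filter(col("test_col_name") === "some_id")
  .show()
```

How many files are actually skipped depends on the data distribution and on the number of bins chosen for the histogram.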

Upgrade to Spark 3.4.1 and Delta 2.4.0. Read what's new in #211.
Add Delta File Skipping (#235). To enable full compatibility with other formats, we allow non-indexed files to be written to the same table. These files should still be skipped efficiently even though no multidimensional indexing is applied to them, so we rely on the underlying file format for this task. For Delta, that means adding the Delta File Skipping feature.
Fixed #246. When registering an existing Qbeast table in the Spark Catalog, there is no longer any need to pass options such as columnsToIndex or cubeSize.
Fixed #213. Data can now be appended correctly with an INSERT INTO query.
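The two fixes above can be sketched together as follows. This is an illustrative example rather than an excerpt from the project: it assumes a Spark session already configured with the Qbeast extension and catalog as described in the project documentation, and the table name `test_table`, the view `new_rows`, and `existingTablePath` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumes the session is already configured with the Qbeast extension and catalog
val spark = SparkSession.builder().getOrCreate()

// Hypothetical placeholder: location of a Qbeast table that already exists on storage
val existingTablePath = "/tmp/qbeast/test_table"

// Register the existing table in the catalog without columnsToIndex or cubeSize (#246)
spark.sql(s"CREATE TABLE test_table USING qbeast LOCATION '$existingTablePath'")

// Append rows through SQL (#213); new_rows is a hypothetical view
// whose schema matches the table
spark.sql("INSERT INTO test_table SELECT * FROM new_rows")
```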

Contributors

@Jiaweihu08 @alexeiakimov @cdelfosse @osopardo1

Full Changelog: https://github.com/Qbeast-io/qbeast-spark/compare/v0.4.0...v0.5.0
