v2.0.0-rc.4
What's Changed
Breaking Changes
- fix!: null handling when using NOT with scalar indices by @wjones127 in https://github.com/lance-format/lance/pull/5270
- feat!: track cumulative wall time in analyze plan by @wkalt in https://github.com/lance-format/lance/pull/5505
- fix!: check metric compatibility before using vector index by @wjones127 in https://github.com/lance-format/lance/pull/5609
- feat!: define default index name and return IndexMetadata after building index by @wjones127 in https://github.com/lance-format/lance/pull/5645
- feat!: make v2 manifest default by @wojiaodoubao in https://github.com/lance-format/lance/pull/5656
- refactor!: introduce storage options accessor by @jackye1995 in https://github.com/lance-format/lance/pull/5728
New Features
- feat: support using FTS as a filter in vector search by @wojiaodoubao in https://github.com/lance-format/lance/pull/4928
- feat: support when_matched_delete in merge_insert by @jtuglu1 in https://github.com/lance-format/lance/pull/4939
- feat: add support for large minichunk size (u32) in format v2.2 by @niyue in https://github.com/lance-format/lance/pull/4959
- feat: support GEO RTree index by @ddupg in https://github.com/lance-format/lance/pull/5034
- feat: support global tag retrieval and improve tag api by @majin1102 in https://github.com/lance-format/lance/pull/5088
- feat: support create vector index distributedly by @chenghao-guo in https://github.com/lance-format/lance/pull/5117
- feat: support add sub-column to struct col by @wojiaodoubao in https://github.com/lance-format/lance/pull/5126
- feat: distributed range-based BTree index by @steFaiz in https://github.com/lance-format/lance/pull/5202
- feat: strategized plan compaction by @zhangyue19921010 in https://github.com/lance-format/lance/pull/5233
- feat: dataset supports deep_clone by @majin1102 in https://github.com/lance-format/lance/pull/5250
- feat: cleanup only scan managed files by @majin1102 in https://github.com/lance-format/lance/pull/5338
- feat: support map data type in lance format version 2.2 by @xloya in https://github.com/lance-format/lance/pull/5349
- feat: add RTree index spec in table format by @ddupg in https://github.com/lance-format/lance/pull/5360
- feat(java): support row lineage and cdf apis by @yanghua in https://github.com/lance-format/lance/pull/5362
- feat: disable default features on internal use by @valkum in https://github.com/lance-format/lance/pull/5372
- feat(cdf): support set start/end timestamp in cdf by @zhangyue19921010 in https://github.com/lance-format/lance/pull/5378
- feat(blob_v2): add external blob support by @Xuanwo in https://github.com/lance-format/lance/pull/5385
- feat(blob_v2): add dedicated blob support by @Xuanwo in https://github.com/lance-format/lance/pull/5406
- feat: fallback to CPU if GPU accelerating is unavailable by @BubbleCal in https://github.com/lance-format/lance/pull/5407
- feat(blob_v2): add packed blob support by @Xuanwo in https://github.com/lance-format/lance/pull/5413
- feat: allow python tracing / logging to be independently configured by @westonpace in https://github.com/lance-format/lance/pull/5415
- feat: add additional index APIs to support count rows split plan by @jackye1995 in https://github.com/lance-format/lance/pull/5447
- feat(java): support multi-bases for writing database by @ddupg in https://github.com/lance-format/lance/pull/5450
- feat(blob_v2): add BlobArray API for user input by @Xuanwo in https://github.com/lance-format/lance/pull/5451
- feat: upgrade lance-namespace to 0.3.1 and add missing apis by @jackye1995 in https://github.com/lance-format/lance/pull/5457
- feat(python): support cleanup_with_policy by @ddupg in https://github.com/lance-format/lance/pull/5458
- feat: support dropping sub-column of list(struct) by @wojiaodoubao in https://github.com/lance-format/lance/pull/5469
- feat(blob_v2): add GC support by @Xuanwo in https://github.com/lance-format/lance/pull/5473
- feat: add py.typed marker file by @jonded94 in https://github.com/lance-format/lance/pull/5479
- feat(python): expose the distance_range param in the Python scanner nearest config by @xloya in https://github.com/lance-format/lance/pull/5486
- feat(java): simplify the use of optional in jni by @ddupg in https://github.com/lance-format/lance/pull/5488
- feat(blob_v2): add Python API for Blob v2 by @Xuanwo in https://github.com/lance-format/lance/pull/5491
- feat(python): add DatasetBasePath stub to improve IDE hints by @ddupg in https://github.com/lance-format/lance/pull/5503
- feat(memtest): add macos support by @Xuanwo in https://github.com/lance-format/lance/pull/5510
- feat(java): add full text search api by @wojiaodoubao in https://github.com/lance-format/lance/pull/5563
- feat: support credentials vending in directory namespace by @jackye1995 in https://github.com/lance-format/lance/pull/5566
- feat: upgrade lance-namespace to 0.4.0 by @jackye1995 in https://github.com/lance-format/lance/pull/5568
- feat: add skip_merge for FTS index build by @BubbleCal in https://github.com/lance-format/lance/pull/5570
- feat(java): add builder-style scalar index params by @wojiaodoubao in https://github.com/lance-format/lance/pull/5581
- feat: optimize rle implementation by @Xuanwo in https://github.com/lance-format/lance/pull/5586
- feat: support FixedSizeList by @wkalt in https://github.com/lance-format/lance/pull/5593
- feat: add dictionary encoding for 64bit types like int64/double by @Xuanwo in https://github.com/lance-format/lance/pull/5594
- feat: support merge_insert with source dedupe on first seen value by @jackye1995 in https://github.com/lance-format/lance/pull/5603
- feat: support truncate table api by @zhangyue19921010 in https://github.com/lance-format/lance/pull/5604
- feat: add Error::External variant for preserving user errors by @wjones127 in https://github.com/lance-format/lance/pull/5606
- feat: upgrade lance-namespace to 0.4.5 by @jackye1995 in https://github.com/lance-format/lance/pull/5611
- feat: refactor use of Error::io by @lichuang in https://github.com/lance-format/lance/pull/5612
- feat(java): add detached flag to commitTransaction by @wojiaodoubao in https://github.com/lance-format/lance/pull/5626
- feat: add parts_searched metrics for FTS by @BubbleCal in https://github.com/lance-format/lance/pull/5627
- feat: improve the random access file benchmark by @westonpace in https://github.com/lance-format/lance/pull/5628
- feat(oss): add sts token support for aliyun oss via storage_options by @hh23485 in https://github.com/lance-format/lance/pull/5632
- feat: merge-insert with primary key dedupe by @jackye1995 in https://github.com/lance-format/lance/pull/5633
- feat(java): expose index description and statistics by @majin1102 in https://github.com/lance-format/lance/pull/5655
- feat: allow configure temp dir size for datafusion exec by @jackye1995 in https://github.com/lance-format/lance/pull/5659
- feat(java): add support for optimizing indices by @majin1102 in https://github.com/lance-format/lance/pull/5663
- feat: make on arg optional for merge insert api by @yanghua in https://github.com/lance-format/lance/pull/5667
- feat: make OneShotPartitionStream pub by @timsaucer in https://github.com/lance-format/lance/pull/5672
- feat: support array_contains in LabelList scalar index by @fenfeng9 in https://github.com/lance-format/lance/pull/5681
- feat: add order to primary key by @touch-of-grey in https://github.com/lance-format/lance/pull/5683
- feat: use independent region manifest for MemWAL by @touch-of-grey in https://github.com/lance-format/lance/pull/5689
- feat: add stats() method to ObjectStoreRegistry by @wkalt in https://github.com/lance-format/lance/pull/5706
- feat: support dynamic context for lance namespace by @jackye1995 in https://github.com/lance-format/lance/pull/5710
- feat: make blob v2 dedicated threshold configurable by @yanghua in https://github.com/lance-format/lance/pull/5719
- feat: cleanup partial idx files when merging distributed vector index by @yanghua in https://github.com/lance-format/lance/pull/5729
- feat: expose blob handling APIs to python by @Xuanwo in https://github.com/lance-format/lance/pull/5790
- feat: add blob handling support for fragment by @Xuanwo in https://github.com/lance-format/lance/pull/5801
Bug Fixes
- fix: correct null_count aggregation in boolean statistics collection by @YinZheng-Sun in https://github.com/lance-format/lance/pull/4839
- fix: remove logging for project_batch by @westonpace in https://github.com/lance-format/lance/pull/5267
- fix: stop documenting FTS index type, standardize on INVERTED by @mackrorysd in https://github.com/lance-format/lance/pull/5315
- fix: don't allow change blob version during update by @Xuanwo in https://github.com/lance-format/lance/pull/5386
- fix: take_blobs_by_indices fails with stable row IDs on fragment 1+ by @jmhsieh in https://github.com/lance-format/lance/pull/5392
- fix: respect index metric when user overrides by @BubbleCal in https://github.com/lance-format/lance/pull/5395
- fix: remove expensive clone in bitmap search by @westonpace in https://github.com/lance-format/lance/pull/5409
- fix: fix vector index prewarm index by @xloya in https://github.com/lance-format/lance/pull/5412
- fix: panic unwrap on None in decoder.rs by @camilesing in https://github.com/lance-format/lance/pull/5424
- fix: dir namespace cloud storage path removes one subdir level by @jackye1995 in https://github.com/lance-format/lance/pull/5464
- fix: make column name lookups case-insensitive by @wjones127 in https://github.com/lance-format/lance/pull/5465
- fix: ensure trailing slash is normalized in rest adapter by @jackye1995 in https://github.com/lance-format/lance/pull/5499
- fix(java): support FixedSizeList for java LanceField by @fangbo in https://github.com/lance-format/lance/pull/5509
- fix: head external manifest object happened 404 NotFound error by @hushengquan in https://github.com/lance-format/lance/pull/5512
- fix: json's arrow extension metadata missing by @Xuanwo in https://github.com/lance-format/lance/pull/5527
- fix: infer multivector sampling rows by @BubbleCal in https://github.com/lance-format/lance/pull/5534
- fix: support ManifestNamingSchemeV2 with unordered object stores by @wjones127 in https://github.com/lance-format/lance/pull/5539
- fix: merge_insert uses full schema path for reordered columns by @wjones127 in https://github.com/lance-format/lance/pull/5541
- fix: allow storage options provider without expires_at_millis by @jackye1995 in https://github.com/lance-format/lance/pull/5542
- fix(ci): use pull_request_target for fork PR reviews by @wjones127 in https://github.com/lance-format/lance/pull/5544
- fix: restore decrease max_fragment_id in manifest by @majin1102 in https://github.com/lance-format/lance/pull/5554
- fix: improve error handling for environment variable parsing by @XuQianJin-Stars in https://github.com/lance-format/lance/pull/5560
- fix: panic when lance.auto_cleanup.interval is set to 0 by @majin1102 in https://github.com/lance-format/lance/pull/5571
- fix(python): correct type hint for to_tensor_fn parameter by @AndreaBozzo in https://github.com/lance-format/lance/pull/5577
- fix: avoid panic while hitting non-null empty multi-vector by @Xuanwo in https://github.com/lance-format/lance/pull/5588
- fix: filter garbage entries from null maps during encoding by @wkalt in https://github.com/lance-format/lance/pull/5591
- fix: reduce verbosity of errors due to string conversion by @wjones127 in https://github.com/lance-format/lance/pull/5600
- fix: remove imports that are not needed by @westonpace in https://github.com/lance-format/lance/pull/5651
- fix: allow nearest applied in default_scan_options by @chenghao-guo in https://github.com/lance-format/lance/pull/5666
- fix: trait Array has been sealed in arrow new version by @Xuanwo in https://github.com/lance-format/lance/pull/5690
- fix: project_by_schema now reorders fields inside List types by @wjones127 in https://github.com/lance-format/lance/pull/5703
- fix: allocate too much memory for block max scores by @BubbleCal in https://github.com/lance-format/lance/pull/5718
- docs: in dataset.rs, fix comment for get_fragments by @cmccabe in https://github.com/lance-format/lance/pull/5724
- fix(python): close SQLite connections in BatchUDFCheckpoint by @wjones127 in https://github.com/lance-format/lance/pull/5733
- fix: remove credential vending features from python and java bindings by @jackye1995 in https://github.com/lance-format/lance/pull/5737
- fix: allow unused_unsafe for __cpuid to support both stable and nightly by @jackye1995 in https://github.com/lance-format/lance/pull/5793
- fix: fix remap so that it handles deletions correctly by @westonpace in https://github.com/lance-format/lance/pull/5828
Documentation
- docs: fix Append call in distributed write guide by @rongou in https://github.com/lance-format/lance/pull/5439
- docs: fix and improve the description about row id by @yanghua in https://github.com/lance-format/lance/pull/5463
- docs: add specification for handling indices by @wjones127 in https://github.com/lance-format/lance/pull/5543
- docs: fix duplicate words in comments and error messages by @XuQianJin-Stars in https://github.com/lance-format/lance/pull/5548
- docs: add research paper link to the landing page by @prrao87 in https://github.com/lance-format/lance/pull/5549
- docs: auto-build refactored namespace integrations doc by @jackye1995 in https://github.com/lance-format/lance/pull/5562
- docs: rename RowIdTreeMap to RowAddrTreeMap in rtree.md by @ddupg in https://github.com/lance-format/lance/pull/5564
- docs: add docs for DuckDB extension by @prrao87 in https://github.com/lance-format/lance/pull/5578
- docs: update Lance-DuckDB docs to latest version 0.4.1 by @prrao87 in https://github.com/lance-format/lance/pull/5613
Performance Improvements
- perf: do not instrument self in multipart upload by @westonpace in https://github.com/lance-format/lance/pull/5416
- perf: various btree performance improvements by @westonpace in https://github.com/lance-format/lance/pull/5446
- perf: reuse session context by @wjones127 in https://github.com/lance-format/lance/pull/5462
- perf: offload IVF partition build to CPU pool by @BubbleCal in https://github.com/lance-format/lance/pull/5551
- perf: materialize the tokens after WAND done by @BubbleCal in https://github.com/lance-format/lance/pull/5572
- perf: compute HNSW level counts after build by @BubbleCal in https://github.com/lance-format/lance/pull/5590
- perf: improve SQ query speed by @BubbleCal in https://github.com/lance-format/lance/pull/5596
- perf: reuse zstd compressors in encoding by @wkalt in https://github.com/lance-format/lance/pull/5598
- perf: use binary search to skip documents by @BubbleCal in https://github.com/lance-format/lance/pull/5636
- perf: improve FTS indexing perf and reduce memory footprint by @BubbleCal in https://github.com/lance-format/lance/pull/5650
- perf: avoid copying tokens while merging by @BubbleCal in https://github.com/lance-format/lance/pull/5661
- perf: tighten WAND block score upper bound by @BubbleCal in https://github.com/lance-format/lance/pull/5668
- perf: cache global BM25 idf per query by @BubbleCal in https://github.com/lance-format/lance/pull/5727
- perf: use LRU cache for session contexts in get_session_context by @wjones127 in https://github.com/lance-format/lance/pull/5736
- perf: merge partitions in stream style by @BubbleCal in https://github.com/lance-format/lance/pull/5754
Other Changes
- refactor: write bitmap index statistics in file instead by @Xuanwo in https://github.com/lance-format/lance/pull/5251
- refactor: rename RowIdTreeMap to RowAddrTreeMap by @yanghua in https://github.com/lance-format/lance/pull/5266
- refactor: rename RowIdMask to RowAddrMask by @yanghua in https://github.com/lance-format/lance/pull/5281
- refactor: consolidate logic between zonemap and bloomfilter indexes by @fenfeng9 in https://github.com/lance-format/lance/pull/5374
- refactor: split dataset tests in a tests mod by @Xuanwo in https://github.com/lance-format/lance/pull/5387
- refactor: use the same path for dedicated and packed blob by @Xuanwo in https://github.com/lance-format/lance/pull/5449
- refactor: add store_prefix to lance-io's ObjectStore by @cmccabe in https://github.com/lance-format/lance/pull/5468
- refactor: expose take_blobs_by_addresses to python by @Xuanwo in https://github.com/lance-format/lance/pull/5474
- refactor: support java 21, drop java 8 by @cmccabe in https://github.com/lance-format/lance/pull/5565
- refactor: allow switching to bitpack inside RLE by @Xuanwo in https://github.com/lance-format/lance/pull/5595
- refactor: introduce RowSetOps and refactor RowAddrTreeMap by @yanghua in https://github.com/lance-format/lance/pull/5624
- refactor(python): migrate torch.jit.script to torch.compile by @wjones127 in https://github.com/lance-format/lance/pull/5759
- test: fix tests broken by pandas 3 release by @westonpace in https://github.com/lance-format/lance/pull/5786
Full Changelog: https://github.com/lance-format/lance/compare/release-root/2.0.0-beta.N...v2.0.0-rc.4