Desbordante 2.4.0
Hi everyone! Almost seven months have passed since our last release, and we are excited to introduce the new version of Desbordante. This update is packed with features and includes the following user-facing improvements:
- Validation of Constant Denial Constraints (DCs): We have enhanced the validation functionality for denial constraints by adding the ability to handle constant DCs. Additionally, we included examples demonstrating how to use the validation of this pattern, including data cleaning examples that utilize exception discovery capabilities.
- Graph Functional Dependencies (GFDs) Discovery: Desbordante now includes an algorithm for searching GFDs within datasets. Preliminary experiments have shown that processing a couple of million vertices can be done in a reasonable time on a standard computer. An example of its usage has been added.
- Validation of Differential Dependencies (DDs): We have introduced basic functionality for validating DDs. Currently, user-defined metrics are not supported; only the built-in ones can be used. As usual for validation tasks, there is an exception discovery feature, along with an example of how to use it.
- Validation of Conditional Functional Dependencies (CFDs): We have added the capability to validate CFDs from the core of Desbordante. The new version operates on partitions and should be advantageous when validating multiple CFDs. In the future, we plan to introduce a straightforward algorithm (without index construction) that will be optimal for validating a single CFD. A separate usage example has been included, along with the ability to discover exceptions.
- Validation of Matching Dependencies (MDs): We have added validation for the most expressive type of pattern, which includes exception discovery capabilities. This pattern is capable of capturing subtle inconsistencies in data by utilizing various matching functions. Custom user functions are supported. An example demonstrating data cleaning techniques using this pattern has been added.
- Revamped Example for Fuzzy Algebraic Constraints (AC): The example for fuzzy algebraic constraints has been completely redesigned. Users can now understand the algorithm and its parameters without referring to the article, aided by a clear example illustrating the underlying concepts.
- Colab Notebooks with Examples: We aim to lower the entry barrier for potential users of Desbordante, and to this end, we now provide notebooks with examples in addition to the examples in the Desbordante-core repository. These notebooks can be run in Google Colab (without needing to install our pip package on your machine) to familiarize users with the profiler and help them choose the appropriate pattern. Ultimately, we plan to include clickable links in the README.md to the notebook versions of the examples. In this release, over ten patterns have notebook versions of their examples.
- Serialization of Found Patterns: We have added the ability to serialize certain types of patterns and plan to implement this across all supported patterns in the future. Besides the obvious benefits, serialization serves as the foundation for Reflejo, a profiling module that operates in digest mode. Reflejo is similar to traditional profilers in that it takes a dataset and a search profile and returns the identified patterns. Unlike the core package, Reflejo focuses on providing an overview of the dataset rather than user-directed spot checks. Additionally, it offers various functionalities such as pattern search tuning based on time limits, management of profiling results, and failure response scenarios. We will be releasing this pip package soon, so stay tuned for announcements.
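
Several of the items above pair validation with exception discovery. To illustrate what that means in general, here is a minimal plain-Python sketch for a single-tuple constant denial constraint. It does not use the Desbordante API: the function `find_dc_exceptions`, the predicate encoding, and the toy table are all hypothetical and exist only for illustration.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# Hypothetical encoding: a predicate is (column, comparison operator, constant).
Predicate = Tuple[str, Callable[[Any, Any], bool], Any]


def find_dc_exceptions(rows: List[Row], predicates: List[Predicate]) -> List[int]:
    """Return indices of rows violating the constant DC not(p1 and ... and pn).

    A row is an exception when every predicate holds for it at once.
    """
    return [
        i
        for i, row in enumerate(rows)
        if all(op(row[col], const) for col, op, const in predicates)
    ]


# Made-up toy data, used only to exercise the function.
rows = [
    {"state": "NY", "salary": 5200, "tax_rate": 6.5},
    {"state": "NY", "salary": 4100, "tax_rate": 0.0},
    {"state": "TX", "salary": 3900, "tax_rate": 0.0},
    {"state": "NY", "salary": 6100, "tax_rate": 7.0},
]

# DC: no row may have state == "NY" together with tax_rate < 1.0.
dc = [("state", operator.eq, "NY"), ("tax_rate", operator.lt, 1.0)]

exceptions = find_dc_exceptions(rows, dc)
dc_holds = not exceptions  # the DC holds only if there are no exceptions
print(dc_holds, exceptions)
```

The list of exception indices is what makes validation useful for data cleaning: instead of a bare yes/no answer, it points at the rows to inspect or repair. The examples shipped with Desbordante show how to obtain this information through the library itself.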
Finally, we have published our first major article about Desbordante, which provides a high-level overview of the tool, presents its architecture, and outlines our vision and future plans. You can read it here: https://dl.acm.org/doi/10.1145/3703323.3703725.