Variable Fingerprint Size & Better Testing

- Added fingerprint size option to `cf.init`
	After some reasoning, I concluded that having the user choose
	fingerprint size is the best choice in terms of API design. 
	Users care about error rate but it is not possible to have them
	directly decide it as it is the result of choosing fingerprint size
	and bucket size, settings that in turn define many other performance
	characteristics of the filter.
	Considering Redis' nature and common usage patterns,
	it's very important to keep the memory layout efficient in terms of 
	access. This means having a few reasonably useful settings that allow
	keeping all buckets word or half-word aligned.
	A fpsize setting seems reasonable considering that users already
	have to implement a fingerprinting function and should be well aware
	of how long the fingerprint should be.
	With this new setting, users can create a filter with 1, 2 or 4 bytes
	long fingerprints which in turn cap the filter's error rate respectively
	to 0.03, 0.0001 and 0.000000001.
	The first error rate (0.03) is the point where Cuckoo filters become 
	more efficient than Bloom filters, the others follow from doubling the 
	fingerprint size. 
	For 1 and 2 fingerprint size, bucket size is 4.
	For 4, bucketsize is 2.
	This makes all bucket fetches to be performed in a single word read.

- Added reference python implementation and testing 
	Adding variable fingerprint size made the C code much more complex.
	To understand reasonably well what the code is supposed to be doing, I
	wrote a pure python implementation that is both very easy to understand
	and very slow. 
	The standard selftest suite has been also adapted to the python code
	and a "lockstep" testing methodology has devised: each test operation
	applied to the Redis module is also applied to the python instance and
	before proceeding to the next, both filters are serialized and confronted
	to make sure all operations produce byte-to-byte equal results.
	This test is very slow but ensures a reasonable certainty that the Python
	and C code behave consistently. This paired with the fact that the Python
	code is very easy to understand should help build and test new features in
	the future.

- Added fingerprint size option to `cf.init` After some reasoning, I concluded that having the user choose fingerprint size is the best choice in terms of API design. Users care about error rate but it is not possible to have them directly decide it as it is the result of choosing fingerprint size and bucket size, settings that in turn define many other performance characteristics of the filter. Considering Redis' nature and common usage patterns, it's very important to keep the memory layout efficient in terms of access. This means having a few reasonably useful settings that allow keeping all buckets word or half-word aligned. A fpsize setting seems reasonable considering that users already have to implement a fingerprinting function and should be well aware of how long the fingerprint should be. With this new setting, users can create a filter with 1, 2 or 4 bytes long fingerprints which in turn cap the filter's error rate respectively to 0.03, 0.0001 and 0.000000001. The first error rate (0.03) is the point where Cuckoo filters become more efficient than Bloom filters, the others follow from doubling the fingerprint size. For 1 and 2 fingerprint size, bucket size is 4. For 4, bucketsize is 2. This makes all bucket fetches to be performed in a single word read. - Added reference python implementation and testing Adding variable fingerprint size made the C code much more complex. To understand reasonably well what the code is supposed to be doing, I wrote a pure python implementation that is both very easy to understand and very slow. The standard selftest suite has been also adapted to the python code and a "lockstep" testing methodology has devised: each test operation applied to the Redis module is also applied to the python instance and before proceeding to the next, both filters are serialized and confronted to make sure all operations produce byte-to-byte equal results. This test is very slow but ensures a reasonable certainty that the Python and C code behave consistently. This paired with the fact that the Python code is very easy to understand should help build and test new features in the future.

redis-cuckoofilter

More Zig Projects

ghostty

zig

browser

roc

Variable Fingerprint Size & Better Testing

More Zig Projects

ghostty

zig

browser

roc