Bloom Filter

布隆过滤器，Bloom Filter，主要用于判断海量数据中是否存在某一个数据

布隆过滤器特别适用于海量数据中是否存在某一个数据，但有一定的误差，需要根据实际情况调整

布隆过滤器保证一个数据一定不存在，但不能保证一个数据是否存在

你永远可以相信布隆

原理

布隆过滤器使用多个hash函数，对于一个数据，将它映射到两个不同的地址上，并将该位置为1。多个数据映射可能会到同一个位上，这就造成了误差。

我们举一个例子，比如这是一个长为8的数组

1 2	index: 0 1 2 3 4 5 6 7 val: 0 0 0 0 0 0 0 0

假如插入数字3和6，并且最后根据两个hash函数分别映射到了0，1和0，5，那么这个数组就变成了

1 2	index: 0 1 2 3 4 5 6 7 val: 1 1 0 0 0 1 0 0

这时候，我们查询3是否在这个集合中，根据hash函数，可以计算出我们应该查找0和1这两个槽，发现他们都置为1，因此3存在于这个集合之中

但假如我们查询一个数字7是否在集合之中，假设7最后映射到了1，5而且数组中1，5都为1，那么尽管7不在数组中，依旧会返回true（即7在数组中）

除此之外，可能会存在多个数字某个hash函数映射到同一位上的情况，比如上文你无法判断数字7是否在这个数组之中

相比较而言，hashmap对所有冲突的情况，使用了红黑树和链表来解决冲突，保存信息，而布隆过滤器却没有使用

如果我们使用布隆过滤器的话，比如判断URL是否存在，就可以轻松保存在内存中了

当然，关于add的数目和size之间，也要需要慎重考虑，避免误差过大，影响正常使用

误差分析

布隆过滤器有一个可预测的误判率（FPP）：

n是已经添加的元素数量
k是hash的次数（下方代码为2）
m是布隆过滤器的长度（下方代码为size*32）

极端情况下，当布隆过滤器所有的位全都置1，则每次返回都是true

代码实现

使用Java语言编写

为了便于位运算，仿照hashmap使用了tableSizeFor函数，返回大于等于cap的最小2的n次方的值

使用了bitmap实现布隆过滤器，其实也可以使用boolean实现

代码见下

public class Bloom_Filter {
	private static final int MAXIMUM_CAPACITY = 32768;
	private int size;
	private int[] arr;

	public Bloom_Filter(int size) {
		size = tableSizeFor(size);
		arr = new int[size];
		this.size = size;
	}

	private int tableSizeFor(int cap) {
		int n = cap - 1;
		n |= n >>> 1;
		n |= n >>> 2;
		n |= n >>> 4;
		n |= n >>> 8;
		n |= n >>> 16;
		return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
	}

	public void add(Integer val) {
		int index;
		int bit;

		index = hash1(val) & (size * 32 - 1);
		bit = index & ((1 << 5) - 1);
		index = index >> 5;
		arr[index] |= (1 << bit);

		index = hash2(val) & (size * 32 - 1);
		bit = index & ((1 << 5) - 1);
		index = index >> 5;
		arr[index] |= (1 << bit);
	}

	public int hash1(Integer val) {
		int hash = val * 2017;//找了一个质数，下同
		return hash;
	}

	public int hash2(Integer val) {
		int hash = val * 2027;
		return hash;
	}

	public boolean contains(Integer val) {
		int index;
		int bit;

		index = hash1(val) & (size * 32 - 1);
		bit = index & ((1 << 5) - 1);
		index = index >> 5;
		boolean hash1 = ((arr[index] & (1 << bit)) != 0);

		index = hash2(val) & (size * 32 - 1);
		bit = index & ((1 << 5) - 1);
		index = index >> 5;
		boolean hash2 = ((arr[index] & (1 << bit)) != 0);

		return hash1 & hash2;
	}

}