Filtered-Space Saving Algorithm

2022/04/11

Filtered-Space Saving Algorithm

Space-Saving algorithm

基本思想是维护预定数量的 m 个元素及其关联的计数器。每个计数器都会更新，以反映一个元素的最大可能出现次数，并且这种估算方式可能存在误差。

k 个元素，初始化计数器为 0 。
准确计算前 k 个不同的元素。
碰到下一个新元素：
- 如果已经在集合中，计数器累加。
- 如果不在集合中，替换计数最少的元素，计数器累加。当替换一个元素时，可以将旧的计数器保留，作为“高估” （overestimate）值

算法分析

最小计数器的值为 min ，min 不会超过 n/k 。
未计数的元素真值介于 0 和 min 之间。
任何真值大于 n/k 的元素会被保留。
所以最终元素集合中元素计数 > n/k , 出错计数 <= n/k

实验笔记

这是数据流方向中研究最多的问题之一， Space-Saving 在众多算法中有最好的性能。为了节省空间，还有许多实现细节：用哈希表检查元素是否存在于集合中？用堆找出当前最小的值？等等。准确度和内存使用上取得良好平衡：在现实数据上要到达 100% 准确度仅仅只要一点额外的空间（通常是 1.1 倍的 K）

拓展

元素带有权值 Weight items (find items whose total weight is high)
支持删除元素（权值为负数）Support deletion of items (negtive weight)
方差大 Heavy Changers (which items heve largest change)
离散度大 Distinct heavy hitters (which sources contact the most distinct addresses)
权值随时间衰减 Time Decay (weight of items decay with age)

额外思考

最新的研究领域（过去 20年的），越来越被重视的
时间和空间复杂度影响很大
不同的算法做出不同的权衡（空间 vs 时间 vs 准确性）

MORE

“Probabilistic Counting Algorithms for Database Applications” - Flajolet and Martin (1985)
“The space complexity of approximating the frequency moments“ - Alon, Noga; Matias, Yossi; Szegedy, Mario (1999)
Data Streams: Models and Algorithms - Charu C. Aggarwal • Implementations of heavy hitters algorithms (and more!) at: http://www.research.att.com/~marioh/frequentitems

参考