5002 Data Stream Memo
Ideas
$\epsilon$-deficient synopsis
- Condition 1: there are no false negatives (every truly frequent item is reported)
- Condition 2: the estimated frequency differs from the true frequency by at most $\epsilon N$ (the error rate)
- Condition 3: items with true frequency less than $(s-\epsilon)N$ are classified as infrequent (even with maximum error they stay below the threshold)
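A quick numeric check of these conditions (the concrete numbers are my own, not from the lecture): with $N = 1000$, $s = 0.1$, and $\epsilon = 0.01$, every item with true frequency $\geq sN = 100$ must be reported, every reported estimate is off by at most $\epsilon N = 10$, and any item with true frequency below $(s-\epsilon)N = 90$ must be classified as infrequent.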
Kind
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm
Sticky Sampling Algorithm
properties
- probabilistic (uses random sampling)
- confidence parameter $\delta$ (the guarantee holds with probability at least $1-\delta$)
- support threshold $s$ and error parameter $\epsilon$
bucket design
$t = \lceil \frac{1}{\epsilon} \ln(s^{-1}\delta^{-1}) \rceil$ (determines the size of each bucket)
1st bucket: items 1 ~ 2t, size = 2t, sampling rate r = 1
2nd bucket: items 2t+1 ~ 4t, size = 2t, r = 2
3rd bucket: items 4t+1 ~ 8t, size = 4t, r = 4
… (each later bucket doubles: size = 8t with r = 8, and so on)
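As a sanity check on these formulas (the parameter values are my own): with $\epsilon = 0.001$, $s = 0.01$, $\delta = 0.01$ we get $t = \lceil 1000 \ln(10000) \rceil = 9211$, so the 1st bucket covers items 1 ~ 18422 at $r = 1$ and the 2nd covers items 18423 ~ 36844 at $r = 2$.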
algorithm
![[Pasted image 20250516204827.png]]
- In the 1st bucket (r = 1), simply absorb every item into the entry table.
- When changing to the 2nd bucket, toss coins for each entry and diminish its counter (deleting entries that reach 0) to reduce the memory used; the expected number of entries drops from 2t to 2t / 2 = t.
- In the 2nd bucket, a coin toss decides whether a new item gets an entry (upper bound on newly added entries: 2t * 1/2 = t, on top of the ~t surviving entries).
- When changing to the 3rd bucket, entries are thinned again; the entries surviving from the 1st bucket drop to about 2t * 1/4 = t / 2.
- In the 3rd bucket, upper bound on newly added entries: 4t * 1/4 = t.
- Output: report every entry with $f + \epsilon N \geq sN$ (the stored count $f$ may undercount the true frequency by up to $\epsilon N$).
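A minimal Python sketch of these steps, assuming a plain dictionary as the entry table; the class and method names (`StickySampling`, `process`, `output`) are mine, not from the lecture:

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling: entry table maps element -> counter f."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        # t = ceil(1/eps * ln(1/(s*delta)))
        self.t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
        self.rate = 1                 # current sampling rate r
        self.boundary = 2 * self.t    # end of the current bucket (2t, 4t, 8t, ...)
        self.n = 0                    # stream items seen so far
        self.counts = {}              # element -> counter f

    def process(self, e):
        self.n += 1
        if self.n > self.boundary:    # entering the next bucket: double the rate
            self.rate *= 2
            self.boundary *= 2
            self._resample()          # expected to roughly halve the stored entries
        if e in self.counts:
            self.counts[e] += 1       # existing entries always count the item
        elif random.random() < 1 / self.rate:
            self.counts[e] = 1        # new entries are created with probability 1/r

    def _resample(self):
        # For each entry, toss an unbiased coin until heads, decrementing the
        # counter once per tails; drop entries whose counter reaches 0.
        for e in list(self.counts):
            while random.random() < 0.5:
                self.counts[e] -= 1
                if self.counts[e] == 0:
                    del self.counts[e]
                    break

    def output(self):
        # Report entries whose (possibly undercounted) frequency can reach sN.
        return [e for e, f in self.counts.items()
                if f + self.eps * self.n >= self.s * self.n]
```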
feature
- $\epsilon$-deficient synopsis with probability at least $1-\delta$
- at most $2t$ entries used in expectation
Lossy Counting Algorithm
properties
- uses fixed-size buckets (deterministic)
- support threshold $s$ and error parameter $\epsilon$
- bucket width $w = \lceil \frac{1}{\epsilon} \rceil$
- buckets are indexed starting from 1, so the current bucket id is $b_\text{current} = \lceil\frac{N}{w}\rceil$
algorithm
![[Pasted image 20250516210217.png]]
- In the 1st bucket, we just insert: a new element gets an entry $(e, f = 1, \Delta = b_\text{current} - 1)$; existing entries simply increment $f$.
- When changing bucket, remove every entry with $f + \Delta \leq b_\text{current}$.
- In the 2nd bucket, we insert the same way; newly created entries get the new $\Delta = b_\text{current} - 1$.
- At every later bucket boundary, we remove with the same rule…
- Output: report every entry with $f + \epsilon N \geq sN$ (the stored count $f$ may undercount the true frequency by up to $\epsilon N$).
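A minimal Python sketch of Lossy Counting under the same assumptions (dictionary entry table; the names `LossyCounting`, `process`, `output` are mine):

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting: entry table maps element -> [f, delta]."""

    def __init__(self, s, eps):
        self.s, self.eps = s, eps
        self.w = math.ceil(1 / eps)   # bucket width
        self.n = 0                    # stream items seen so far
        self.entries = {}             # element -> [f, delta]

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)   # id of the current bucket
        if e in self.entries:
            self.entries[e][0] += 1
        else:
            # a new entry may have been missed in earlier buckets: delta = b_current - 1
            self.entries[e] = [1, b_current - 1]
        if self.n % self.w == 0:                 # bucket boundary: prune
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > b_current}

    def output(self):
        # f undercounts the true frequency by at most eps * n
        return [e for e, (f, d) in self.entries.items()
                if f + self.eps * self.n >= self.s * self.n]
```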
feature
- 100% $\epsilon$-deficient synopsis
- at most $\lceil \frac{1}{\epsilon} \log(\epsilon N) \rceil$ entries (grows with $N$, the amount of data read so far)
Space-Saving Algorithm
properties
- only a support threshold $s$ and a memory parameter $M$ (the maximum number of entries that can be stored in memory)
algorithm
![[Pasted image 20250516210935.png]]
- When the memory is full and a new element $e$ arrives without an entry, remove the entry with the smallest $p_e = f + \Delta$ and insert the newcomer as $(e, 1, p_e)$.
- Output: report every entry with $f + \Delta \geq sN$ (the estimate $f + \Delta$ may overcount the true frequency).
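A minimal Python sketch of Space-Saving; the linear scan to find the smallest $f + \Delta$ is for clarity only (a real implementation would keep a min-heap or similar structure), and the names are mine:

```python
class SpaceSaving:
    """Sketch of Space-Saving: at most M entries, element -> [f, delta]."""

    def __init__(self, s, M):
        self.s, self.M = s, M
        self.n = 0            # stream items seen so far
        self.entries = {}     # element -> [f, delta]

    def process(self, e):
        self.n += 1
        if e in self.entries:
            self.entries[e][0] += 1
        elif len(self.entries) < self.M:
            self.entries[e] = [1, 0]          # free slot: exact count so far
        else:
            # memory full: evict the entry with the smallest p = f + delta
            # and give the newcomer that p as its error bound
            victim = min(self.entries, key=lambda k: sum(self.entries[k]))
            p = sum(self.entries[victim])
            del self.entries[victim]
            self.entries[e] = [1, p]

    def output(self):
        # f + delta overcounts the true frequency; report possibly frequent items
        return [e for e, (f, d) in self.entries.items()
                if f + d >= self.s * self.n]
```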
feature
- maximum error rate: $E \leq 1/M$ (any count is overestimated by at most $N/M$)
- if $E \leq \epsilon$ (i.e. $M \geq 1/\epsilon$), an $\epsilon$-deficient synopsis is guaranteed
- memory consumption: at most $M$ entries