What are the 2 forms of queries for a data stream system?
What is the stream model?
A model of processing data where input elements enter at a rapid rate at one or more input ports
What types of problems can be solved using data streams?
What is the sliding window?
An approach for answering data stream queries which ask about a window of the N most recent elements received
What is the issue that can occur with sliding window in terms of data size?
What is the issue with attempting to count the number of 1’s in a sliding window and how can we resolve it?
The issue is that we cannot afford to store N bits. We resolve this by using methods that provide us an approximate answer
What is the uniformity assumption for counting 1’s?
Assumes that 1’s and 0’s are distributed uniformly across the window so we can use earlier results to estimate
What is the equation when using the uniformity assumption for counting 1’s?
N * (S/(S+Z)) where S is the number of 1s from the beginning of the stream and Z is the number of zeros from the beginning of the stream
What is the DGIM method?
A method for counting 1’s that does not assume uniformity. It is never off by more than 50%
What is a DGIM bucket?
A record consisting of the timestamp of its end and the number of 1s between its beginning and its end. The DGIM method breaks the window into buckets where each bucket holds a number of 1s which is a power of 2
How can we represent a stream using DGIM buckets?
Start at the most recent elements. Create 1 or 2 buckets with the same power-of-2 number of 1s. The buckets do not overlap and buckets get increasingly bigger as we move to the left to less recent elements. Buckets disappear when their end-time is more than N time units in the past
How do we update DGIM buckets when a new bit arrives to the window?
How can we estimate the number of 1’s in the most recent N bits using DGIM?
How can we estimate the number of 1’s in the most recent k bits where k < N?
What is the goal of filtering data streams?
Determining which tuples of the stream are in a given list of keys S
Why can a hash table not be used to filter a data stream?
We do not have enough memory to store the entire list of keys we are looking for as we could be processing millions of filters on the same stream
What is a bloom filter?
A technique for filtering data streams made up of an array of bits with a number of hash functions
How is a bloom filter used to detect if elements have been seen before?
What is the issue with a bloom filter?
It can give false positives as hash collisions can occur
What is the probability of a false positive in a bloom filter?
Depends on density of 1’s and the number of hash functions
=(fraction of 1’s)^# of hash functions
What is the goal of estimating moments of a data stream?
Estimating summary statistics like average and standard deviation of the last k elements
What is the formula for the kth moment?
Sum of all (mi)^k for all i in A where A is a set of N values and mi is the number of times i occurs in the stream
What are the special case moments?
What is AMS?
A method for estimating moments of a stream