A theoretical and empirical study of affirmative sampling
Tutor / Supervisor
Student
Montes Sanabria, Jordi
Document type
Master thesis
Date
2020
rights
Restricted access - author's decision
Publisher
Universitat Politècnica de Catalunya
UPCommons
Abstract
Distinct random sampling is widely used in many applications due to its ability to answer aggregate queries on data with provable probabilistic guarantees on the quality of the answer. We consider the data stream model where a single pass over the data is allowed and only a small number of elements in the stream can be stored. We analyze Affirmative Sampling, an algorithm for distinct random sampling with adaptive sample size. We discuss an unbiased cardinality estimator of the stream using the elements in the random sample from Affirmative sampling and provide bounds for its accuracy. Random samples can be used for much more than just cardinality estimation. We introduce Nooh, a fast genome distance estimation tool that uses Affirmative sampling to compete with the current state-of-the-art tools in bioinformatics.
