In this thesis, we apply the pattern recognition and data processing strengths of machine learning to traffic analysis. Traffic analysis uses observable features of encrypted traffic to infer its plaintext contents. We apply a clustering technique to HTTPS-encrypted traffic on websites covering medical, legal, and financial topics and achieve accuracy rates ranging from 64% to 99% when identifying traffic within each website. The total number of URLs considered on each website ranged from 176 to 366. We present our results along with a justification of the machine learning techniques employed and an evaluation that explores how accuracy varies with the amount of training data, the number of clustering algorithm invocations, and the convergence threshold. Our technique represents a significant improvement over previous techniques, which achieved similar accuracy only with the aid of supporting assumptions that simplify traffic analysis. We examine these assumptions more closely and present results suggesting that two of them, browser cache configuration and the selection of webpages for evaluation, can have considerable impact on analysis results. Additionally, we propose a set of minimum evaluation standards to improve the quality of traffic analysis evaluations.



