The combined growth of data science and big datasets has increased the performance requirements for running data analysis experiments and work ows. However, popular data science toolkits such as Pandas have not adapted to the technical demands of modern multi- core, parallel hardware. As such, data scientists aiming to work with large quantities of data nd themselves either sunullering from libraries that under-utilize modern hardware or being forced to use big data processing tools that do not adapt well to the interactive nature of exploratory data analyses. In this report we present the foundations of Ray DataFrames, a library for large scale data analysis. Ray DataFrames emphasizes performant, parallel execution on big datasets previously deemed unwieldy for existing popular toolkits, all while importantly maintaining an interface and set of semantics similar to existing interactive data science tools. The experiments presented in this report demonstrate promising results towards developing a new generation of performant data science tools built for parallel and distributed modern hardware.




Download Full History