Description
Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this thesis, we discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/meta-data fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes. While most modern systems aim to scale dataframe workloads by changing properties of dataframes – or by adding new distributed systems knowledge requirements– we believe it is important to support scalable ops on dataframes without changing their semantics. This thesis takes a ground-up approach towards scaling dataframe systems,starting with a formal data model and algebra, and ending with a reference implementation. This implementation, Modin, has already accumulated a significant amount of community support: over 6,000 GitHub stars and over 1 million installs to date. This interest shows the need for systems that solve modern data science problems without changing semantics. Included in this thesis are several of our insights into how to build systems for data scientists and what data scientists prioritize. We believe these insights were instrumental in unlocking the interest and support from the community in our open source work.