Machine learning (ML) is a cornerstone of the new data revolution. Most attempts to scale machine learning to massive datasets focus on parallelization on computer clusters. The BIDMach project instead explores the untapped potential (especially from GPU and SIMD hardware) inside individual machines. Through careful codesign of algorithms and "rooflining", we have demonstrated multiple orders of magnitude speedup over other systems. In fact, BIDMach running on a single machine exceeds the performance of cluster systems on most common ML tasks, and has run computer-intensive tasks on 10-terabyte datasets. We can further show that BIDMach runs at close to the theoretical limits imposed by CPU/GPU, memory or network bandwidth. BIDMach includes several innovations to make the data modeling process more agile and effective: likelihood "mixins" and interactive modeling using Gibbs sampling. These results are very encouraging but the greatest potential for future hardware-leveraged machine learning appears to be on MCMC algorithms: We can bring the performance of sample-based Bayesian inference up close to symbolic methods. This opens the possibility for a general-purpose "engine" for machine learning whose performance matches specialized methods. We demonstrate this approach on a specific problem (Latent Dirichlet Allocation), and discuss the general case. Finally we explore scaling ML to clusters. In order to benefit from parallelization, rooflined nodes require very high network bandwidth. We show that the aggregators (reducers) on other systems do not scale, and are not adequate for this task. We describe two new approaches, butterfly mixing and "Kylix" which cover the requirements of machine learning and graph algorithms respectively. We give roofline bounds for both approaches.