Computers are powerful tools that perform fast, accurate calculations over huge sets of data. However, using computers for any given task requires many layers of abstraction. Recent advances in machine learning employ compute-intensive operations embedded in complex overall flows. Further, deployment of these systems must balance many concerns: accuracy, speed, energy, portability, and cost. Currently, a good implementation of the needed software layers for each target requires many programmer-years of effort. To address this, we explore new tools and methods to amplify programmer effort for machine learning applications. In particular, we focus on portability and speed for machine learning operations, algorithms, and flows, while maintaining accuracy and carefully controlling the complexity of the overall software system. First, we motivate our approach with a case study in developing libHOG, which provides high-speed primitives for calculating image gradient histograms; here, we achieve a 3.6X speedup over the state of the art. Next, in DenseNet, we enable multiscale sliding-window object detection using dense convolutional neural network features, a task that was previously prohibitively slow. Finally, we propose our Boda framework for implementing artificial neural network computations, based on metaprogramming, specialization, and autotuning. In Boda, we explore in depth the development of efficient convolution operations across various types of hardware. With only a few months of effort, we achieve speed within 2X of the highly tuned vendor library on NVIDIA Graphics Processing Units (GPUs). Further, in only a few weeks, we achieve up to 30% efficiency on Qualcomm mobile GPUs, for which no vendor library exists.