This thesis presents Ressort, an auto-tuning framework for computational patterns that perform any kind of data-dependent data reordering or transformation. These programs, which we call shuffle kernels, account for large fractions of the runtime of database workloads and other applications. Hardware-conscious optimizations of shuffle kernels are all alike in that they generally consist of choosing one of many possible ways to decompose a particular kernel into a pipeline of more primitive operations: a sort might consist of a hash-based partitioning followed by an in-cache quicksort, or a join might entail sorting and then merging, for example. Ressort presents a domain-specific language (DSL) that enables the succinct expression of these compositions while also exposing the nested array-style parallelism that they imply, and supplies a compiler to exploit it. It also includes a parallel C-like intermediate representation that assists in the generation of performant shuffle kernel implementations. We present the design and implementation of these two language layers, and evaluate the performance of their output on a variety of hardware targets under different algorithmic requirements.