Description
Because of their utility, GPUs in a serverless setting are often in high demand and limited in supply, leading to overwhelmed systems. Users of these GPUs frequently operate under tight cost or latency constraints (for example, inference for self-driving cars). In this thesis, I explore scheduling policies that efficiently allocate GPU resources to requests based on user-provided Service Level Objectives (SLOs). I further consider a heterogeneous set of resources (both CPUs and GPUs) and explore how policies rooted in admission control can prevent the system from being overwhelmed.
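To make this policy space concrete, here is a minimal sketch of one possible point in it: an earliest-deadline-first queue combined with deadline-based admission control. The class, the fixed mean-service-time estimate, and the example SLO values are hypothetical illustrations, not the policies evaluated in this thesis.

    import heapq
    import time

    class SLOScheduler:
        """Toy earliest-deadline-first (EDF) queue with admission control.

        A request is admitted only if its SLO deadline looks feasible
        given the current backlog; otherwise it is rejected up front
        instead of being allowed to overwhelm the system.
        """

        def __init__(self, est_service_time=0.5):
            self.queue = []  # min-heap ordered by absolute deadline
            self.est_service_time = est_service_time  # assumed mean runtime (s)

        def submit(self, request_id, slo_seconds):
            deadline = time.time() + slo_seconds
            # Admission control: estimate completion as backlog * mean service time.
            est_finish = time.time() + (len(self.queue) + 1) * self.est_service_time
            if est_finish > deadline:
                return False  # reject: the SLO cannot plausibly be met
            heapq.heappush(self.queue, (deadline, request_id))
            return True

        def next_request(self):
            # Dispatch the admitted request whose deadline is soonest.
            return heapq.heappop(self.queue)[1] if self.queue else None

    sched = SLOScheduler()
    print(sched.submit("req-1", slo_seconds=2.0))  # True: admitted
    print(sched.submit("req-2", slo_seconds=0.1))  # False: backlog makes the SLO infeasible
    print(sched.next_request())                    # req-1

Even this toy version exposes the central tension: rejecting requests early protects the SLOs of admitted work at the cost of raw throughput.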
Exploring GPU scheduling in a FaaS setting exposes some inherent system limitations. Most existing solutions require that applications carefully design their tasks to manually share resources and clean up properly when they finish, imposing overhead on application developers and leaving room for inefficient resource utilization. Another challenge is that initializing a new accelerator resource incurs significant startup latency due to container and language runtime initialization. Recently, Nathan Pemberton proposed a new Kernel as a Service (KaaS) paradigm, in which the system is responsible for managing GPU memory and schedules user kernels across the entire pool of available GPUs rather than relying on static allocations. Because resources are managed at the system level in KaaS, a new set of challenges arises around request scheduling. I explore various scheduling policies and evaluate them on metrics such as cold-start minimization, average wait time, and fairness.
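As one illustration of the request-scheduling questions KaaS opens up, the following sketch places incoming kernels onto a shared pool of GPUs, preferring a GPU on which the kernel's state is already warm so that the container and runtime initialization penalty is paid as rarely as possible. The data structures and cost constants are hypothetical, chosen only to make the cold-start/load-balance trade-off concrete.

    COLD_START_COST = 5.0  # assumed container + language runtime init penalty (s)
    WARM_COST = 0.1        # assumed dispatch cost when kernel state is resident (s)
    SERVICE_TIME = 1.0     # assumed mean kernel runtime (s)

    class GPUPool:
        """Toy pool-level kernel placement in the spirit of KaaS.

        The system, not the application, decides which GPU runs each
        kernel, trading queueing delay against cold-start penalties.
        """

        def __init__(self, num_gpus):
            self.warm = [set() for _ in range(num_gpus)]  # kernels resident per GPU
            self.load = [0] * num_gpus                    # outstanding kernels per GPU

        def place(self, kernel_id):
            def cost(gpu):
                init = WARM_COST if kernel_id in self.warm[gpu] else COLD_START_COST
                return self.load[gpu] * SERVICE_TIME + init  # queueing + startup

            gpu = min(range(len(self.warm)), key=cost)
            self.load[gpu] += 1
            self.warm[gpu].add(kernel_id)  # kernel state stays resident for reuse
            return gpu

    pool = GPUPool(num_gpus=2)
    print(pool.place("matmul"))  # 0: cold start, the pool is empty
    print(pool.place("matmul"))  # 0: the warm GPU wins despite its queued work
    print(pool.place("conv"))    # 1: a new kernel goes to the idle GPU instead

Policies of this shape can then be scored on exactly the metrics above: how many placements incurred cold starts, how long requests waited, and whether any one kernel monopolized the warm capacity.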