Markov Decision Processes (MDPs) have been the dominant formalism for mathematically stating and investigating the problem of reinforcement learning. Classical algorithms such as value iteration and policy iteration compute optimal policies in time polynomial in the size of the MDP's description. This is adequate for small problems, but it becomes impractical for real-world MDPs, where the number of states can be enormous or even infinite. Another drawback is that these algorithms assume the MDP parameters are known precisely. To quantify learning in an unknown MDP, the notion of regret has been defined and studied in the literature.
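As a minimal sketch of the classical setting, the following implements value iteration on a toy MDP. All numbers are illustrative, not taken from the dissertation; the point is only that the cost of each Bellman backup grows with the size of the tabular description of the MDP.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s'] is the probability of moving to s' after taking a in s;
# R[s, a] is the expected immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor


def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality operator until the values stop changing.

    Each sweep touches every (state, action, next-state) triple, which is
    why the running time is polynomial in the tabular description of the MDP.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)           # greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # values and a greedy policy
        V = V_new
```

Because the backup is a gamma-contraction, the loop converges geometrically; the returned policy is greedy with respect to the (near-)fixed point of the Bellman operator.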
This dissertation consists of two parts. In the first part, we study two methods that have been proposed to handle large MDPs: PEGASUS, a policy search method that uses simulators, and approximate linear programming, a general methodology that seeks a good policy by solving linear programs of reasonable size. We give performance bounds for the policies produced by these methods. In the second part, we study the problem of learning an unknown MDP. We begin with bounded parameter MDPs, which arise when a confidence interval is associated with each MDP parameter. Finally, we give a new algorithm that achieves logarithmic regret in an irreducible but otherwise unknown MDP; this rate is provably optimal up to a constant factor.
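As a rough sketch of the quantity involved (notation assumed here, not taken from the dissertation), regret in the average-reward setting is commonly defined by comparing the reward actually collected over $T$ steps against what the optimal policy would have earned:

```latex
% Regret after T steps of a learning algorithm in an unknown MDP:
%   \lambda^* = optimal long-run average reward of the MDP,
%   r_t       = reward received at step t.
\mathrm{Regret}(T) \;=\; T\,\lambda^{*} \;-\; \sum_{t=1}^{T} r_t
```

A logarithmic-regret guarantee states that $\mathrm{Regret}(T) = O(\log T)$, i.e. the per-step cost of learning vanishes at rate $O((\log T)/T)$.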