Automatic speech recognition is a key technology for enabling rich human-computer interaction in emerging applications. Hidden Markov Model (HMM) based recognition approaches are widely used for modeling the human speech process by constructing probabilistic estimates of the underlying word sequence from an acoustic signal. High-accuracy speech recognition, however, requires complex models, large vocabulary sizes, and exploration of a very large search space, making the computation too intense for current personal and mobile platforms. In this paper, we explore opportunities for parallelizing the HMM based Viterbi search algorithm typically used for large-vocabulary continuous speech recognition (LVCSR), and present an efficient implementation on current many-core platforms. For the case study, we use a recognition model of 50,000 English words, with more than 500,000 word bigram transitions, and one million hidden states. We examine important implementation tradeoffs for shared-memory single-chip many-core processors by implementing LVCSR on the NVIDIA G80 Graphics Processing Unit (GPU) in Compute Unified Device Architecture (CUDA), leading to significant speedups. This work is an important step forward for LVCSR-based applications to leverage many-core processors in achieving real-time performance on personal and mobile computing platforms.