With the emergence of a plethora of Large Language Models (LLMs), the prospect of running language models locally at the edge has come closer with every passing day. However, far less work has addressed smaller language models that could solve tasks for which running a full LLM would be inefficient. In this paper, we explore Small Language Models (SLMs) and how to make them more efficient at the edge without sacrificing performance. Pruning or simplifying SLMs can significantly degrade downstream performance; to mitigate this, we investigate two avenues: weight reparameterization and knowledge distillation. We study the structure of the feed-forward network (FFN) module in the transformer architecture in order to improve inference speed on short-sequence-length tasks, and we examine whether knowledge can be distilled from LLMs into significantly smaller SLMs, taking advantage of the wealth of pretrained models available to the public. We find that when simplifying the FFN module, weight reparameterization at training time helps the model converge and improves downstream accuracy. We also find that knowledge distillation is not a surefire way to improve downstream performance, as the gap in model capacity between LLMs and SLMs can be difficult to overcome.
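The weight-reparameterization idea can be illustrated with a minimal sketch. This is not the paper's exact method; it assumes a RepVGG-style linear merge, where a layer trained with parallel linear branches (plus an identity shortcut) is collapsed into a single dense matrix at inference time, so the simplified FFN costs one matrix multiply instead of several. All names and shapes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # hidden size (arbitrary for this demo)
W1 = rng.standard_normal((d, d))     # main branch weights
W2 = rng.standard_normal((d, d))     # auxiliary branch weights

def train_time_ffn(x):
    """Multi-branch linear form used during training."""
    return x @ W1 + x @ W2 + x       # identity shortcut as a third branch

# Because every branch is linear, the three branches collapse exactly
# into a single weight matrix: W1 + W2 + I.
W_merged = W1 + W2 + np.eye(d)

def inference_ffn(x):
    """Single-matmul form used at inference."""
    return x @ W_merged

x = rng.standard_normal((4, d))
assert np.allclose(train_time_ffn(x), inference_ffn(x))
```

The extra branches give the optimizer a richer training-time parameterization (which is what helps convergence), while the merged form keeps inference as cheap as the simplified single-branch layer.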