• michaelthould

State-of-the-Art Language Modelling Results are Possible with Simple Architectures

Updated: May 30, 2019

Statistical language models assign probabilities to sequences of words. These models are finding increasing use as natural language processing applications become more widespread. A wide range of applications, such as speech recognition, machine translation, part-of-speech tagging, chatbots, handwriting recognition, and information retrieval, use language models.

Today, language models enable users to ask Siri where the nearest restaurant is, or walk into a dark kitchen and ask Alexa to switch on smart lights. Google recently demoed an AI agent that called businesses to book appointments. Language modelling is turning what was once science fiction into reality.

The models operate by predicting the next token from the tokens that precede it. Modelling natural language is inherently more complex than modelling formal or programming languages, where word usage can be precisely defined. Natural languages are not designed and do not have a formal specification. They have large numbers of terms and multiple ways to use each of them, and the resulting ambiguity is a challenge for machine learning.
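To make next-token prediction concrete, here is a minimal sketch of a count-based bigram model, which conditions only on the single preceding token. It is far simpler than the neural models discussed in this post; the toy corpus and the function names `train_bigram` and `next_token_probs` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Normalise the counts observed after `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}


# Toy corpus (an assumption for this example, not data from the post).
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" follows "the" twice, "mat" once
```

A neural language model replaces the count table with a learned function of a much longer context, but the output has the same shape: a probability for every candidate next token.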

Language models can be classified as either character-level or word-level. Character-level models have the advantage of a far smaller vocabulary, and therefore lower memory requirements (e.g., 26 letters in the English alphabet, compared with 171,476 words in use according to the Oxford Dictionary). However, they must process much longer sequences to cover the same text, which leaves them more exposed to the vanishing gradient problem encountered by neural networks.
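The trade-off between the two granularities can be seen by tokenizing the same text both ways. This is a rough sketch on a made-up sentence; real systems use more careful tokenization than `str.split`.

```python
# Toy sentence (an assumption for this example).
text = "the cat sat on the mat and the cat slept"

char_tokens = list(text)    # one token per character, spaces included
word_tokens = text.split()  # one token per whitespace-separated word

char_vocab = sorted(set(char_tokens))
word_vocab = sorted(set(word_tokens))

print("char-level:", len(char_vocab), "symbols,", len(char_tokens), "tokens")
print("word-level:", len(word_vocab), "symbols,", len(word_tokens), "tokens")
```

On a sentence this short the vocabularies are of similar size, but the character vocabulary stays bounded by the alphabet as the corpus grows, while the word vocabulary keeps growing. The character sequence, on the other hand, is several times longer for the same text.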


The vanishing gradient problem is a machine learning issue that arises when training artificial neural networks with gradient-based learning methods and backpropagation. At each iteration, each of the network's weights receives an update proportional to the partial derivative of the error function with respect to that weight. In some cases, the gradient becomes so vanishingly small that the weight effectively cannot change its value.
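A small numerical sketch shows how this happens in a recurrent setting. It assumes a deliberately simplified scalar recurrence, h_t = tanh(w * h_{t-1}), and tracks how much the final state depends on the initial one. The weight `w = 0.9` and the function name are illustrative choices, not values from any particular model.

```python
import math


def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return grad


for steps in (1, 10, 50, 100):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

Each step multiplies the gradient by a factor smaller than one, so the signal reaching early timesteps shrinks roughly exponentially with sequence length. This is why long sequences, such as character-level input, are the harder case.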

Recent research has shown that LSTMs or QRNNs can be tuned to achieve state-of-the-art results on both character-level and word-level datasets using modern GPUs. A recurrent neural network (RNN) exhibits temporal dynamic behavior because the connections between its nodes form a directed graph along a temporal sequence, so each step has to wait for the previous one. A quasi-recurrent neural network (QRNN), by contrast, allows most of its calculations to run in parallel across timesteps.
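The split between parallel and sequential work in a QRNN can be sketched in NumPy. This is a simplified single layer under stated assumptions: a convolution window of width 2, the "fo-pooling" variant, random untrained weights, and names such as `qrnn_layer` chosen for this example. It is not a tuned implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def qrnn_layer(x, Wz, Wf, Wo):
    """x: (T, d_in); each W: (2 * d_in, d_hidden). Returns states of shape (T, d_hidden)."""
    T, d_in = x.shape
    # Width-2 causal window: pair each input with the one before it (zeros before t=0).
    x_prev = np.vstack([np.zeros((1, d_in)), x[:-1]])
    window = np.hstack([x_prev, x])  # (T, 2 * d_in)

    # Candidate and gate values for ALL timesteps in three matrix products.
    # Nothing here depends on a previous hidden state, so it parallelises.
    z = np.tanh(window @ Wz)
    f = sigmoid(window @ Wf)
    o = sigmoid(window @ Wo)

    # Only this cheap elementwise pooling loop is sequential.
    c = np.zeros(Wz.shape[1])
    h = np.empty((T, Wz.shape[1]))
    for t in range(T):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        h[t] = o[t] * c
    return h


rng = np.random.default_rng(0)
T, d_in, d_hidden = 6, 4, 3
x = rng.normal(size=(T, d_in))
Wz, Wf, Wo = (rng.normal(size=(2 * d_in, d_hidden)) for _ in range(3))
h = qrnn_layer(x, Wz, Wf, Wo)
print(h.shape)
```

In a standard RNN or LSTM the matrix products themselves sit inside the time loop, because the gates depend on the previous hidden state. Moving them outside the loop is what lets a QRNN make better use of a GPU.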


This parallelism results in improved performance, making QRNNs well suited to applications such as language modelling. LSTMs display a further improvement over QRNNs in language modelling. Because an LSTM stores information in a memory cell, it can remember inputs over time, similar to the memory functions of a computer. An LSTM is therefore able to read, write, and delete the information it stores. In conclusion, LSTMs and QRNNs can deliver state-of-the-art results with word-level or character-level models without relying on complex architectures.
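The read, write, and delete operations described above correspond to the three gates of an LSTM cell. The following is a minimal NumPy sketch of one step of the standard LSTM formulation; the weight layout, the function name `lstm_step`, and the hand-set biases in the demo are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (d_in + d_hidden, 4 * d_hidden), b: (4 * d_hidden,)."""
    d = h_prev.shape[0]
    gates = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(gates[0 * d:1 * d])  # input gate: how much to write
    f = sigmoid(gates[1 * d:2 * d])  # forget gate: how much to keep (the rest is deleted)
    o = sigmoid(gates[2 * d:3 * d])  # output gate: how much to read out
    g = np.tanh(gates[3 * d:4 * d])  # candidate values to write
    c = f * c_prev + i * g           # updated memory cell
    h = o * np.tanh(c)               # exposed hidden state
    return h, c


# Demo with hand-set biases: close the input gate and open the forget gate,
# so the memory cell is carried forward unchanged.
d_in, d_hidden = 3, 2
x = np.ones(d_in)
h_prev = np.zeros(d_hidden)
c_prev = np.array([0.7, -0.3])
W = np.zeros((d_in + d_hidden, 4 * d_hidden))
b = np.zeros(4 * d_hidden)
b[0:d_hidden] = -50.0             # input gate ~ 0: write nothing
b[d_hidden:2 * d_hidden] = 50.0   # forget gate ~ 1: keep everything
h, c = lstm_step(x, h_prev, c_prev, W, b)
print(c)
```

Because the cell state is updated additively and the forget gate can stay close to one, information and gradients can survive across many timesteps, which is how the LSTM mitigates the vanishing gradient problem.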


Contact Fusion Professionals and let’s discuss how we can help you explore and deploy these research-based language modelling applications, as well as a range of other best-in-class data analytics technologies that can further enhance your company’s business intelligence.
