Most of us engage with written language almost every day; however, if that language is too complex, it can become inaccessible to some readers. Non-native speakers, or people with conditions such as aphasia, dyslexia or autism, may struggle with a text that has tricky vocabulary or sentence structure, which creates the need for systems that can simplify text to make it more accessible.
In this post, I introduce the Multilingual Lexical Simplifier, or MILES for short (yes, the name was intentional). This was a project that I worked on back in 2020 whilst at university, to explore the potential for a text simplification app that could work with various languages.
For those interested, the project can be found on GitHub here.
What is Text Simplification?
Text Simplification is a natural language processing (NLP) task which aims to reduce the linguistic complexity of text to make it easier to understand, whilst retaining the original information and meaning.
Text Simplification can be split into two subtasks: Lexical Simplification, which aims to reduce complexity by replacing complex words with simpler synonyms; and Syntactic Simplification, which aims to reduce text complexity by altering its structure.
| | Original Text | Simplified Text |
| --- | --- | --- |
| Lexical Simplification | The ominous clouds engulfed the hill | The gloomy clouds covered the hill |
| Syntactic Simplification | The man, carrying numerous books, entered the room | The man entered the room. He was carrying numerous books. |
Examples of Lexical and Syntactic Simplification
The challenge for low-resource languages
There are over 7000 languages spoken around the world, but the majority of them are considered “low-resource” languages.
Within the field of NLP, a low-resource language is one that lacks the necessary linguistic resources required for certain NLP tasks, such as parallel datasets and manually crafted tools like WordNets. This presents a problem, as in Africa and Asia alone, there are over 3 billion speakers of low-resource languages, meaning that many people around the world cannot benefit from NLP applications—including Text Simplification.
Building a Multilingual Lexical Simplifier
The architecture for MILES is loosely based on LSBert, an approach to Lexical Simplification proposed in 2019 which uses the BERT language model to generate replacements for complex words. Instead of the standard BERT model, however, which only supports English, MILES makes use of Multilingual BERT, which supports 104 different languages.

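To make the candidate-generation step concrete, here is a minimal sketch of the LSBert-style idea: mask out the complex word and ask Multilingual BERT which tokens are most likely to fill the gap. This is not the MILES code itself; it assumes the Hugging Face `transformers` library and the `bert-base-multilingual-cased` checkpoint, and imports them lazily so that the helper functions can be used without downloading the model.

```python
MASK = "[MASK]"  # mask token used by BERT-style models

def mask_word(sentence: str, word: str) -> str:
    """Replace the first occurrence of the complex word with the mask token."""
    return sentence.replace(word, MASK, 1)

def generate_candidates(sentence: str, word: str, top_k: int = 10):
    """Ask Multilingual BERT for the most likely fillers of the masked slot.

    Imported lazily because the first call downloads the
    bert-base-multilingual-cased checkpoint.
    """
    from transformers import pipeline  # assumes `transformers` is installed

    fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
    predictions = fill(mask_word(sentence, word), top_k=top_k)
    # Each prediction carries the proposed token and its MLM probability.
    return [(p["token_str"], p["score"]) for p in predictions]

# Usage (downloads the model on first run):
# generate_candidates("The ominous clouds engulfed the hill", "engulfed")
# returns (token, probability) pairs for the masked position.
```

LSBert refines this further (for example, by also feeding the unmasked sentence as context), but the masked-prediction step above is the core of how a BERT-style model proposes replacement candidates.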
To determine the best replacement for a complex word, MILES ranks the potential candidates across four different attributes:
- Masked Language Model (MLM) Probability: The probability Multilingual BERT assigns to the candidate filling the original word's position in the given context
- Zipf Frequency: How frequent the candidate is in a large corpus, on the Zipf scale (the base-10 logarithm of its frequency per billion words)
- Cosine Similarity: The cosine similarity between the word embeddings for the original word and the candidate replacement
- APSynP Similarity: A novel approach for measuring the similarity between word embeddings (paper can be found here)
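As a rough illustration of how these attributes can be combined, here is a pure-Python sketch that ranks candidates under each feature and averages the ranks, which is the combination strategy LSBert uses. The feature scores in the example are made up for illustration, and the sketch assumes higher is better for every feature:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zipf_frequency(count, corpus_size):
    """Zipf scale: base-10 log of the word's frequency per billion words."""
    return math.log10(count / corpus_size * 1e9)

def rank_candidates(scores):
    """scores maps candidate -> list of feature scores (higher = better).

    Rank the candidates under each feature separately, then order them
    by their average rank across all features.
    """
    candidates = list(scores)
    n_features = len(next(iter(scores.values())))
    avg_rank = {c: 0.0 for c in candidates}
    for f in range(n_features):
        ordered = sorted(candidates, key=lambda c: scores[c][f], reverse=True)
        for rank, c in enumerate(ordered):
            avg_rank[c] += rank / n_features
    return sorted(candidates, key=lambda c: avg_rank[c])

# Toy example with invented scores: [MLM prob, Zipf, cosine, APSynP]
scores = {
    "gloomy":   [0.30, 4.1, 0.62, 0.55],
    "dark":     [0.25, 5.2, 0.58, 0.50],
    "menacing": [0.10, 3.2, 0.70, 0.60],
}
best = rank_candidates(scores)[0]  # → "gloomy"
```

Averaging ranks rather than raw scores sidesteps the problem that the four attributes live on very different scales (a probability, a log frequency, and two similarity measures).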