Following the addition of French topical analysis to Relative Insight, we spoke to our Data Scientist, Daniel Jordan, to delve deeper into the challenges of building a language model.
Why did you decide to build your own framework, rather than use the widely available translation services?
DJ: French is one of the world’s most spoken languages. When you include second-language speakers, it is the fourth most spoken language in the world. It’s spoken in a wide variety of regions, such as Quebec in Canada, former French colonies such as Mali and Algeria, and French overseas territories such as Réunion and Martinique.
It’s the official language of the Olympics and is one of the primary languages of the European Union. It often serves as a unifying bridge between places and communities that may otherwise struggle to communicate effectively.
There is a substantial amount of information available in French, especially on an official level, and many valuable resources are accessible only in this language. Our customers need a comprehensive understanding of their data, and that includes whatever is written in French.
Let’s look at it from a data science perspective. Why does native language analysis have such an advantage over translation?
DJ: There are several things to consider, chiefly errors and misinterpretations.
A machine that translates from one language to another is going to miss certain subtleties of the language, especially when it comes to idioms.
For example, “sucrer les fraises,” which literally translates as “to sugar the strawberries,” actually means to have trembling hands. A literal translation loses the intended meaning, and the problem only grows once you run analysis on the translated text.
You will end up with topics that categorize the phrase into food or sweetness – but perhaps it is an indication of Parkinson’s disease or older age?
On the other hand, if you pay for human translation that is idiomatic and appropriate, it will be very expensive and time-consuming. It’s also difficult to judge whether a translation is accurate when you are not a native speaker yourself.
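To make the “sucrer les fraises” problem concrete, here is a minimal sketch of how a topic lexicon might tag the idiom natively versus after a literal translation. It is not Relative Insight’s implementation: the lexicon entries, topic labels, and the `tag_topics` function are all hypothetical, and a simple longest-match lookup stands in for whatever matching the real system uses.

```python
# Hypothetical topic lexicons; entries and labels are illustrative only.
FRENCH_LEXICON = {
    ("sucrer", "les", "fraises"): "health/tremor",  # idiom stored as one unit
    ("sucrer",): "food",
    ("fraises",): "food",
}
ENGLISH_LEXICON = {
    ("sugar",): "food",
    ("strawberries",): "food",
}


def tag_topics(tokens, lexicon):
    """Greedy longest-match lookup: multi-word entries win over single words."""
    max_len = max(len(key) for key in lexicon)
    topics = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in lexicon:
                topics.append(lexicon[key])
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return topics


native = tag_topics("il commence à sucrer les fraises".split(), FRENCH_LEXICON)
translated = tag_topics("he starts to sugar the strawberries".split(), ENGLISH_LEXICON)
print(native)      # ['health/tremor']
print(translated)  # ['food', 'food']
```

Because the native lexicon can hold the idiom as a single entry, the phrase is tagged once with its real meaning. After word-by-word translation the idiom no longer exists as a unit, so only the food-related words remain to be tagged.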
What leads to these translation errors?
DJ: Sometimes, a machine translation (MT) service will translate word by word rather than taking in the context of the whole sentence. This can lead to a disaster where “crème brûlée” is translated as “scorched cream”!
At Relative Insight, our native language analysis avoids these errors. It’s important to note that these types of issues are created by the act of translation. By just working with French natively, you avoid these problems entirely.
What are the benefits of working in a language natively?
DJ: When you conduct analysis in your native language, you can actually assess the results and determine whether they are appropriate. This fosters trust and confidence among our customers, as they can rely on the system to provide accurate information. It’s important to establish this trust and reliability by working with users in their native language, particularly in a tech industry that is often ethnocentric and prioritizes English and Western culture.
In terms of language capabilities, how is Relative Insight different from its competitors?
DJ: When it comes to language analysis, the power of comparison is still there. The math in the background establishes statistical significance, which gives you confidence in the findings being shown. That confidence is exactly as strong as it is for English.
We do not regard French as a secondary or lesser option; rather, we approach it with the same level of precision and scientific rigor that we apply to English, treating it as a native language in our handling and analysis. This applies to German and Spanish too!
Many text analytics software options may provide language choices, but often, they really just offer a translation service, treating non-English languages the same as English. It’s crucial to recognize that every language in the world is unique, with its own intricacies and idiosyncrasies. So, handling each language with the same level of detail and commitment is important – it is a complete product offering in another language.
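As an illustration of why a comparison backed by statistical significance carries over unchanged between languages, here is a small sketch of one widely used corpus-comparison statistic, the log-likelihood ratio (G2). The interview does not say which test Relative Insight uses, so this choice is an assumption, and the function name and counts are hypothetical.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """G2 statistic for one term's frequency in corpus A versus corpus B.

    count_*: occurrences of the term; total_*: total tokens in each corpus.
    """
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Illustrative counts: a term appears 120 times in 10,000 tokens of one
# data set and 60 times in 10,000 tokens of another.
score = log_likelihood(120, 10_000, 60_000 // 1000, 10_000)
# 3.84 is the chi-square critical value for p < 0.05 with 1 degree of freedom.
print(round(score, 2), score > 3.84)
```

The statistic only sees counts, so the same threshold applies whether the tokens are French, German, Spanish, or English. What differs per language is how the text is tokenized and mapped to topics before counting.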
Can you give us some background on how it was developed?
DJ: For me, one of the most intriguing aspects was the construction of the topic lexicon. It enables you to surface all the relevant topics during language analysis and conduct comparisons. I found myself examining the semantic lexicon really carefully, trying to moderate content and identify mistakes.
For example, there were instances where words like “gay” or “lesbian” were incorrectly categorized under unethical or religious contexts. Rectifying these mistakes is gratifying because it makes you feel like you’re doing a diligent job.
We also encounter more philosophical challenges. For example, distinguishing between topics related to sexuality and those falling under LGBT categories can be intricate.
Additionally, distinguishing between actions considered unethical and those classified as criminal can be difficult, given the variations in legal frameworks between countries. What constitutes a crime in one jurisdiction may not hold the same status in another.
Furthermore, ethics often intertwine with cultural norms, so we’ve been building an understanding of these nuances and then trying to find the appropriate label within the lexicon to categorize them.
For the users, what is the impact of enabling native precision text analysis?
DJ: Establishing trust and reliability is so important when engaging with users in their native language, especially in the tech industry, which frequently places Western culture at the forefront of priorities.
With over 300 million French speakers, it’s genuinely motivating to know that we’re extending our reach to communities that might otherwise lack accessible tools in their language, especially since so many people rely on translation to be able to work with English-language tooling.