This week's post was written by Lance Nathan, Senior Linguistics Developer at Luminoso.
We have exciting news to share this week! Luminoso’s software has officially mastered Korean, in addition to already knowing eleven other languages, including Chinese, Arabic, Russian, and of course English. In other words, it can process and analyze unstructured data in Korean without needing to translate it to and from another language like English first – a rather impressive feat, and one that’s unusual in the world of multilingual data analytics.
But how, exactly, is it possible for software to “learn” a language well enough to be able to natively analyze any type of data that’s thrown at it?
There are three steps that our linguistics team here at Luminoso must follow when expanding the software to understand new languages: 1) assessment; 2) implementation; and 3) refinement.
In the first step, we decide whether it’s actually possible for our software to understand a language well enough to run analyses at the quality and accuracy we expect. There’s much more to it than just making sure that it’s a common enough language to be listed as an option on Google Translate. (We’ll save our opinions on the quality of those translations for another day.)
For us to be able to “teach” our software a new language, three resources must be available for that language:
- A ConceptNet database
- Data on word frequency
- A parser
A ConceptNet database
The first resource that must be available is a sufficient amount of data in that language in ConceptNet. We go into the details of what ConceptNet is in these blog posts, but simply put, it’s a collection of facts about how the world works (such as “the sun is hot,” “dogs and cats can both be pets,” etc.). This knowledge base gives our software the same understanding about the world that a human has when they enter a conversation. With this understanding, our software can more quickly learn what new words mean and can map out the relationships between different concepts in a data set with a much higher degree of accuracy and relevance.
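To make the idea of "a collection of facts" a little more concrete, here is a minimal Python sketch that stores the two example facts as (start, relation, end) triples and looks up what two concepts have in common. The relation names are modeled on ConceptNet's, but the tiny fact list and the helper functions `build_index` and `shared_facts` are invented for illustration; this is not how Luminoso's software or ConceptNet itself is implemented.

```python
from collections import defaultdict

# A few hand-written facts as (start, relation, end) triples.
# Illustrative only: a real knowledge base holds millions of these.
FACTS = [
    ("sun", "HasProperty", "hot"),
    ("dog", "IsA", "pet"),
    ("cat", "IsA", "pet"),
]


def build_index(facts):
    """Map each concept to the set of (relation, end) pairs known about it."""
    index = defaultdict(set)
    for start, relation, end in facts:
        index[start].add((relation, end))
    return index


def shared_facts(index, a, b):
    """Return the facts that concepts a and b have in common."""
    return index[a] & index[b]


index = build_index(FACTS)
common = shared_facts(index, "dog", "cat")  # both are known to be pets
```

Even in this toy form, the shared fact is what lets a system treat "dog" and "cat" as related concepts before it has seen either word in a new data set.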
Data on word frequency
Having data on word frequency in a given language is the next critical resource. This is basically a summary of which words are used most often and least often. Having this information enables the software to deprioritize very common words, which usually do not add much relevant information, and prioritize less common words. To use another English-language example, knowing word frequency helps the software understand that words like “know,” “want,” and “think” are not very meaningful on their own, and should not be presented as important in the data.
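As a rough illustration of how frequency data can deprioritize common words, the Python sketch below multiplies each word's count in a data set by a weight that shrinks as the word becomes more common in the language overall. The frequency figures are made up, and the scoring formula is a generic one chosen for the example, not the formula Luminoso uses.

```python
import math
from collections import Counter

# Hypothetical background frequencies (occurrences per million words).
# These numbers are invented for illustration only.
BACKGROUND_PER_MILLION = {
    "know": 1500.0,
    "want": 1200.0,
    "think": 1300.0,
    "battery": 20.0,
    "refund": 8.0,
}
DEFAULT_PER_MILLION = 1.0  # assumed rate for words missing from the table


def rarity_weight(word):
    """Return a weight that shrinks as the word gets more common."""
    per_million = BACKGROUND_PER_MILLION.get(word, DEFAULT_PER_MILLION)
    return math.log(1_000_000 / per_million)


def rank_terms(tokens):
    """Order terms by (count in the data) x (rarity weight), highest first."""
    counts = Counter(tokens)
    scores = {word: count * rarity_weight(word) for word, count in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


tokens = ["think", "battery", "want", "refund", "know", "battery"]
ranking = rank_terms(tokens)  # rarer words such as "battery" rise to the top
```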
In addition, every language has common function words that do not contribute meaning but only serve to structure sentences, which we designate as "stop words." In effect, we tell our software to ignore those words entirely when conducting an analysis. In English, our list of stop words includes conjunctions like “and” and “or,” articles like “the” and “a/an,” and pronouns like “you” and “they,” among a whole host of other very common but not-very-meaningful words. Telling our software to ignore such words helps it to more quickly home in on what’s truly relevant in a data set. Recognizing stop words, however, relies not just on their frequency but also on the results of the crucial third resource: the parser.
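The stop-word idea can be sketched in a few lines of Python. The example assumes the text has already been split into (word, part-of-speech) pairs; the tag names, the short word list, and the function `remove_stop_words` are all hypothetical stand-ins rather than Luminoso's actual lists or code.

```python
# Hypothetical tag names and word list; real parsers use their own tag sets.
STOP_TAGS = {"CONJ", "DET", "PRON"}
STOP_WORDS = {"and", "or", "the", "a", "an", "you", "they"}


def remove_stop_words(tagged_tokens):
    """Keep only tokens that are neither stop-tagged nor on the stop list."""
    return [
        word
        for word, tag in tagged_tokens
        if tag not in STOP_TAGS and word.lower() not in STOP_WORDS
    ]


tagged = [
    ("They", "PRON"), ("returned", "VERB"), ("the", "DET"),
    ("charger", "NOUN"), ("and", "CONJ"), ("cable", "NOUN"),
]
content_words = remove_stop_words(tagged)
```

Checking both the tag and the word itself reflects the point above: a frequency-based list alone is not enough, and the part-of-speech information has to come from somewhere.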
A parser
A parser is software that marks words with their parts of speech and determines their roots. For example, an English parser would identify the words “runs,” “ran,” and “running” as different variations of the same root word, “run,” and mark all three as verbs. When new data is uploaded into Luminoso’s software, this is the first step that must happen for our products to begin understanding what is being discussed. Our first step, as mentioned above, was to assess different Korean parsers to determine which best served our needs, after which we could begin implementation.
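To show the shape of a parser's output (a root and a part of speech for each word), here is a deliberately tiny Python sketch. A real parser relies on statistical models or large dictionaries; the hand-written lookup table `TOY_LEXICON` and the function `toy_parse` below are hypothetical and only cover the "run" example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str  # the word as written
    root: str     # dictionary form
    pos: str      # part of speech


# Hand-written lookup table standing in for a real parser's analysis.
TOY_LEXICON = {
    "runs": ("run", "VERB"),
    "ran": ("run", "VERB"),
    "running": ("run", "VERB"),
}


def toy_parse(text):
    """Split on whitespace and look each word up; unknown words keep their form."""
    tokens = []
    for surface in text.lower().split():
        root, pos = TOY_LEXICON.get(surface, (surface, "UNKNOWN"))
        tokens.append(Token(surface, root, pos))
    return tokens


parsed = toy_parse("runs ran running")
roots = {token.root for token in parsed}  # all three forms share one root
```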
For Korean, implementation meant taking time to explore noun and verb endings, to determine whether they are inherently part of their words or removable attachments. We also needed to explore the mistakes the parser made, such as classifying some verbs as suffixes. And like many parsers, it had no information about online slang, like the partial syllables used in Korean to indicate laughter (ㅋㅋㅋㅋㅋ) or crying (ㅠㅠㅠ).
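One simple way to handle slang like this is to recognize it before the text ever reaches the parser. The Python sketch below is an assumption about how that could be done, not a description of our pipeline: it replaces runs of ㅋ or ㅠ with placeholder tokens whose names (`<LAUGH>`, `<CRY>`) are invented for the example.

```python
import re

# Runs of two or more of the partial syllables mentioned above.
LAUGHTER = re.compile(r"ㅋ{2,}")
CRYING = re.compile(r"ㅠ{2,}")


def tag_slang(text):
    """Replace laughter and crying runs with placeholder tokens, then split."""
    text = LAUGHTER.sub(" <LAUGH> ", text)
    text = CRYING.sub(" <CRY> ", text)
    return text.split()


tokens = tag_slang("재밌다ㅋㅋㅋㅋㅋ")  # "it's fun" followed by laughter
```

Collapsing a run of any length into one token means ㅋㅋ and ㅋㅋㅋㅋㅋ are treated as the same signal rather than as two unknown words.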
When it came to stop words, the common function words we want our analyses to ignore, we needed to adapt the parser to ignore pronouns and conjunctions as it does in other languages. Beyond those, however, we found several kinds of adjectives that the parser subcategorized for us, and so we took time to use whatever information it could give us, as well as adding exceptions for cases where it didn't tell us enough.
Another challenge when preparing any language is its rules for negation, and the system in Korean posed challenges we had not seen in other languages. Korean has a "short form" negation analogous to putting "not" before a verb, and a "long form" that resembles adding "it is not the case that..." to the start of a sentence; the complications arise because the short form precedes verbs, while the long form follows them. We wanted to make sure that our system properly accounted for both kinds of negation: you certainly wouldn't want to think that a customer said they were happy when in fact they said they weren't happy, but with a different grammar than we expected!
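The two negation patterns can be sketched in Python as follows. The example assumes a parser has already split each sentence into morphemes; the segmentation shown and the function `negated_indexes` are hypothetical. It uses 안 and 못 as short-form negators, and the connective 지 followed by 않 or 못하 as the long form, which covers the common cases but is far from a complete treatment of Korean negation.

```python
SHORT_NEGATORS = {"안", "못"}    # appear before the predicate
LONG_CONNECTIVE = "지"           # attaches to the predicate stem
LONG_NEGATORS = {"않", "못하"}   # auxiliary that follows the connective


def negated_indexes(morphemes):
    """Return the indexes of morphemes negated by either pattern."""
    negated = set()
    for i, morpheme in enumerate(morphemes):
        # Short form: the negator precedes the predicate it negates.
        if morpheme in SHORT_NEGATORS and i + 1 < len(morphemes):
            negated.add(i + 1)
        # Long form: predicate, then connective, then negative auxiliary.
        if (
            morpheme == LONG_CONNECTIVE
            and i >= 1
            and i + 1 < len(morphemes)
            and morphemes[i + 1] in LONG_NEGATORS
        ):
            negated.add(i - 1)
    return negated


# Two ways of saying "not good", using the stem 좋 ("good").
short_form = negated_indexes(["안", "좋", "아요"])
long_form = negated_indexes(["좋", "지", "않", "아요"])
plain = negated_indexes(["좋", "아요"])
```

In both negated sentences the stem 좋 ends up marked, even though the negator sits before it in one case and after it in the other.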
Once we have our resources assessed and implemented, we can begin the process of refinement: adjusting and tweaking the way our software handles a particular language so that it returns results at the highest quality possible. This typically includes reviewing data from many different sources with a native speaker of that language, as well as streamlining our code as necessary to optimize the software’s performance.
When all is said and done, it takes 4-6 months for Luminoso to learn a new language. Unfortunately, during that same time, all I learned was three or four Korean nouns and a handful of prepositions....