What happens when the CEO of Luminoso comes to my office and asks, “Can we do Arabic?”

In general, when people ask me whether Luminoso’s software can handle a language we don’t yet support – Estonian, Esperanto, Klingon, what have you – my answer is always “Yes, of course”. Admittedly, I follow this up with “That is to say, you can put it into the system and see what happens” … which is my answer because “handling” a language involves a number of complicated factors. We’d like to have some background knowledge in the language, and we’d like a word frequency list (see our chief science officer Robyn Speer’s blog post from earlier this month for more on that topic).

But the thing we need most is software to parse the text: to break it up into words and to give us base forms we can use to represent those words. Without that, analysts are left looking at our software and thinking, “Well, here’s what e-book users say about ‘reading’, and here’s what they say about ‘read’, and here’s what they say about ‘reads’, and … why are these different concepts?”. Of course, they’re not different concepts, but if you did put Klingon into our system, it wouldn’t know that be’Hom and be’Hompu’ are the same concept. (Those mean “girl” and “girls”. I had to look them up.) You would still find insights – you’d probably learn that “battle” and “happiness” are closely related in Klingon – they just wouldn’t be quite as solid as they would be if we had a parser.

So when the CEO comes to my office and asks, “Can we do Arabic?”, I give this explanation, ending with something like “So all we would need is software that can convert plurals to singulars and so forth.” At which point she says to me, “Terrific! Get right on that” – and I am reminded that talking to your CEO is different than talking to most other people. (Of course, to be fair, she knew we already have software that would do most of the work; my real task would be evaluating it and working around any idiosyncrasies I found.)

In truth, though, while the project looked daunting, it also looked exciting. Developing Russian for our product was an interesting journey, but in some ways a very familiar one. Russian has a different alphabet, but like English it forms plurals by putting a suffix on a noun, and forms tenses and other verb variations by putting a suffix on a verb, and so forth. All a parser has to do is recognize the word, take some letters off the end, and voilà: a root word that represents the base concept! Arabic doesn’t work that way at all.
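To make the "take some letters off the end" idea concrete, here is a minimal sketch of suffix stripping for English. The suffix list and the minimum stem length are assumptions chosen for illustration; this is not the parser Luminoso uses.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "s")
MIN_STEM_LENGTH = 3  # assumption: avoid stripping very short words like "is"


def strip_suffix(word):
    """Return the word with one known suffix removed, if enough letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


forms = ["reading", "reads", "read"]
stems = {form: strip_suffix(form) for form in forms}
```

All three forms end up under the single stem "read", which is exactly the grouping an analyst wants. The approach depends entirely on the variation living at the end of the word.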

How does Arabic work?

It turns out that there were two basic challenges to parsing Arabic, and its approach to suffixes was only the first one.

Take the Arabic root كتب, which is just the three consonants k, t, and b. It means “write”, and interspersing certain vowels will give you the words for “he wrote” (kataba), or “he writes” (yaktubu), or even “he dictates”, along with other vowels for the “I” form, the “you” form, and so forth. Add different vowels and you get a slew of related nouns: “book” (kitaab) or “library” (maktaba) or “office” (maktab) … to say nothing of the vowels you would change those to if you wanted a plural like “books” (kutub) or “offices” (makatib). All of which would be complicated enough, except that outside of the Qur’an, most of the vowels are almost never written, leaving a parser to reconstruct “yaktubu” from just “yktb”, and to know that “yktb” is the same concept as the verb “write” but not the noun “book”. This bears so little relation to English or French or Russian that I hesitated to even believe anyone could write a parser to handle it.
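A toy sketch shows why the missing vowels hurt. It works on the Latin transliterations used above, not on Arabic script, and it drops every vowel, including long vowels that real Arabic spelling does write. The function names are hypothetical.

```python
VOWELS = set("aiu")  # assumption: the only vowels in these transliterations


def skeleton(translit):
    """Drop the vowels from a transliterated word, keeping only consonants."""
    return "".join(ch for ch in translit if ch not in VOWELS)


def has_root(translit, root="ktb"):
    """Check that the root consonants occur in order inside the word's skeleton."""
    remaining = iter(skeleton(translit))
    return all(consonant in remaining for consonant in root)


words = ["kataba", "yaktubu", "kitaab", "kutub", "maktab", "makatib"]
skeletons = {word: skeleton(word) for word in words}
```

Here "kataba" (he wrote), "kitaab" (book), and "kutub" (books) all collapse to "ktb", and "maktab" and "makatib" both collapse to "mktb". Finding the root is the easy part; deciding which word a bare skeleton stands for needs context that a sketch like this does not have.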

Fortunately, I didn’t have to write the parser; once I had one that worked, I would merely need to offer some guidance, correct it when it went astray, and decide which of its many outputs I wanted (yaktubu? yktb? ktb? something in between?). Unfortunately, the language’s rules for word formation were only the first problem; my second problem was that no one speaks Arabic.

Now, obviously that can’t be true; with over 240 million speakers, Arabic is the fifth most spoken language in the world. It turns out, however, that what no one speaks is standard Arabic – that is, Modern Standard Arabic, or MSA. When speaking formally or in an international setting, as at the United Nations or on Al-Jazeera, speakers do indeed use this standard form. Outside of such settings, speakers use their local dialect: Moroccan, Sudanese, Egyptian, Levantine, and many others, and that extends to writing, especially in online forums like Twitter. Often the local written form matches the local spoken form – not unknown in online English, where someone might write “deez” instead of “these”, but much more common in written Arabic, and in this case rather than getting a nonsense word from a small variation in the spelling of “these”, you get a word meaning “delirious”. (Which actually happens.)

Early in the career of a computational linguist, you learn that most language-processing systems are designed to work on standard versions of languages: a French parser may not handle quirks of Québécois French, and an English parser probably used news articles as training data and won’t know many of the words it sees on Twitter. Any Arabic parser would similarly be based on Modern Standard Arabic; could it be convinced to handle dialects?

Of course, there was also a third problem I haven’t even mentioned: I don’t speak Arabic. But here at Luminoso, we don’t let minor technicalities stop us, so we contracted a native speaker to help me, I downloaded a few apps to teach me the alphabet, and off we went.

What a parser can (and can’t) do

On the bright side, writing a program to parse Arabic wouldn’t really be my job; I only needed to evaluate the ones available and build on those. Some initial exploration suggested that pretty good parsers did indeed already exist. All the same, putting Arabic in our system wouldn’t be as simple as dropping one into our software and letting it roam free.

Many Arabic parsers are built on the grammatical structures seen in the Qur’an, which is written in language essentially the same as Modern Standard Arabic. Therefore, they may classify the prefix “l-” as ambiguous between the preposition “to” and an indicator of emphasis on the noun, but the latter is only used in literary Arabic (for instance, the Qur’an). We had to tell our software that if the parser categorized anything as “emphatic particle”, it should go back and find another option.
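The "go back and find another option" rule might look something like the sketch below. The analysis format (a list of dictionaries with a "tags" field) and the tag names are hypothetical; a real parser will have its own output structure.

```python
# Assumption: tags we never want in everyday, non-literary text.
BANNED_TAGS = frozenset({"emphatic_particle"})


def choose_analysis(analyses, banned_tags=BANNED_TAGS):
    """Return the first analysis with no banned tag, else fall back to the first one."""
    for analysis in analyses:
        if banned_tags.isdisjoint(analysis["tags"]):
            return analysis
    return analyses[0] if analyses else None


# Two hypothetical readings of a word that starts with the prefix "l-".
candidates = [
    {"stem": "ktb", "tags": ["emphatic_particle", "noun"]},
    {"stem": "ktb", "tags": ["preposition", "noun"]},
]
chosen = choose_analysis(candidates)
```

Falling back to the first analysis when every option is banned is a design choice in this sketch: it keeps the word in the analysis rather than dropping it.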

But there were other, subtler problems inherent to the nature of Arabic grammar. An “a-” prefix on a verb might indicate a causative form; it’s this form that turns “he writes” into “he dictates” (i.e., he causes someone to write), or “to know” into “to inform” (i.e., to cause someone to know something). On the other hand, an “a-” prefix can also indicate that “I” is the subject of the verb. A good Arabic parser may return both alternatives, but we found that we couldn’t necessarily rely on our parser to guess which was right in a particular sentence. For this, I had to sit down with our native speaker and simply look at a lot of sentences and their parses, asking for each, “Did the parser return the right result here? What about here? If the result was wrong, was it at least a reasonable interpretation in context, or can we determine which result we wanted?”

In the end, we did have to accept some limitations of the parser. The Arabic word ما (“maa”) means “what”, but it is also used for negation in some circumstances, and deciding which was which proved too difficult for the computer. You see ambiguity in all languages, of course: in English, “can” might mean “is able to”, in which case it’s an ignorable common word, or it might mean “metal container”, in which case we wouldn’t want to ignore it. But most cases are easy to distinguish – you don’t even need the whole sentence to know which “can” is which in the phrases “the can” or “can see”. In this case, where both meanings are common function words, it became much harder to get reliable results.

The dialect problem never went away, but we did learn to minimize its effects. We included several common dialect spellings of function words on our “words to ignore” list, so that even if the parser thought they were nouns or verbs, we knew to skip them in our analysis. And we found that in an international data set like hotel reviews, there was enough Modern Standard Arabic for us to successfully gain insights from it. I’d want to fine-tune the program before loading, say, thousands of sentences of a single dialect, especially if that dialect varies significantly from the standard (Tunisian Arabic, for example, has influences from several European and African languages), but after the development we’ve already done, I’d be confident in our ability to do that fine-tuning.
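The "words to ignore" idea can be sketched as a filter on the written form of each word, applied regardless of the tag the parser assigned. The entries below are illustrative examples only, not the list actually used in the product, and the token format is hypothetical.

```python
# Illustrative entries only; not the list actually used in the product.
IGNORED_FUNCTION_WORDS = {
    "في",  # standard "in"
    "من",  # standard "from"
    "مش",  # dialectal negation, common in Egyptian and Levantine writing
    "شو",  # Levantine "what"
}


def content_terms(parsed_tokens):
    """Keep the stems of tokens whose written form is not on the ignore list."""
    return [
        token["stem"]
        for token in parsed_tokens
        if token["surface"] not in IGNORED_FUNCTION_WORDS
    ]


# Hypothetical parser output: the dialect word was mis-tagged as a noun.
tokens = [
    {"surface": "مش", "stem": "مش", "pos": "noun"},
    {"surface": "الفندق", "stem": "فندق", "pos": "noun"},  # "the hotel"
]
terms = content_terms(tokens)
```

Because the check looks at the surface form rather than the part of speech, a mis-tagged dialect word is skipped even when the parser is confident it found a noun.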

A final unexpected challenge came when we looked at the results in our visualizer: many things were backwards! Not the words, fortunately, but arrows would point in the wrong direction, text would align flush against the wrong edge, even quotation marks would appear at the wrong edge of the text. It turns out that many, many programs, including web browsers, simply despair when you mix text that reads left-to-right (like English) with text that reads right-to-left (like Arabic).

It’s as confusing as it sounds.

That one turned out to be far easier to fix than we expected: style sheets for web pages allow you to specify that the direction of the text is right-to-left, at which point the browser flips everything to look the way it should.
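For anyone generating such pages from code, here is a minimal sketch of choosing the direction automatically. It uses the first strongly directional character as a heuristic, which is an assumption of this sketch and not a description of how our visualizer decides; the function names are hypothetical. The `dir` attribute and the CSS `direction` property are the standard ways to tell a browser which way text runs.

```python
import unicodedata
from html import escape


def base_direction(text):
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            return "rtl"
        if bidi_class == "L":  # Latin, Cyrillic, and other left-to-right letters
            return "ltr"
    return "ltr"  # assumption: default when there are no letters at all


def wrap(text):
    """Wrap text in a span that states its direction explicitly."""
    direction = base_direction(text)
    return f'<span dir="{direction}" style="direction: {direction}">{escape(text)}</span>'


arabic_direction = base_direction('"كتب"')
english_direction = base_direction('"write"')
```

Quotation marks and digits are skipped because they have no strong direction of their own, which is also why they tend to land on the wrong side when the direction is left unspecified.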

What now?

In the end, I’m quite pleased at how well our system handles Arabic. Starting as a task that I knew would be hard and I feared would be simply impossible, this project has ended with the ability to find insights in Arabic text that I’d readily put up against our French or Russian capabilities. I can now tell people that I’ve taught a computer to understand Arabic, which may be an exaggeration, but it does still understand more Arabic than I do.

Adding Arabic also means that we can now find insights in the language of nearly 40% of the world’s population, including all six languages of the United Nations; and that we cover four of the five most spoken languages in the world – and who knows, perhaps Hindi will be next (unless Klingon turns out to have higher demand than I anticipated, in which case, Heghlu’meH QaQ jajvam).
