What is Spanglish? Before designing an algorithm to detect Spanglish words, I wasn’t exactly sure.
So I did what most data scientists would do when uncertain: I searched for a definition of Spanglish, a corpus of Spanglish terms, and historical algorithms with some record of success.
All of these steps posed their own challenges.
First, a definition was not as clear-cut as I assumed, as Spanglish can refer to every way in which Spanish and English are blended, whether in words, phrases, grammar, dialects, or any other aspect of language.
SEC, or Spanish-English Code-switching, refers to the construction of Spanglish phrases or sentences, but our goal was to identify Spanglish words specifically, a vocabulary that is large and continuously growing.
The second step, accessing a corpus, was even more difficult as there is no satisfactory index of Spanglish words available. (In fact, during my extensive search, I accidentally downloaded the corpus for the 2004 Adam Sandler/Téa Leoni film.)
This meant we had to build our own corpus using multiple sources:
A Spanish corpus, built with Python NLTK and Google language identification from a collection of Spanish newscasts
An English corpus, built with Python NLTK and Google language identification from a collection of English newscasts
A corpus of other languages (neither Spanish, English, nor Spanglish), built with Python NLTK and Google language identification from a collection of English, Spanish, and other newscasts
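The corpus-building step above can be sketched in a few lines. This is a toy version: the lexicons are made up, and `detect_language` is a hypothetical stand-in for Google language identification (the real pipeline used NLTK's word lists and Google's language ID over newscast transcripts):

```python
# Tiny stand-in lexicons; the real pipeline drew on NLTK's Spanish and
# English word corpora, which are far larger.
SPANISH_LEXICON = {"hola", "calle", "trabajo", "casa"}
ENGLISH_LEXICON = {"hello", "street", "work", "house"}

def detect_language(word):
    """Hypothetical per-word language ID (stand-in for Google's service)."""
    if word in SPANISH_LEXICON:
        return "es"
    if word in ENGLISH_LEXICON:
        return "en"
    return "other"

def build_corpora(newscast_tokens):
    """Partition tokens into Spanish, English, and other-language corpora."""
    corpora = {"es": set(), "en": set(), "other": set()}
    for token in newscast_tokens:
        corpora[detect_language(token.lower())].add(token.lower())
    return corpora

tokens = ["Hola", "street", "trabajo", "bonjour", "house"]
corpora = build_corpora(tokens)
```

The "other" bucket is simply everything the two lexicons miss, which is exactly why a third, catch-all corpus was needed.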
When we were satisfied that the corpus (10,039 Spanglish words) was large enough (though far from comprehensive), we moved on to historical algorithms that have been used to detect Spanglish. Unfortunately, those I found all had shortcomings, mostly stemming from two flawed methods:
Cross-referencing against a corpus: without a complete Spanglish corpus, no algorithm can accurately match words against it
Process of elimination: flagging any word that is neither Spanish nor English surfaces far too many edge cases (typos, proper nouns, third languages)
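To see why the process-of-elimination method breaks down, here is a minimal illustration. The lexicons and example words are mine, chosen for clarity rather than taken from the actual corpus:

```python
# Toy lexicons standing in for full Spanish and English word lists.
SPANISH = {"hola", "parque", "trabajo"}
ENGLISH = {"hello", "park", "work"}

def flag_by_elimination(word):
    """Flag anything that is neither Spanish nor English as 'Spanglish'."""
    return word not in SPANISH and word not in ENGLISH

# A genuine Spanglish word does get flagged...
assert flag_by_elimination("parquear")   # Spanglish for "to park"
# ...but so do typos, proper nouns, and third languages:
assert flag_by_elimination("wrok")       # typo of "work"
assert flag_by_elimination("tokyo")      # proper noun
assert flag_by_elimination("bonjour")    # French
```

Every one of those false positives looks identical to the algorithm, which is how the error rate balloons.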
When applied to our test set, these historical methods topped out at 20% accuracy, far from anything we could rely on. So we set out to create something highly accurate that didn't require a comprehensive corpus.
Our model respected the primary language of the text; we did no translation. Instead, we used neural embeddings to determine parts of speech and phrase structure in all recognized languages, leaving Spanglish out.
We combined all the methods of prior efforts and ran simulations to ensure our model could distinguish between English, Spanish and all other known languages.
Then, instead of scanning for Spanglish words and cross-referencing them against our corpus, we used K-means clustering to estimate the probability of each word belonging to the Spanglish cluster, given its part of speech and its similarity to the other language clusters.
This means that rather than scanning for the specific words that were in the Spanglish corpus, we were scanning for words that belonged in the corpus, regardless of whether or not they had already been included.
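The cluster-membership scoring can be sketched as follows. The 2-D embeddings and centroid positions here are invented for illustration (the real model used high-dimensional neural embeddings and K-means-fitted centroids), and the softmax-over-distances conversion is one common way to soften K-means assignments, not necessarily the exact production formula:

```python
import math

# Illustrative 2-D cluster centroids; in practice these come from
# fitting K-means on neural word embeddings.
CENTROIDS = {
    "english":   (0.0, 0.0),
    "spanish":   (4.0, 0.0),
    "spanglish": (2.0, 2.0),
}

def membership_probabilities(embedding):
    """Softmax over negative centroid distances: nearer means more probable."""
    dists = {lang: math.dist(embedding, c) for lang, c in CENTROIDS.items()}
    weights = {lang: math.exp(-d) for lang, d in dists.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# A word sitting between the monolingual clusters but near the
# Spanglish centroid scores highest for Spanglish:
probs = membership_probabilities((2.1, 1.8))
```

The key property is that a word never seen before still gets a score, based purely on where its embedding lands relative to the clusters.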
After thousands of simulations, our clustering algorithm was able to determine the probability of a given sentence containing a Spanglish word given its composition.
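One simple way to lift per-word probabilities to a sentence-level score (a sketch of the idea, not necessarily the exact formula our model learned) is the probability that at least one word in the sentence is Spanglish, assuming independence between words:

```python
def sentence_spanglish_probability(word_probs):
    """P(at least one Spanglish word) = 1 - product of (1 - p_i)."""
    prob_none = 1.0
    for p in word_probs:
        prob_none *= (1.0 - p)
    return 1.0 - prob_none

# Three words with Spanglish probabilities 0.1, 0.9, and 0.2:
score = sentence_spanglish_probability([0.1, 0.9, 0.2])
```

A single high-probability word dominates the score, which matches the intuition that one clear Spanglish word is enough to mark a sentence.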
And by the end, measured on a labeled validation set, we had reached 92% accuracy in identifying actual Spanglish words.
This led me to my final conclusion on Spanglish: since it can broadly be defined as any combination of English and Spanish, it will never be possible to maintain a satisfactory index that keeps current with its entire vocabulary. But machine learning, particularly clustering algorithms, may offer the most effective way to keep pace with this continually evolving fusion of languages.