LanguageTool uses the Apache Tika library to detect the language of the source text. Since LanguageTool supports more languages than are currently available in Tika, we've created additional language profiles and add them to Tika at runtime.
Tika doesn't support Belarusian, Catalan, Esperanto, Galician, Romanian, Slovak, Slovenian, Ukrainian, Malayalam, and Khmer. Language profiles have been added for all of these except the last two, Malayalam and Khmer.
Adding a new language
To add a new language, you need to create an n-gram profile file. This is a collection of frequency counts for letter trigrams (sequences of three characters) found in natural-language text. Here are the steps to create a new language profile:
- Get a corpus in the source language, preferably with as little formatting and as few foreign words as possible. I used Wikipedia article dumps and stripped out the punctuation and XML. The result is available at http://www.languagetool.org/download/language-training-data/. Although Wikipedia contains a lot of proper nouns and foreign words, I've found that language detection works fairly well in practice. An additional (somewhat cheating) trick: after you've created an initial .ngp profile for a language, go through your corpus file and use the LanguageIdentifier class to remove obviously foreign lines. (For example, foreign-language Wikipedias contain a lot of sentences that are entirely in English.)
- Pass the cleaned-up text to Apache Nutch's NGramProfile.create() method. (Nutch is available from the Apache Nutch project site. You'll probably also need some supporting packages.)
- Save the output n-gram profile with NGramProfile.save().
- NGramProfile.create() creates n-grams of up to four characters, but Tika only wants trigrams. Remove all the non-trigrams from the output .ngp file.
- Add the .ngp file to the appropriate resource/<lang>/ folder.
- Add the language's short name to the additionalLanguages array in LanguageIdentifierTools.java.
- Edit build.xml to copy the .ngp file over to the dist/ folder.
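
The trigram-filtering step above can be scripted rather than done by hand. The following is a minimal sketch, not LanguageTool's own tooling. It assumes the saved .ngp file is plain UTF-8 text in which lines starting with # are comments and every other line is an n-gram followed by a space and its count. The class name TrigramFilter and the method keepTrigrams are hypothetical, so check the layout of your own .ngp output before relying on it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only trigram entries from an .ngp profile.
final class TrigramFilter {

    // Assumed line layout: "#..." for comments, otherwise "<ngram> <count>".
    static List<String> keepTrigrams(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                kept.add(line);            // preserve header/comment lines
                continue;
            }
            int space = line.lastIndexOf(' ');
            if (space < 0) {
                continue;                  // skip blank or malformed lines
            }
            String ngram = line.substring(0, space);
            if (ngram.length() == 3) {     // length in UTF-16 chars
                kept.add(line);
            }
        }
        return kept;
    }

    // Usage: java TrigramFilter <input.ngp> <output.ngp>
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        Files.write(out, keepTrigrams(lines), StandardCharsets.UTF_8);
    }
}
```

Writing to a separate output file keeps the original profile intact, so the result can be compared against it before the filtered file is copied into the resource folder.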