Checking The Complete Wikipedia or a Corpus

Introduction

You can run LanguageTool over the complete Wikipedia in a given language like this (always replace XX with your language code, e.g. en for English):

  1. Download the latest LanguageTool-wikipedia-yyyymmdd-snapshot.zip from our daily builds and unzip it.
  2. Download the latest XXwiki-latest-pages-articles.xml.bz2 from http://dumps.wikimedia.org/XXwiki/latest/ and unpack it, so that you get XXwiki-latest-pages-articles.xml. Note that this file is several GB in size for some Wikipedias.
  3. In the unzipped LanguageTool directory, run java -jar languagetool-wikipedia.jar check-data --file XXwiki-latest-pages-articles.xml --language XX
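
Step 2 delivers a .xml.bz2 archive, while the command in step 3 is given the .xml file. As a rough sketch of the unpacking in between, the following Python snippet builds the download URL for a language code and decompresses the archive in chunks. The helper names dump_url and unpack_dump are made up for this illustration and are not part of LanguageTool; any bzip2 tool does the same job.

```python
import bz2
import shutil
from pathlib import Path


def dump_url(lang: str) -> str:
    """Build the download URL from step 2 for a language code such as 'en'."""
    name = f"{lang}wiki-latest-pages-articles.xml.bz2"
    return f"http://dumps.wikimedia.org/{lang}wiki/latest/{name}"


def unpack_dump(archive: Path) -> Path:
    """Decompress foo.xml.bz2 to foo.xml in the same directory.

    The data is streamed in 1 MB chunks, so a multi-GB dump never has
    to fit into memory.
    """
    target = archive.with_suffix("")  # strips only the trailing ".bz2"
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    return target
```

For example, unpack_dump(Path("enwiki-latest-pages-articles.xml.bz2")) writes enwiki-latest-pages-articles.xml next to the archive and returns its path.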

The problem is that this takes a very long time, even on fast hardware (several days for languages with a large Wikipedia, such as English or German). The following sections therefore describe an alternative approach: you analyze Wikipedia once and can then run single rules against it quite fast. You should be aware of the following drawbacks:

  • Initial analysis still takes hours for large Wikipedias.
  • It only works for pattern rules (those from grammar.xml), not for Java rules.

Still, it's a good way to evaluate rules against Wikipedia and to find all Wikipedia articles that match a given rule.

Howto

In the following instructions, always replace XX with your language code, e.g. en for English.

  1. Download and unpack the LanguageTool and Wikipedia files as described above.
  2. In the unzipped LanguageTool directory, run java -jar languagetool-wikipedia.jar index-data /path/to/XXwiki-latest-pages-articles.xml index-dir XX 0 to index the dump. This might take several hours for large Wikipedias.
  3. Run java -jar languagetool-wikipedia.jar search RULE_ID XX index-dir, where RULE_ID is the id of the pattern rule you want to run. This will find all candidate sentences that might match the rule, run LanguageTool on them and then print the matching sentences.
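
If you want to evaluate several rules against the same index, the search step can be scripted. The following Python sketch only assembles the two command lines from the steps above and runs one search per rule id. The function names are invented for this example, and the argument order (including the trailing 0 of the index step) is copied unchanged from the commands above rather than taken from any further documentation.

```python
import subprocess

JAR = "languagetool-wikipedia.jar"


def index_command(dump_xml: str, index_dir: str, lang: str) -> list[str]:
    # Argument order and the trailing "0" are copied unchanged from step 2.
    return ["java", "-jar", JAR, "index-data", dump_xml, index_dir, lang, "0"]


def search_command(rule_id: str, lang: str, index_dir: str) -> list[str]:
    # Argument order copied unchanged from step 3.
    return ["java", "-jar", JAR, "search", rule_id, lang, index_dir]


def search_rules(rule_ids: list[str], lang: str, index_dir: str) -> None:
    """Run one search per rule id.

    Must be started from the unzipped LanguageTool directory so that
    the jar file is found.
    """
    for rule_id in rule_ids:
        subprocess.run(search_command(rule_id, lang, index_dir), check=True)
```

A call such as search_rules(["RULE_A", "RULE_B"], "en", "index-dir") would then run the two searches one after the other; RULE_A and RULE_B are placeholders for real pattern rule ids.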

Background

The idea behind this approach is to extract all sentences from Wikipedia and index them using Lucene. This way, we have a fast way to find sentences that contain a given word. For example, a rule that finds uses of the word "lose" that should actually be "loose" only needs to check sentences that contain "lose". Only these sentences are then checked using LanguageTool. How well this works depends on the rule - there might be rules that require a word like "the", which appears in almost every sentence. In that case, this approach won't be fast.
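
The principle can be illustrated with a toy inverted index in Python. This is only a sketch of the idea and not how LanguageTool or Lucene implement it: the tokenizer is a simple regular expression, the index lives in memory, and all names are chosen for this example.

```python
import re
from collections import defaultdict


def tokenize(sentence: str) -> set[str]:
    """Lowercased set of words; a crude stand-in for a real analyzer."""
    return set(re.findall(r"\w+", sentence.lower()))


def build_index(sentences: list[str]) -> dict[str, set[int]]:
    """Map each word to the positions of the sentences that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for pos, sentence in enumerate(sentences):
        for word in tokenize(sentence):
            index[word].add(pos)
    return index


def candidates(index: dict[str, set[int]], sentences: list[str],
               required_word: str) -> list[str]:
    """Return the sentences containing the word a rule requires.

    Only these candidates would be handed to the expensive full check.
    """
    positions = index.get(required_word.lower(), ())
    return [sentences[i] for i in sorted(positions)]


sentences = [
    "The screw came lose during the flight.",
    "They did not want to lose the game.",
    "The dog ran loose in the park.",
]
index = build_index(sentences)
print(candidates(index, sentences, "lose"))
```

Looking up "lose" returns only the first two sentences, so the third one never has to be checked. Looking up "the" returns all three, which is the case where the index saves no work.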

Custom corpus

You can also index a text file (for example, a large corpus). Run java -jar languagetool-wikipedia.jar index /path/to/corpus.txt index-dir XX.

Note:

  1. The indexer uses your platform encoding. On Windows, you might also need to specify the encoding using a Java VM property if the platform encoding differs from the file encoding (e.g., java -Dfile.encoding=UTF8 -jar…).
  2. The indexer does not allow very long lines. You should use some word-wrapping software to keep lines shorter than 255 characters (for example, the fold command or an equivalent). By default, the indexer splits paragraphs at two consecutive newlines.
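
As one possible equivalent to fold, the following Python sketch re-wraps each paragraph so that every line stays below 255 characters while the blank line between paragraphs is kept. The function name wrap_corpus is made up for this example. Note that it reflows the text inside a paragraph and loads the whole text into memory, so it is meant for illustration rather than for very large corpora.

```python
import textwrap


def wrap_corpus(text: str, width: int = 254) -> str:
    """Wrap every paragraph so that no line exceeds `width` characters.

    Paragraphs are separated by two newlines; that separator is kept,
    while line breaks inside a paragraph are replaced by fresh ones.
    """
    paragraphs = text.split("\n\n")
    wrapped = [
        textwrap.fill(" ".join(p.split()), width=width,
                      break_long_words=True, break_on_hyphens=False)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)
```

When reading and writing the corpus file around this function, it is safest to pass the encoding explicitly (for example encoding="utf-8" in open) so that the result does not depend on the platform encoding mentioned in the first note.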