Archive: Developing a tagger dictionary

IMPORTANT NOTE: this is an archived page, please use the process described at Developing a tagger dictionary instead

Introduction

A tagger, or POS (part-of-speech) tagger, annotates words with their part-of-speech information (see Wikipedia for more background information). In practice, POS tags usually carry more than the part of speech, for example morphological information such as number (singular or plural).

Most taggers in LanguageTool are dictionary-based. Statistical or context-oriented taggers are trained to ignore occasional grammar errors, so they tend to tag a sentence as if it were correct even when it is not. While this is desirable for most natural language processing applications, it is exactly the wrong behavior for grammar checking.

However, we haven't tried training a statistical tagger on incorrect input to correct its output; this remains to be tested by someone who has enough time. We did test lexicon-based taggers/lemmatisers. For most languages, we encode them as finite-state automata: the plain text files are compiled with a tool from the morfologik-stemming library, and the resulting binary files are read at runtime from Java code by the morfologik-stemming library, which is bundled with LanguageTool.

Preparing the files is a bit tricky at first, but it's worth the effort: the Polish dictionary is about 190 MB as a text file but gets squeezed into less than 3 MB as an automaton, and the automaton tagger is very fast. We cover the usual steps below.

Building the automaton

Use morfologik-stemming library 1.5.2 or newer. It has the ability to build binary automata from the command line. To use it, download the binary distribution of morfologik-stemming.

Preparing the lexicon

The input file for the automaton is a text file with three columns, usually tab-separated: the first field is the inflected word form, the second is the lemma, and the third is the POS (part-of-speech) tag. Don't use whitespace in complex expressions: the tagger ignores whitespace anyway, so it only wastes space in your dictionary. Example:

boyar    boyar    NN
boyard    boyard    NN
boyardism    boyardism    NN:UN
boyards    boyard    NNS
boyarism    boyarism    NN:UN
boyarisms    boyarism    NNS
boyars    boyar    NNS

(for the meaning of the English tags, see the documentation in our git repository)

Note that the process needs an input file with UNIX line endings, so if your lexicon file comes from Windows, run dos2unix on it before you proceed.
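
The format and line-ending requirements above can be checked before running any tools. The following is a minimal Python sketch, not part of LanguageTool or morfologik-stemming: the names `Entry` and `parse_lexicon_line` are made up for illustration, and the sketch assumes the fields are separated by tabs.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    form: str   # inflected word form (first column)
    lemma: str  # base form (second column)
    tag: str    # POS tag (third column)


def parse_lexicon_line(line: str) -> Entry:
    """Split one tab-separated lexicon line into its three fields."""
    # A carriage return means the file still has Windows line endings.
    if line.endswith("\r\n") or line.endswith("\r"):
        raise ValueError("Windows line ending found; run dos2unix first")
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or any(field == "" for field in fields):
        raise ValueError(f"expected three non-empty fields: {line!r}")
    return Entry(*fields)


entry = parse_lexicon_line("boyards\tboyard\tNNS\n")
print(entry.lemma, entry.tag)
```

Raising an error on the first bad line is only one possible design; for a large lexicon you may prefer to collect all problems and report them together.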

To build the dictionary file, use the tab2morph and fsa_build tools from the morfologik-stemming library:

java -jar morfologik-tools-*-standalone.jar tab2morph -i input.txt -o output.txt
java -jar morfologik-tools-*-standalone.jar fsa_build -i output.txt -o output.dict

To make the file work with LanguageTool's standard morfologik-stemming tagger, you also need an .info file, placed in the same directory as the binary automaton:

# Dictionary properties
fsa.dict.separator=+
fsa.dict.encoding=iso-8859-1
fsa.dict.uses-prefixes=false
fsa.dict.uses-infixes=false

UTF-8 encoding

Note that you can use UTF-8 as the dictionary encoding in the .info file, but it must match the encoding of the input.txt file.
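
A mismatch between the declared encoding and the real encoding of the lexicon can sometimes be caught early. Here is a rough Python sketch, assuming the .info file consists of simple key=value lines as shown above; `read_info` and `decodes_as` are hypothetical helper names, not part of any library.

```python
def read_info(text: str) -> dict:
    """Parse the key=value lines of an .info file, skipping comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the raw lexicon bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


info = "# Dictionary properties\nfsa.dict.separator=+\nfsa.dict.encoding=utf-8\n"
props = read_info(info)
# A lexicon line saved as ISO-8859-1 although the .info file declares UTF-8.
latin1_bytes = "caf\u00e9\tcaf\u00e9\tNN\n".encode("iso-8859-1")
print(decodes_as(latin1_bytes, props["fsa.dict.encoding"]))
```

This check only works in one direction: any byte sequence decodes as ISO-8859-1, so a UTF-8 file wrongly declared as iso-8859-1 will not be detected this way.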

Internal FSA field separators

Note: if your dictionary contains lemmas or inflected forms with "+", you need to change the separator character. By default, FSA uses '+' to separate the inflected form from the lemma, and the lemma from the tags. Usually, "_" is a safe bet, as this character is rarely part of real dictionary words. Set the following line in the .info file:

fsa.dict.separator=_

You also need to pass the same separator to the tab2morph tool using its -sep option.

If only the tag field contains the separator character (like '+' for Polish), you don't have to worry: processing stops after the second separator, so the tag field may contain as many separator characters as you want. Joining the tags with '+' can make the binary dictionary more compact, and the tagger class PolishTagger.java is able to split the joined tag correctly.
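
To decide whether the default separator is safe for your lexicon, you can scan the first two columns for it. This is a small Python sketch under the assumption of tab-separated input; `separator_conflicts` is a made-up name and the sample entries and tags are invented for illustration.

```python
def separator_conflicts(lines, separator="+"):
    """Return entries whose inflected form or lemma contains the separator.

    The tag field (third column) is deliberately not checked, because
    separator characters inside the tags are harmless.
    """
    conflicts = []
    for line in lines:
        form, lemma, tag = line.rstrip("\n").split("\t")
        if separator in form or separator in lemma:
            conflicts.append((form, lemma, tag))
    return conflicts


sample = ["C++\tC++\tNN\n", "byl\tbyt\tverb:past+verb:sg\n"]
print(separator_conflicts(sample))       # only the C++ entry conflicts with "+"
print(separator_conflicts(sample, "_"))  # nothing conflicts with "_"
```

If the first call returns anything, switch to another separator and repeat the check with that character before rebuilding.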

Making the tagger file smaller

If the lexicon includes many words with prefixes or infixes, you can try to make the dictionary file smaller and faster to read from disk.

Use the following commands to get the resulting file:

java -jar morfologik-tools-*-standalone.jar tab2morph -pre -i input.txt -o output.txt
java -jar morfologik-tools-*-standalone.jar fsa_build -i output.txt -o output.dict

In this case you change the line:

fsa.dict.uses-prefixes=false

to

fsa.dict.uses-prefixes=true

However, in many cases you get even better results with infixes:

java -jar morfologik-tools-*-standalone.jar tab2morph -inf -i input.txt -o output.txt
java -jar morfologik-tools-*-standalone.jar fsa_build -i output.txt -o output.dict

You need to change properties in the .info file like this:

fsa.dict.uses-prefixes=true
fsa.dict.uses-infixes=true

Note: with infix encoding, both prefixes and infixes are used, so both must be set to "true".
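
The three encoding variants and their matching .info settings are easy to mix up, so here they are summarized as a Python sketch. The mode names "plain", "prefix" and "infix" and both function names are labels chosen for this example only; the flags and property values are the ones given above.

```python
# Encoding mode -> tab2morph flag and .info settings, as described in the text.
ENCODINGS = {
    "plain":  {"flag": None,   "uses-prefixes": "false", "uses-infixes": "false"},
    "prefix": {"flag": "-pre", "uses-prefixes": "true",  "uses-infixes": "false"},
    "infix":  {"flag": "-inf", "uses-prefixes": "true",  "uses-infixes": "true"},
}


def tab2morph_args(mode, src="input.txt", dst="output.txt"):
    """Build the tab2morph arguments (everything after the jar name)."""
    flag = ENCODINGS[mode]["flag"]
    return ["tab2morph"] + ([flag] if flag else []) + ["-i", src, "-o", dst]


def info_lines(mode):
    """Return the two .info lines that must accompany the chosen mode."""
    settings = ENCODINGS[mode]
    return [
        f"fsa.dict.uses-prefixes={settings['uses-prefixes']}",
        f"fsa.dict.uses-infixes={settings['uses-infixes']}",
    ]


print(" ".join(tab2morph_args("infix")))
print("\n".join(info_lines("infix")))
```

Keeping the flag and the properties in one table makes it harder to rebuild the dictionary with one encoding while leaving the .info file set for another.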

You can make the file even smaller by using a special feature of morfologik-stemming, CFSA2 encoding. The resulting file will be about 10% smaller and will run slightly faster. To use it, add -f cfsa2 to the fsa_build command line:

java -jar morfologik-tools-*-standalone.jar fsa_build -f cfsa2 -i output.txt -o output.dict

CFSA dictionaries are recognized automatically by morfologik-stemming, so you don't need to change the properties in the .info file.

Exporting/Dumping the dictionary

If you want to see the contents of the binary file, you first need to download the binary distribution of morfologik-stemming. Unzip it and call the following command (replace "*" with the newest version number):

java -jar morfologik-tools-*-standalone.jar fsa_dump -x -d /path/to/languagetool/resource/en/english.dict >dump

The -x flag causes morfologik-stemming to unpack the internal representation of the data. Note: for this to work, the .info file must be in the same directory as the .dict file. If you don't have the .info file, use the -r (raw data) switch instead.

You can then edit the file and send us patches. Before you do so, you might want to contact us on the forum: some input files are generated automatically from many other source files, and errors in them may stem from bugs in our scripts.

Troubleshooting

  • If the file is being built very slowly and is becoming huge, check whether you have a lot of ambiguous mappings between POS tags and word endings. If so, try the trick used in the Czech and Polish dictionaries: join the POS tags with "+" and reuse the Java code from the Czech tagger. This should help make your file smaller.
  • It's wise to check that every line of the input file has exactly three non-empty fields. This is what the following gawk script does:
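# Fields are tab-separated
BEGIN {FS="\t"}
{
  # Report lines with too few or too many fields
  if (NF!=3) print "Wrong number of fields in the line: " $0
  # Report any of the first three fields that is empty
  for (i=1;i<=3;i++)
    if ($i=="") print "Empty field no. " i " on the line: " $0
}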

If you use the tab2morph tool, these warnings are printed to standard error while the data is being encoded.
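
The tag-joining trick from the first troubleshooting item can be sketched in Python as follows. This is an illustration only, not the script used for the Czech or Polish dictionaries: `join_tags` is a hypothetical name, the sample tags are invented, and the sketch assumes tab-separated input.

```python
def join_tags(lines, joiner="+"):
    """Merge lines sharing the same form and lemma into one line with joined tags."""
    merged = {}
    for line in lines:
        form, lemma, tag = line.rstrip("\n").split("\t")
        # Regular dicts keep insertion order, so the output order is stable.
        merged.setdefault((form, lemma), []).append(tag)
    return [
        f"{form}\t{lemma}\t{joiner.join(tags)}"
        for (form, lemma), tags in merged.items()
    ]


sample = [
    "domy\tdom\tsubst:pl:nom\n",
    "domy\tdom\tsubst:pl:acc\n",
    "dom\tdom\tsubst:sg:nom\n",
]
for line in join_tags(sample):
    print(line)
```

Remember that a dictionary built this way only works with a tagger class that splits the joined tags again, which is why the Java code from the Czech tagger needs to be reused.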

Building a synthesizer dictionary

It's only for the brave ones ;)

The synthesizer dictionary generates an inflected form if you feed it with a lemma and a POS tag. It works with our Synthesizer class.

You need a very fancy script in AWK to build it. Let's call it synthesis.awk:

BEGIN {FS="\t"}
{print $2"|"$3"\t"$1}

It simply reverses the order of the fields and joins the lemma and the POS tag with the "|" sign. The order is very important: otherwise the file will grow very fast and the dictionary will be useless. The commands to build a synthesizer dictionary are the following:

gawk -f synthesis.awk input.txt >output.txt
java -jar morfologik-tools-*-standalone.jar tab2morph -nw -i output.txt -o encoded.txt
java -jar morfologik-tools-*-standalone.jar fsa_build -i encoded.txt -o dictionary_synth.dict

(The -nw switch turns off warnings that make sense only for standard tagging dictionaries.)
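
If you don't have gawk at hand, the same field reordering can be done in a few lines of Python. This is only an equivalent sketch of synthesis.awk under the assumption of tab-separated input; `synthesis_line` is a name invented here.

```python
def synthesis_line(line: str) -> str:
    """Turn 'form<TAB>lemma<TAB>tag' into 'lemma|tag<TAB>form'."""
    form, lemma, tag = line.rstrip("\n").split("\t")
    return f"{lemma}|{tag}\t{form}"


print(synthesis_line("boyars\tboyar\tNNS\n"))
```

The lemma and the tag together form the lookup key, and the inflected form is what the synthesizer returns for that key.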

You also need a list of all POS tags in a text file. Save this as tags.awk:

BEGIN {FS="\t"}
{print $3}

And run:

gawk -f tags.awk input.txt | sort -u > demo_tags.txt

You also need the properties .info file:

# Dictionary properties
fsa.dict.separator=+
fsa.dict.encoding=iso-8859-1
fsa.dict.uses-prefixes=false
fsa.dict.uses-infixes=false

The only thing you can change in it is the encoding.

The synthesizer dictionary is used to generate inflected suggestions in heavily inflected languages. Note: it might be helpful to remove from the synthesizer dictionary all forms whose POS tags indicate "unknown form", "foreign word" etc., as they only take up space and probably nobody will ever use them. It is also advisable to remove all archaic forms of main verbs (see the English src/main/resources/org/languagetool/resource/en/filter-archaic.txt for an example of what you might want to exclude).
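
Removing such forms before building the synthesizer dictionary can be done with a simple filter. The Python sketch below assumes tab-separated input; `filter_for_synthesis` is a made-up name, and the tags "UNKNOWN" and "FW" are placeholders, so use whatever tags your tagset really has for unknown or foreign words.

```python
def filter_for_synthesis(lines, excluded_tags):
    """Keep only lexicon lines whose POS tag is not in excluded_tags."""
    kept = []
    for line in lines:
        form, lemma, tag = line.rstrip("\n").split("\t")
        if tag not in excluded_tags:
            kept.append(line)
    return kept


sample = [
    "boyars\tboyar\tNNS\n",
    "xyzzy\txyzzy\tUNKNOWN\n",   # placeholder tag for an unknown form
    "zeitgeist\tzeitgeist\tFW\n",  # placeholder tag for a foreign word
]
print(filter_for_synthesis(sample, {"UNKNOWN", "FW"}))
```

Run the filter on input.txt before the synthesis.awk step, so that the tagger dictionary keeps these forms while the synthesizer dictionary does not.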

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License