Compounding Support in MorfologikSpeller

MorfologikSpeller uses a finite word list encoded as an automaton. To support compounding, we could use the following algorithm:

  1. If the word is not found on the list, try to decompose it into building blocks: prefixes, infixes, suffixes, and other parts. This can be done by searching for word parts in much the same way as replaceRunOnWords does, i.e., by moving a space through the word (with a predefined maximum number of parts per compound, probably more than 2 but at most 4).
  2. We need to mark up incorrect compounds (commonly misspelled words) with a FORBID tag after a standard separator.
  3. We need to mark up prefixes (words that cannot occur in any position other than as a prefix) with a PREFIX tag.
  4. We need to mark up suffixes with a SUFFIX tag.
  5. We need to mark up infixes with an INFIX tag.

All non-FORBID words could be used to analyze a compound form. Words with PREFIX, SUFFIX and INFIX tags would not be proposed by replaceRunOnWords. To make them appear in suggestions, another instance of the same word without any tag would be needed (but suppression would still work).
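The decomposition and tagging rules above can be sketched roughly as follows. This is only an illustration under several assumptions: the word list is a plain Python dict rather than an automaton, the names LEXICON, allowed_at and decompose are hypothetical, the sample entries are made up, and the position rules (PREFIX only first, SUFFIX only last, INFIX only in the middle) are one possible reading of the tags.

```python
from typing import Dict, List, Optional, Set

# Hypothetical tagged word list: word -> set of tags (empty set = plain word).
# Tag names follow the text above; the entries are invented for illustration.
LEXICON: Dict[str, Set[str]] = {
    "un": {"PREFIX"},
    "s": {"INFIX"},
    "heit": {"SUFFIX"},
    "arbeit": set(),
    "zeit": set(),
    "frei": set(),
    "arbeitzeit": {"FORBID"},  # commonly mistaken compound, explicitly rejected
}


def allowed_at(tags: Set[str], index: int, is_last: bool) -> bool:
    """Check whether a part with these tags may stand at this position."""
    is_first = index == 0
    if "FORBID" in tags:
        return False
    if "PREFIX" in tags:
        return is_first and not is_last
    if "SUFFIX" in tags:
        return is_last and not is_first
    if "INFIX" in tags:
        return not is_first and not is_last
    return True  # untagged words may appear anywhere


def decompose(word: str,
              lexicon: Dict[str, Set[str]] = LEXICON,
              max_parts: int = 4) -> Optional[List[str]]:
    """Return the parts of a valid compound, or None if there is none."""
    if "FORBID" in lexicon.get(word, set()):
        return None  # suppression wins over any possible decomposition

    def split(rest: str, index: int) -> Optional[List[str]]:
        if index >= max_parts:
            return None  # too many parts already
        # Move the split point through the remaining input.
        for end in range(1, len(rest) + 1):
            part = rest[:end]
            if part not in lexicon:
                continue
            is_last = end == len(rest)
            if not allowed_at(lexicon[part], index, is_last):
                continue
            if is_last:
                return [part]
            tail = split(rest[end:], index + 1)
            if tail is not None:
                return [part] + tail
        return None

    parts = split(word, 0)
    # A single part is an ordinary dictionary hit, not a compound.
    return parts if parts is not None and len(parts) >= 2 else None
```

With the sample entries, decompose("arbeitszeit") yields ["arbeit", "s", "zeit"], while decompose("arbeitzeit") yields None because of the FORBID entry, even though "arbeit" and "zeit" are both known words.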

Open Questions

  • Do we want to suggest compounds instead of words that have not been found? Hunspell users usually turn this feature off.
  • Would step 1 be computationally expensive? Trying to find four splits in a word may involve a lot of character operations. Maybe we could search the automaton in a more efficient way?
    • Maybe we could remember the current edit distance and consume the rest of the string if there is input left and the input consumed so far is a word that allows compounds to be added. We must then accept only words that are allowed in compounds. Technically, we might therefore need more than one FSA: start of compound, middle of compound, end of compound. This method might provide (though not guarantee) good suggestions for compounds.
      • I am afraid that suggestions for compounds might be wrong anyway. Creating suggestions is a minor concern compared to recognizing compounds as valid words.
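One way to picture the "consume the rest of the string" idea is a single left-to-right walk over the automaton that restarts from the root whenever a final state is reached. The sketch below makes several simplifying assumptions: a nested-dict trie stands in for the Morfologik FSA, only one automaton is used instead of separate start, middle and end automata, edit distance is left out entirely, and the names build_trie and accepts are hypothetical.

```python
from functools import lru_cache
from typing import Dict, Iterable

END = ""  # marker key for "a word ends here"; never equal to a single character


def build_trie(words: Iterable[str]) -> Dict:
    """Build a nested-dict trie as a stand-in for the real automaton."""
    root: Dict = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def accepts(word: str, root: Dict, max_parts: int = 4) -> bool:
    """True if word is a concatenation of 1..max_parts words from the trie."""

    @lru_cache(maxsize=None)
    def can_finish(start: int, parts_left: int) -> bool:
        if start == len(word):
            return True  # all input consumed exactly at a word boundary
        if parts_left == 0:
            return False
        node = root
        for i in range(start, len(word)):
            node = node.get(word[i])
            if node is None:
                return False  # no word in the trie continues this way
            # Final state reached: try to consume the rest from the root.
            if END in node and can_finish(i + 1, parts_left - 1):
                return True
        return False

    return len(word) > 0 and can_finish(0, max_parts)
```

Because results are cached per (position, remaining parts) pair, each position is explored at most max_parts times, so the walk avoids re-testing the same suffix for every possible split. Whether this carries over to the real automaton API is an open question.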

Status (August 2014)

The following LanguageTool Maven projects still depend on Hunspell:

  • ast (HunspellRule)
  • da (HunspellNoSuggestionRule)
  • de (GermanSpellerRule - needs compound support)
  • eo (HunspellNoSuggestionRule)
  • es (HunspellRule)
  • fr (HunspellNoSuggestionRule)
  • gl (HunspellRule)
  • is (HunspellNoSuggestionRule)
  • km (HunspellRule)
  • nl (HunspellNoSuggestionRule, probably also needs compound support)
  • pt (HunspellNoSuggestionRule)
  • ru
  • sv (HunspellRule)
  • tl (HunspellRule)

Except for the languages that need compound support (currently only German), switching these languages to Morfologik-based spelling should be easy. German already uses Morfologik to create suggestions (see CompoundAwareHunspellRule.getSuggestions()), so the only thing that's missing is a Morfologik-based test to see if a word is valid. Idea:

  • Select the words that can occur at the start of a compound
  • Unmunch those, i.e., expand them into their inflected forms
  • Iterate over the result and add "+" to the end of each word (word+), removing any remaining flags that Hunspell leaves
  • now build a Morfologik dictionary with these words
  • same for middle parts (+word+) and for end parts (+word)
  • look up the word, as we already do:
    • If it's found, it's spelled correctly.
    • If it's not found, split the word into part1+ and +part2. If both are found in the compound dictionary, the compound is accepted. If part1+ is found, but not +part2, work on part2 recursively.
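The lookup described in the list above might look roughly like this. It is a sketch under assumptions: plain Python sets replace the Morfologik dictionaries, the unmunch step is assumed to have already produced the inflected forms, the names mark_parts and is_valid are hypothetical, and treating the head of a recursive step as a middle part (+word+) is one interpretation of "work on part2 recursively".

```python
from typing import Iterable, Set


def mark_parts(starts: Iterable[str],
               middles: Iterable[str],
               ends: Iterable[str]) -> Set[str]:
    """Build the compound dictionary entries: word+, +word+ and +word."""
    parts = {w + "+" for w in starts}
    parts |= {"+" + w + "+" for w in middles}
    parts |= {"+" + w for w in ends}
    return parts


def _split_ok(word: str, parts: Set[str], at_start: bool) -> bool:
    # Try every split point; the head must be a start part at the top level
    # and a middle part inside the recursion (an assumption of this sketch).
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        head_key = head + "+" if at_start else "+" + head + "+"
        if head_key not in parts:
            continue
        if "+" + tail in parts:
            return True  # part1+ and +part2 both found
        if _split_ok(tail, parts, at_start=False):
            return True  # part2 is itself a valid compound tail
    return False


def is_valid(word: str, full_words: Set[str], parts: Set[str]) -> bool:
    """A word is accepted if it is listed or can be split into known parts."""
    if word in full_words:
        return True
    return _split_ok(word, parts, at_start=True)
```

For example, with start part "kühl", middle part "schrank" and end parts "schrank", "schränke", "schranks" and "tür", both "kühlschränke" and "kühlschranktür" are accepted, while "türkühl" is rejected because "tür" is not listed as a start part.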

Note that we cannot just look at the compound-related flag in the dictionary and ignore the others; we actually need to inflect the words ("Kühlschränke", "Kühlschranks", …).

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License