Missing Features

Here are some ideas for possible LanguageTool features. If you have questions or comments, or would like to suggest another task, please send a message to the forum.

The ideas are roughly sorted by the amount of work they require, and within each category (short, medium, long) the tasks we consider more important come first.

There is also a separate list of very simple tasks to help you get started.

Short Tasks

These are tasks that can probably be finished in a week or less.

  • Fix a bug
    • Go to our bug tracker, find a bug you can fix, and fix it! Look out for the "easy fix" tag, which marks simple issues. Often requires: Java
  • Fix a bug in a plugin
    • Go to the bug tracker for the Microsoft Office integration and fix the remaining bugs in the MS Office plugin
  • Compare Morfologik speller with Hunspell
    • Write a test that makes sure our Morfologik-based speller works as well as Hunspell, both for error detection and for making correction suggestions (a test harness sketch follows this list)
  • Enable using multiple rule sets
    • Enable using multiple rule sets (different XML files) to implement custom sets for different style guides. For example, one could implement a Chicago Manual of Style checker that is run only on scientific papers: the user would activate the standard English rules along with the custom sets (a loading sketch follows this list).
    • A more advanced version would enable loading the rules from a web-based repository of custom rules.
    • Requires: Java, Contact: Marcin Miłkowski
  • Add more rules for second-language learners
    • False-friend rules can easily be used to detect specific mistakes made by non-native speakers, and these are not necessarily simply lexical.
    • A more advanced version of this task would also involve creating an app (web app or mobile app) that uses LT rules to help non-native speakers study a foreign language.
    • Requires: XML, and in the advanced version, some web or mobile framework
  • Convert glossaries into terminology checks for translation Quality Assessment
    • Build a packager that takes a glossary in CSV or tab-separated format and outputs a bitext XML rule (and also allows using the CSV directly): read the contents, tokenize and analyze them with LT, and build the rule in memory (optionally writing it to disk for reuse)
    • Add target words from the glossary to the list of words ignored by the spell-checking rule
    • A more advanced version could also support TBX (an XML terminology exchange format)
    • Two interfaces: a UI (drop in a file, answer a few questions, and get the rule file) and a command line
    • This feature would be nicer if accompanied by the ability to load multiple rule sets (see above); a converter sketch follows this list
    • Requires: Java, Contact: Marcin Miłkowski
  • Add UGTag to support tagging in Ukrainian
    • Add support for the Ukrainian tagger UGtag. It is already written in Java, so this should be quite easy, if not trivial (an adapter sketch follows this list).
    • The task also includes translating the UI into Ukrainian.
    • Contact: Andriy Rysin
  • New German rule
    • "Vergleichs" vs "Vergleiches" etc.: rule that checks that only one variant per document is used
    • Requires: Java
    • Contact: Daniel Naber
  • Create a general mechanism to store rule parameters
    • Some rules could take user-set parameters (such as a sensitivity level), but we have no general way to store these in configuration files.
    • Devise a way to store and retrieve rule parameters from configuration files (a storage sketch follows this list).
    • Requires: Java, Contact: Marcin Miłkowski
  • Add TMX and XLIFF readers for bitext checking
    • New classes for reading and writing TMX (possibly based on JAXB, and using XSLT to convert TMX to the current format) are needed to add real-world support for bitext checking (a reader sketch follows this list)
    • Difficulty: moderately easy; it requires just a bit of tweaking. The only real difficulty is supporting inline tags in XLIFF. Probably two kinds of XLIFF output would be needed: one with corrections applied directly to the target text, and one with corrections as comments.
    • Requires: Java
    • Contact: Marcin Miłkowski
  • Port usable English rules from XML copy editor
    • Rules from XML Copy Editor for English could be interesting for LT (see its source in the xmlcopyeditor-1.0.9.5/src/rulesets directory).
    • Requires: Java, if the conversion is to be automatic
  • Automatic conversion of LightProof rules
    • The rules for the LightProof checker are available for some languages (French, Hungarian). They are a subset of what LT can do, so automatic conversion to XML rules should be easy.
    • Requires: Java
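
The Morfologik/Hunspell comparison could start from a harness like the following minimal sketch. It uses the public JLanguageTool API; errors.tsv is a hypothetical evaluation file of "misspelling<TAB>expected correction" pairs, and the idea is to run it twice, once per speller rule.

    import java.nio.file.*;
    import java.util.*;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.AmericanEnglish;
    import org.languagetool.rules.RuleMatch;

    public class SpellerComparison {
      public static void main(String[] args) throws Exception {
        JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
        // Run once with only the Morfologik rule enabled and once with only
        // the Hunspell rule enabled (via lt.disableRule(...)), then compare
        // the detection and suggestion rates of the two runs.
        for (String line : Files.readAllLines(Paths.get("errors.tsv"))) {
          String[] cols = line.split("\t");  // misspelling, expected correction
          List<RuleMatch> matches = lt.check(cols[0]);
          boolean detected = !matches.isEmpty();
          boolean goodSuggestion = detected
              && matches.get(0).getSuggestedReplacements().contains(cols[1]);
          System.out.printf("%s: detected=%b, correct suggestion=%b%n",
              cols[0], detected, goodSuggestion);
        }
      }
    }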
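
For the multiple-rule-sets task, a first version could load each user-selected style-guide file with LT's pattern rule loader and add its rules at run time. A minimal sketch, assuming the PatternRuleLoader API of recent LT sources (exact signatures may differ between versions; style-chicago.xml is a made-up file name):

    import java.io.FileInputStream;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.AmericanEnglish;
    import org.languagetool.rules.Rule;
    import org.languagetool.rules.patterns.PatternRuleLoader;

    public class CustomRuleSets {
      public static void main(String[] args) throws Exception {
        JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
        // Load a user-supplied style-guide file in grammar.xml format and
        // activate its rules on top of the standard English rules:
        PatternRuleLoader loader = new PatternRuleLoader();
        for (Rule rule : loader.getRules(
            new FileInputStream("style-chicago.xml"), "style-chicago.xml")) {
          lt.addRule(rule);
        }
        System.out.println(lt.check("Some text to be checked."));
      }
    }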
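
The core of the glossary packager is reading the tab-separated glossary and emitting one bitext rule per entry. In the sketch below the XML is written by hand, and the element layout is only an assumption about the bitext rule schema; the real converter would also tokenize and analyze the terms with LT, as described above.

    import java.nio.file.*;
    import java.util.List;

    public class GlossaryToBitextRules {
      public static void main(String[] args) throws Exception {
        // glossary.tsv (hypothetical): "source term<TAB>required target term" per line
        List<String> lines = Files.readAllLines(Paths.get("glossary.tsv"));
        StringBuilder xml = new StringBuilder("<rules>\n");
        int id = 0;
        for (String line : lines) {
          String[] cols = line.split("\t");
          // The element layout below is an assumption about the bitext rule
          // schema; multi-word terms would first have to be tokenized with LT.
          xml.append("  <rule id=\"TERM_").append(id++).append("\">\n")
             .append("    <source><token>").append(escape(cols[0])).append("</token></source>\n")
             .append("    <target><token negate=\"yes\">").append(escape(cols[1])).append("</token></target>\n")
             .append("  </rule>\n");
        }
        xml.append("</rules>\n");
        Files.write(Paths.get("bitext-terms.xml"), xml.toString().getBytes("UTF-8"));
      }

      private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
      }
    }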
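
The UGtag integration is essentially an adapter that implements LanguageTool's org.languagetool.tagging.Tagger interface on top of UGtag; it would also have to map UGtag's tagset to the one the Ukrainian XML rules expect. The sketch below only shows the shape of that adapter; the UGtag calls in the comments are assumptions about its API, not real signatures.

    import java.util.*;

    public class UkrainianTaggerSketch {

      /** Minimal stand-in for LT's AnalyzedToken (token + lemma + POS tag). */
      static class Reading {
        final String token, lemma, posTag;
        Reading(String token, String lemma, String posTag) {
          this.token = token;
          this.lemma = lemma;
          this.posTag = posTag;
        }
      }

      /** Shaped like Tagger.tag(): one list of readings per input token. */
      public List<List<Reading>> tag(List<String> sentenceTokens) {
        List<List<Reading>> result = new ArrayList<>();
        for (String token : sentenceTokens) {
          List<Reading> readings = new ArrayList<>();
          // for (UgTagAnalysis a : UgTag.analyze(token)) {   // assumed UGtag API
          //   readings.add(new Reading(token, a.getLemma(), a.getTag()));
          // }
          if (readings.isEmpty()) {
            readings.add(new Reading(token, token, null));    // unknown word
          }
          result.add(readings);
        }
        return result;
      }
    }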
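
For the German genitive rule, the interesting part is the cross-sentence state: remember which variant of each stem was seen first and flag the other one. A stripped-down sketch of that core logic, outside of LT's Rule API; a real rule would restrict the pattern to known genitive noun forms using the tagger:

    import java.util.*;
    import java.util.regex.*;

    public class GenitiveVariantChecker {
      // Naively matches genitives like "Vergleichs"/"Vergleiches"; group 1 is the stem.
      private static final Pattern GENITIVE = Pattern.compile("\\b(\\p{Lu}\\p{L}+?)(e?s)\\b");

      public static List<String> check(String document) {
        Map<String, String> firstVariant = new HashMap<>();  // stem -> "s" or "es"
        List<String> warnings = new ArrayList<>();
        Matcher m = GENITIVE.matcher(document);
        while (m.find()) {
          String stem = m.group(1);
          String ending = m.group(2);
          String seen = firstVariant.putIfAbsent(stem, ending);
          if (seen != null && !seen.equals(ending)) {
            warnings.add("Inconsistent genitive: '" + m.group()
                + "' vs. earlier '" + stem + seen + "'");
          }
        }
        return warnings;
      }

      public static void main(String[] args) {
        System.out.println(check("Das Ergebnis des Vergleichs ... wegen des Vergleiches ..."));
      }
    }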
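
One simple storage scheme for rule parameters is namespaced keys in a Java properties file, e.g. rule.<RULE_ID>.<parameter>=value. A minimal sketch; the key scheme and the rule ID in the comment are only suggestions:

    import java.io.*;
    import java.util.Properties;

    public class RuleParams {
      private final Properties props = new Properties();

      public void load(File configFile) throws IOException {
        try (Reader r = new FileReader(configFile)) {
          props.load(r);
        }
      }

      /** Looks up a key like "rule.STYLE_REPEATED_WORD.sensitivity" (hypothetical ID). */
      public String get(String ruleId, String param, String defaultValue) {
        return props.getProperty("rule." + ruleId + "." + param, defaultValue);
      }

      public void set(String ruleId, String param, String value) {
        props.setProperty("rule." + ruleId + "." + param, value);
      }

      public void save(File configFile) throws IOException {
        try (Writer w = new FileWriter(configFile)) {
          props.store(w, "LanguageTool rule parameters");
        }
      }
    }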
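
A TMX reader can be built on the JDK's StAX parser: TMX stores each translation unit as a <tu> element with per-language <tuv><seg> children. A sketch that extracts source/target pairs for bitext checking, assuming two <tuv> elements per <tu> with the source language first (error handling omitted):

    import java.io.FileInputStream;
    import javax.xml.stream.*;

    public class TmxReader {
      public static void main(String[] args) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream("sample.tmx"));
        String lang = null, source = null, sourceLang = null;
        while (r.hasNext()) {
          int event = r.next();
          if (event == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("tuv")) {
            // TMX 1.4 uses xml:lang on <tuv>
            lang = r.getAttributeValue("http://www.w3.org/XML/1998/namespace", "lang");
          } else if (event == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("seg")) {
            String text = r.getElementText();
            if (source == null) {            // first <tuv> of the <tu> = source
              source = text;
              sourceLang = lang;
            } else {                         // second <tuv> = target: hand the pair
              System.out.println(sourceLang + " -> " + lang + ": "
                  + source + " | " + text);  // ...to the bitext checker here
              source = null;
            }
          }
        }
      }
    }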

Medium-term tasks

  • Enhance quality and speed of English chunking
    • The English chunker is the slowest part of the English processing pipeline, possibly because we need to run its POS tagger first. Check whether the statistical POS tagger can be replaced simply by adding more hand-crafted disambiguation rules (perhaps with simple frequency voting for tokens that remain ambiguous).
    • See whether rule-based chunking might work better; shallow rule-based parsing can be fairly robust.
    • Contact: Marcin Miłkowski, Daniel Naber
  • Integrate some WordNet-based and VerbNet-based classifiers
    • In some rules, we need to specify nouns that are hyponyms of "human being" to find incorrect uses of phrases. Create a lexicon extracted from English WordNet (as a finite-state machine) and add appropriate syntactic sugar to the XML rules so that it can be used there (e.g., an attribute is-a="person"); a lookup sketch follows this list.
    • Similarly, we need a lexicon of verbs followed by a that-clause (for an incomplete list, see: http://learnenglish.britishcouncil.org/en/english-grammar/verbs/verbs-followed-that-clause). Could this information be found in WordNet? If not, maybe use VerbNet?
    • Contact: Marcin Miłkowski
  • Improve spell checker suggestions
    • Suggestions for misspellings should take the word's left and right context into account, as After the Deadline does (see its section "The Spelling Corrector"). A lot of data is needed for that, so it will have to be compressed efficiently (or this could become a server-only feature). A re-ranking sketch follows this list.
  • Bitext check for placeables / numbers
    • In translated text, formatting elements and numbers should either be left alone or converted to other units. Create a rule that (a) aligns the formatting elements / numbers on the token level and (b) marks up the elements that could not be aligned successfully (a sketch follows this list). Use Numbertext to align figures translated into words (i.e., 1 translated as "one").
    • There is similar Java code in the translation QA tool CheckMate. It is also available under the LGPL, so the code could be reused (or the Okapi library could be called).
    • Contact: Marcin Miłkowski
  • Rule priority/severity and register
    • This could simplify configuration, but the configuration dialog is hard to change, so the config dialog needs to be changed first.
    • Contact: Marcin Miłkowski
  • Add interfaces for Russian valency checking
    • Add an abstract interface that would allow using our finite-state dictionaries to classify words for purposes other than POS tagging (a valency lexicon); an interface sketch follows this list.
    • Add a new small dictionary with valency information for Russian (including participles and adjectives).
    • Prepare 10 rules that use valency checks for Russian.
    • Contact: Yakov Reztsov
  • Take an orphaned language and make it state of the art
    • Several languages no longer have active maintainers; we are looking for new maintainers and co-maintainers.
    • There are even languages that have complete tooling but lack rules, such as Czech.
    • The task consists of adding at least 250 rules for a language.
    • Requires: a very good command of the given language, knowledge of XML and Java
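
At run time, the is-a="person" check from the WordNet task is just a membership test against a precompiled list of hyponyms of "human being". In LT this would live in a finite-state dictionary, but a plain set shows the idea; person-hyponyms.txt is a hypothetical file extracted from WordNet offline:

    import java.nio.file.*;
    import java.util.*;

    public class IsALexicon {
      private final Set<String> persons = new HashSet<>();

      public IsALexicon(Path wordList) throws Exception {
        // one lemma per line, extracted offline from WordNet's hyponym tree
        for (String lemma : Files.readAllLines(wordList)) {
          persons.add(lemma.toLowerCase());
        }
      }

      /** Would back an is-a="person" attribute on <token> in the XML rules. */
      public boolean isPerson(String lemma) {
        return persons.contains(lemma.toLowerCase());
      }

      public static void main(String[] args) throws Exception {
        IsALexicon lexicon = new IsALexicon(Paths.get("person-hyponyms.txt"));
        System.out.println(lexicon.isPerson("teacher"));
      }
    }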
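
The context-sensitive suggestion idea boils down to re-ranking the speller's candidates with a simple language model over the neighboring words, roughly as After the Deadline does. A sketch of the re-ranking step; the tiny in-memory bigram table stands in for the compressed corpus data mentioned above:

    import java.util.*;

    public class ContextRanker {
      // Tiny in-memory stand-in for a compressed corpus bigram store.
      private static final Map<String, Long> COUNTS =
          Map.of("a basket", 120L, "basket of", 95L);

      static long bigramCount(String w1, String w2) {
        return COUNTS.getOrDefault(w1 + " " + w2, 0L);
      }

      /** Re-ranks spelling candidates by how well they fit between left and right. */
      static List<String> rank(String left, List<String> candidates, String right) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingLong(
            (String c) -> bigramCount(left, c) + bigramCount(c, right)).reversed());
        return ranked;
      }

      public static void main(String[] args) {
        // "basket" should outrank "casket" in the context "a _ of":
        System.out.println(rank("a", List.of("casket", "basket"), "of"));
      }
    }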
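
For the placeables/numbers rule, a first version of steps (a) and (b) can treat the numbers of each segment as a multiset and report whatever cannot be matched; the Numbertext lookup for spelled-out figures would then be layered on top. A sketch:

    import java.util.*;
    import java.util.regex.*;

    public class NumberAlignmentCheck {
      private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)?");

      static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
          numbers.add(m.group().replace(',', '.'));  // crude decimal normalization
        }
        return numbers;
      }

      /** Returns the numbers that appear in the source but not in the target. */
      static List<String> unaligned(String source, String target) {
        List<String> missing = extract(source);
        for (String n : extract(target)) {
          missing.remove(n);  // multiset difference
        }
        return missing;
      }

      public static void main(String[] args) {
        System.out.println(unaligned("Release 2.5 fixes 14 bugs.", "Release 2.5 behebt 4 Fehler."));
        // -> [14]; a Numbertext lookup would additionally match "1" against "one" etc.
      }
    }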
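
The abstract interface for the Russian valency task could be as small as a word-to-features lookup that a finite-state dictionary implements behind the scenes. A minimal sketch; the names and the feature vocabulary below are suggestions only:

    import java.util.Set;

    /** Classifies words along dimensions other than POS, e.g. valency frames. */
    interface WordClassifier {
      /** e.g. getFeatures("руководить") -> {"instrumental"} in a valency lexicon */
      Set<String> getFeatures(String word);
    }

    /** A rule could then check government like this: */
    class ValencyCheckSketch {
      static boolean requiresInstrumental(WordClassifier valency, String verb) {
        return valency.getFeatures(verb).contains("instrumental");
      }
    }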

Long-term ideas

  • Train a statistical tagger for English
    • The standard statistical taggers obscure mistakes in the text because source corpora are tagged with the tags of the intended words, not of the words actually used. We might try to train an HMM tagger (such as Jitar), which for English should get us to around 98% accuracy. But for this, we need to change the tagging of the Brown corpus: replace the original "correct" tags with the ones found in the LT tagger dictionary (where there is a mismatch). For example, change places where "have" is tagged as "VBZ" to "VB" (a retagging sketch follows this list).
    • This requires a smart way to automatically retag the source corpus (to retain the disambiguation) and possibly some level of manual disambiguation as well. For this reason, this (otherwise easy) task may be time-consuming.
    • If the method works, it may be applied to other languages as well to help with disambiguation. To test whether it does, we need to make sure that no rule in English (or any other language) is broken by the new tagger.
    • Contact: Marcin Miłkowski
  • Integrate a dependency parser
    • Some English rules would really benefit from deeper parsing, in particular from recognizing the subject of the sentence (this would be useful for agreement rules). MaltParser seems to be fast and nice, but its English model is based on the Penn Treebank, which is not completely free, so a new model would have to be trained, for example on the Copenhagen Treebank.
    • Alternatively, some shallow parsing might suffice, for example to identify NPs and VPs, as well as the heads of expressions (and their grammatical features).
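
The corpus retagging step from the statistical-tagger task is mechanical once the LT dictionary lookup is in place. A sketch of the loop over a Brown-style word/TAG corpus; lookupTags stands in for the LT tagger dictionary, and ambiguous replacements are exactly the cases that need manual review:

    import java.util.*;

    public class CorpusRetagger {
      /** Stand-in for a lookup in the LT tagger dictionary. */
      static Set<String> lookupTags(String word) {
        // e.g. for "have" the LT dictionary only knows VB/VBP, never VBZ
        if (word.equals("have")) {
          return new LinkedHashSet<>(List.of("VB", "VBP"));
        }
        return Set.of();
      }

      /** Retags one "word/TAG word/TAG ..." line of a Brown-style corpus. */
      static String retagLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String item : line.split(" ")) {
          int slash = item.lastIndexOf('/');
          String word = item.substring(0, slash);
          String tag = item.substring(slash + 1);
          Set<String> known = lookupTags(word);
          if (!known.isEmpty() && !known.contains(tag)) {
            tag = known.iterator().next();  // mismatch: fall back to a dictionary
          }                                 // tag; ambiguous cases need manual review
          out.append(word).append('/').append(tag).append(' ');
        }
        return out.toString().trim();
      }

      public static void main(String[] args) {
        System.out.println(retagLine("they/PPSS have/VBZ arrived/VBN"));
        // -> "they/PPSS have/VB arrived/VBN"
      }
    }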
