Missing Features
This wiki has been moved to https://dev.languagetool.org - this page is archived and is not updated anymore
Here are some ideas for possible LanguageTool features. If you have questions, comments, or would like to suggest another task, please send a message to the forum. The ideas are roughly sorted by the amount of work they require, and within each category (short, medium, long), the tasks we consider more important come first.
Very simple tasks to help you get started
- See the GitHub bug tracker issues marked with "easy fix"
- Write a new error detection rule for your native language. See http://wiki.languagetool.org/development-overview. For examples of real-world errors, see English Error Collection and German Error Collection. Requires: a bit of knowledge of XML (which is easy to learn), no programming knowledge
Short Tasks
These are tasks that can probably be finished in a week or less.
- Fix a bug
- Go to our bug tracker, find a bug you can fix, and fix it! Look out for the "easy fix" tag which marks simple issues. Often requires: Java
- Fix a bug in a plugin
- Go to our bug tracker and fix a bug in the browser add-on
- Compare Morfologik speller with Hunspell
- Write a test that makes sure our Morfologik-based speller works as well as Hunspell (both for error detection and for making correction suggestions)
- Enable using multiple rule sets
- Enable using multiple rule sets (different XML files) to support custom sets that implement different style guides. For example, one could implement a Chicago Manual of Style checker that runs only on scientific papers: the user would activate the standard English rules along with the custom sets (see the sketch below).
- A more advanced version would enable loading the rules from a web-based repository of custom rules.
- Requires: Java, Contact: Marcin Miłkowski
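A minimal sketch of what loading an extra rule set next to the built-in rules could look like, assuming the current JLanguageTool and PatternRuleLoader APIs; chicago-style.xml is a hypothetical custom rule file using the grammar.xml syntax:

```java
import java.io.FileInputStream;
import java.util.List;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.Rule;
import org.languagetool.rules.patterns.PatternRuleLoader;

public class MultiRuleSetDemo {
  public static void main(String[] args) throws Exception {
    JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
    // load a hypothetical custom style-guide rule file in grammar.xml format
    List<? extends Rule> styleRules = new PatternRuleLoader()
        .getRules(new FileInputStream("chicago-style.xml"), "chicago-style.xml");
    for (Rule rule : styleRules) {
      lt.addRule(rule);  // the custom set now runs alongside the standard rules
    }
    System.out.println(lt.check("Some text to check with both rule sets."));
  }
}
```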
- Convert glossaries into terminology checks for translation quality assessment
- Build a packager that takes a glossary in CSV or tab-separated format and outputs a bitext XML rule (and also allows using the CSV directly): read the contents, tokenize and analyze them with LT, and build the rule in memory (optionally writing it to disk); see the sketch after this item
- Add target words from the glossary to the list of words ignored by the spell-checking rule
- In a more advanced version, this could also support TBX (an XML terminology format)
- Two interfaces: a UI (drop a file, answer a few questions, and get the resulting file) and a command-line interface
- This feature would be nicer if accompanied by the ability to load multiple rule sets
- Requires: Java, Contact: Marcin Miłkowski
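As a rough illustration of the packager idea, here is a sketch that turns a tab-separated glossary into rule XML. The element names in the output are only placeholders; a real implementation would follow LT's actual bitext rule schema and run the terms through LT's tokenizer and analyzer:

```java
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Reads "sourceTerm<TAB>targetTerm" lines and writes one rule per entry.
public class GlossaryToBitextRules {
  public static void main(String[] args) throws Exception {
    try (PrintWriter out = new PrintWriter("terminology-rules.xml", "UTF-8")) {
      out.println("<rules>");
      int id = 0;
      for (String line : Files.readAllLines(Paths.get("glossary.tsv"))) {
        String[] cols = line.split("\t");
        if (cols.length < 2) continue;
        out.printf("  <rule id=\"TERM_%d\">%n", id++);
        out.printf("    <srcPattern>%s</srcPattern>%n", escape(cols[0]));  // placeholder element
        out.printf("    <trgPattern>%s</trgPattern>%n", escape(cols[1]));  // placeholder element
        out.println("    <message>Use the approved term from the glossary.</message>");
        out.println("  </rule>");
      }
      out.println("</rules>");
    }
  }

  private static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }
}
```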
- Add TMX and XLIFF readers for bitext checking
- New classes for reading and writing TMX (possibly based on JAXB, and using XSLT to convert TMX to the current format) are needed to add real-world support for bitext checking
- Difficulty: moderately easy; requires just a bit of tweaking. The main challenge is supporting internal tags in XLIFF. Probably two kinds of XLIFF output would be needed: with corrections applied directly to the target text, and with corrections as comments.
- Requires: Java
- Contact: Marcin Miłkowski
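A minimal StAX sketch of the reading side, extracting the language and text of each <tuv>/<seg> pair from a TMX file; a real reader would pair source with target segments and handle inline tags inside <seg> (which getElementText() below does not):

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TmxReaderSketch {
  public static void main(String[] args) throws Exception {
    XMLStreamReader xml = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream("memory.tmx"));
    String lang = null;
    while (xml.hasNext()) {
      if (xml.next() == XMLStreamConstants.START_ELEMENT) {
        if ("tuv".equals(xml.getLocalName())) {
          // remember the language of the current translation unit variant
          lang = xml.getAttributeValue("http://www.w3.org/XML/1998/namespace", "lang");
        } else if ("seg".equals(xml.getLocalName())) {
          System.out.println(lang + ": " + xml.getElementText());
        }
      }
    }
  }
}
```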
- Port usable English rules from XML Copy Editor
- Rules from XML Copy Editor for English could be interesting for LT (see its source in the xmlcopyeditor-1.0.9.5/src/rulesets directory).
- Requires: Java, if the conversion is to be automated
- Use a modern framework for the embedded HTTP server
- Consider using e.g. http://sparkjava.com for our HTTP server - would this also support SSL/TLS?
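For illustration, a minimal Spark sketch of a check endpoint; Spark's secure() call (keystore path and password below are placeholders) suggests the SSL/TLS question can be answered positively, though this would need verification:

```java
import static spark.Spark.*;

public class SparkServerSketch {
  public static void main(String[] args) {
    port(8081);
    // secure("keystore.jks", "password", null, null);  // would enable HTTPS
    post("/v2/check", (req, res) -> {
      String text = req.queryParams("text");
      String language = req.queryParams("language");
      res.type("application/json");
      // a real server would run JLanguageTool here and serialize the matches
      return "{\"language\": \"" + language + "\", \"textLength\": "
          + (text == null ? 0 : text.length()) + "}";
    });
  }
}
```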
Medium-term tasks
- Improve Neural Network rules
- Make the rules cover more confusion pairs
- Use a larger training set (e.g. from Common Crawl) and see how that improves precision and recall
- As large datasets like Common Crawl are useful both for training and evaluation, find a way to index and query the data
- See https://forum.languagetool.org/t/neural-network-rules/2225 for details
- Extend AI approach
- We use neural networks to detect confusion between words. So far, the approach only considers 2 words of context in each direction. Extend this so that the complete sentence is considered, to better detect errors that involve long-range dependencies (see the windowing sketch below).
- Consider a seq2seq approach
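For reference, a toy sketch of the current fixed-window approach that this task proposes to extend; a whole-sentence or seq2seq model would replace this windowing:

```java
import java.util.Arrays;
import java.util.List;

public class ContextWindowSketch {
  // collect up to `size` tokens on each side of the candidate at `pos`
  static List<String> contextWindow(String[] tokens, int pos, int size) {
    int from = Math.max(0, pos - size);
    int to = Math.min(tokens.length, pos + size + 1);
    return Arrays.asList(Arrays.copyOfRange(tokens, from, to));
  }

  public static void main(String[] args) {
    String[] tokens = "I would like to see you their tomorrow evening".split(" ");
    // "their" at index 6 is a candidate for the their/there confusion pair
    System.out.println(contextWindow(tokens, 6, 2));  // [see, you, their, tomorrow, evening]
  }
}
```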
- Improve spell checker suggestions
- Suggestions for misspellings (or actually any suggestions in general) should consider the word's right and left context, as After the Deadline does (section "The Spelling Corrector"). A lot of data is needed for that, but the existing ngram data can be used. However, this ngram data needs to be combined with similarity data: words that are more similar to the original word should be preferred. Furthermore, we have data about which suggestions users actually select, and this should also be taken into account. Just like in After the Deadline, a neural net could learn how each factor needs to be weighted to get the best result (see the sketch below).
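A sketch of the proposed ranking, with hard-coded placeholder weights where a learned model would go; the three inputs stand for the ngram probability of the candidate in context, its similarity to the misspelled word, and the observed user selection rate:

```java
public class SuggestionRanker {
  // ngramProb: P(candidate | left and right context) from the ngram data
  // similarity: e.g. a normalized inverse edit distance to the misspelled word
  // userPickRate: how often users chose this suggestion in the past
  static double score(double ngramProb, double similarity, double userPickRate) {
    double wNgram = 0.5, wSim = 0.3, wUser = 0.2;  // placeholders for learned weights
    return wNgram * ngramProb + wSim * similarity + wUser * userPickRate;
  }

  public static void main(String[] args) {
    // two candidate corrections for "theri" in "at theri house":
    System.out.println("their: " + score(0.10, 0.9, 0.25));
    System.out.println("there: " + score(0.02, 0.9, 0.10));
  }
}
```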
- Improve Performance
- LanguageTool already uses several threads, but still doesn't always use 100% of the CPU even when busy. This could be optimized for the desktop use case.
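For context, LT already ships a multi-threaded checker; the task is making it saturate the CPU. A usage sketch, assuming the current MultiThreadedJLanguageTool API:

```java
import java.util.List;
import org.languagetool.MultiThreadedJLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.RuleMatch;

public class ParallelCheckDemo {
  public static void main(String[] args) throws Exception {
    MultiThreadedJLanguageTool lt = new MultiThreadedJLanguageTool(new AmericanEnglish());
    try {
      List<RuleMatch> matches = lt.check("A example text with a errors.");
      matches.forEach(m -> System.out.println(m.getMessage()));
    } finally {
      lt.shutdown();  // stop the worker threads
    }
  }
}
```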
- Enhance quality and speed of English chunking
- The English chunker is the slowest part of the English processing pipeline, probably because we need to run its POS tagger first. Check whether the statistical POS tagger can be replaced simply by adding more hand-crafted disambiguation rules (with, maybe, simple frequency voting for non-disambiguated tokens).
- See if rule-based chunking may be better; shallow rule-based parsing may be fairly robust.
- Contact: Marcin Miłkowski, Daniel Naber
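To make the rule-based alternative concrete, here is a toy regex-over-tags NP chunker; the tag set is Penn-style and the pattern deliberately simplistic, but it shows how far shallow hand-written patterns might get without a statistical model:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleBasedChunkerSketch {
  // optional determiner, any number of adjectives, one or more nouns
  private static final Pattern NP = Pattern.compile("(DT )?(JJ )*(NN[SP]* ?)+");

  static List<String> npSpans(String tagSequence) {
    List<String> spans = new ArrayList<>();
    Matcher m = NP.matcher(tagSequence);
    while (m.find()) {
      spans.add(m.group().trim());
    }
    return spans;
  }

  public static void main(String[] args) {
    // tags for "The quick brown fox jumps over the lazy dog"
    System.out.println(npSpans("DT JJ JJ NN VBZ IN DT JJ NN "));
    // prints [DT JJ JJ NN, DT JJ NN]
  }
}
```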
- Integrate some WordNet-based and VerbNet-based classifiers
- In some rules, we need to specify nouns that are hyponyms of "human being" to find incorrect uses of phrases. Create a lexicon extracted from English WordNet (as a finite-state machine) and add appropriate syntactic sugar to XML rules so that it can be used (e.g., an is-a="person" attribute); a sketch follows this item.
- Similarly, we need a lexicon of verbs followed by a that-clause (for an incomplete list, see: http://learnenglish.britishcouncil.org/en/english-grammar/verbs/verbs-followed-that-clause). Could this information be found in WordNet? If not, maybe use VerbNet?
- Contact: Marcin Miłkowski
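A sketch of how the is-a lookup could be wired up: the hyponyms of "person" would be extracted from WordNet once, offline, into a word list (compiled to a finite-state automaton in a real implementation); person-hyponyms.txt is a hypothetical output of that extraction:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class IsALexicon {
  private final Set<String> personHyponyms = new HashSet<>();

  IsALexicon(String listFile) throws IOException {
    // one word per line, e.g. "teacher", "violinist", "grandmother"
    Files.readAllLines(Paths.get(listFile))
        .forEach(w -> personHyponyms.add(w.trim().toLowerCase()));
  }

  // what an is-a="person" attribute in an XML rule would call into
  boolean isA(String word, String category) {
    return "person".equals(category) && personHyponyms.contains(word.toLowerCase());
  }
}
```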
- Write a plugin
- Plugins for TinyMCE and other rich text editors for CMS systems
- Drupal - similar to https://drupal.org/project/wysiwyg_grammar_checker
- ApSIC Xbench plugin (a very popular translation QA tool)
- JEdit plugin: similar to spell-checking plugins available for it already
- Scribus plugin
- QuarkXpress, Adobe Pagemaker integration
- Bitext check for placeables / numbers
- In translated text, formatting elements and numbers should be left alone or converted to other units. Create a rule that (a) aligns the formatting elements / numbers on a token level and (b) marks up the elements that were not successfully aligned; a sketch follows this item. Use Numbertext to align figures translated into words (e.g., 1 translated as "one").
- There is similar Java code in the translation QA tool CheckMate. It is also available under the LGPL, so one could reuse the code (or call the Okapi library).
- Contact: Marcin Miłkowski
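A sketch of steps (a) and (b): collect the numbers on each side of the bitext and flag those without a counterpart. A real rule would also normalize unit conversions and use Numbertext for spelled-out figures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberAlignmentSketch {
  private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)?");

  static List<String> numbers(String text) {
    List<String> result = new ArrayList<>();
    Matcher m = NUMBER.matcher(text);
    while (m.find()) result.add(m.group());
    return result;
  }

  public static void main(String[] args) {
    List<String> src = numbers("The package weighs 2.5 kg and costs 10 euros.");
    List<String> trg = numbers("Das Paket wiegt 2,5 kg und kostet 12 Euro.");
    // pair up numbers, tolerating decimal-separator differences
    trg.removeIf(n -> src.remove(n.replace(',', '.')) || src.remove(n));
    System.out.println("Only in source: " + src);  // [10]
    System.out.println("Only in target: " + trg);  // [12]
  }
}
```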
- Create an automatic extractor of rules based on transformation-based learning algorithm
- There is an existing prototype scripting code that takes a dump of the Wikipedia history and converts it into a corpus of errors. The error corpus may then be processed with the TBL rule-learning algorithm to automatically prepare rules similar to the ones used by LT. There are Java modules for TBL learning here and here. See also here for details of how it would work.
- For more background, see also Wikipedia Error Corpus.
- Requires: good command of Java, Contact: Marcin Miłkowski
- Add interfaces for Russian valency checking
- Add an abstract interface that would allow using our finite-state dictionaries for classifying words for purposes other than POS tagging (a valency lexicon).
- Add a new small dictionary with valency information for Russian (including participles and adjectives).
- Prepare 10 rules that use valency checks for Russian.
- Contact: Yakov Reztsov
- Take an orphaned language and make it state of the art
- Several languages no longer have active maintainers; we are looking for new maintainers and co-maintainers.
- The task consists of adding rules for a language, whether AI-based, statistics-based, XML-based, or written in Java.
- Requires: a very good command of the given language, knowledge of AI, XML, or Java
Long-term ideas
- Train a statistical tagger for English
- The standard statistical taggers obscure mistakes in the text because source corpora are tagged with the intended tags, not the ones that were actually used. We might try to train an HMM tagger (such as Jitar), which for English should get us to around 98% accuracy. But for this, we need to change the tagging of the Brown corpus: change the original "correct" tags to the ones found by the LT tagger dictionary (if there is a mismatch). For example, change places where "have" is tagged as "VBZ" to "VB". A sketch of this retagging pass follows this item.
- This requires a smart way to automatically retag the source corpus (to retain the disambiguation) and possibly some level of manual disambiguation as well. For this reason, this (otherwise easy) task may be time-consuming.
- If the method works, it may be applied to other languages as well to help with disambiguation. To test whether it does, we need to make sure that no rule in English (or any other language) is broken by the new tagger.
- Contact: Marcin Miłkowski
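A sketch of the retagging pass: keep the corpus tag only when LT's tagger dictionary also allows it for that word, otherwise substitute the dictionary's tag. lookupTags() is a hypothetical stand-in for a real query against the LT tagger dictionary:

```java
import java.util.List;

public class RetagCorpusSketch {
  // hypothetical helper: all tags LT's dictionary allows for this word
  static List<String> lookupTags(String word) {
    return word.equals("have") ? List.of("VB", "VBP") : List.of();
  }

  static String retag(String word, String goldTag) {
    List<String> allowed = lookupTags(word);
    if (allowed.isEmpty() || allowed.contains(goldTag)) {
      return goldTag;  // unknown word or no mismatch: keep the corpus tag
    }
    return allowed.get(0);  // mismatch: fall back to the dictionary's tag
  }

  public static void main(String[] args) {
    System.out.println(retag("have", "VBZ"));  // VB (corpus mismatch corrected)
    System.out.println(retag("have", "VB"));   // VB (kept)
  }
}
```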
- Integrate a dependency parser
- Some English rules would really benefit from deeper parsing and, in particular, from recognizing the subject of the sentence (this would be useful for agreement rules). MaltParser seems fast and nice, but its English model is based on the Penn Treebank, which is not completely free. It would therefore be necessary to train a new model, for example on the Copenhagen Treebank.
- Alternatively, some shallow parsing might be enough, for example to identify NPs and VPs, as well as the heads of expressions (and their grammatical features).
Unsorted Ideas
- See http://papyr.com/hypertextbooks/grammar/gramchek.htm and An Evaluation of Microsoft Word 97's Grammar Checker by Caroline Haist
- Checks for English by Proofread Bot: we could reimplement them (proofreadbot.com/fr/rules)