Glossary
This wiki has been moved to https://dev.languagetool.org - this page is archived and is not updated anymore
- After the Deadline
- After the Deadline is open-source language-checking software. Refer to http://open.afterthedeadline.com.
- API
- An API (application programming interface) is a specification that describes how different software components interact. Refer to http://foldoc.org/API.
- AtD
- AtD is an abbreviation of After the Deadline.
- attribute
- In an XML document, an attribute is a part of an element. An attribute gives additional information about an element or the content of an element.
- automaton
- Refer to finite-state automaton.
- AWK
- AWK is a programming language that is designed for text processing. Refer to http://en.wikipedia.org/wiki/AWK.
- base form
- The term base form is a synonym of the term lemma.
- bilingual text
- A bilingual text is a text that contains 2 languages. Segments (usually sentences) are 'aligned'. For example, in a spreadsheet, one column contains source sentences and another column contains translated sentences. Refer to http://wiki.languagetool.org/checking-translations-bilingual-texts.
- bitext
- Refer to bilingual text.
- Brill tagger
- A Brill tagger is a type of POS tagger. A Brill tagger uses a small set of language rules to assign a part of speech to a word. Refer to http://en.wikipedia.org/wiki/Brill_tagger and http://acl.ldc.upenn.edu/H/H92/H92-1022.pdf.
- category
- In the grammar.xml file, a category is an element that is used to put rules into groups. Refer to http://wiki.languagetool.org/development-overview.
- chunk
- A chunk (or phrase) is a group of one or more words that has a particular part of speech. Refer to http://wiki.languagetool.org/using-chunks. Refer also to noun chunk; verb chunk.
- chunk tag
- A chunk tag is a tag that specifies the part of speech that a chunk has. A chunk tag also gives other information such as whether a word in a chunk is at the start of the chunk. Refer to http://wiki.languagetool.org/using-chunks#toc1.
- chunker
- A chunker is software that partitions plain text into sequences of semantically related words (http://cogcomp.cs.illinois.edu/page/software_view/13). While a POS tagger only works on single words, the chunker uses the results of the POS tagger and covers sequences longer than one word.
- committer
- A committer is a person who can put changes to LanguageTool on the LanguageTool repository in GitHub. The changes can be to the software, the rules, or to both. Refer also to language maintainer.
- configuration file
- A configuration file is a file that stores a user's settings or preferences. In LanguageTool, the configuration file is .languagetool.cfg in the user's home directory.
- Constraint Grammar
- Constraint Grammar is a language-independent method of parsing text. Refer to http://www.ling.helsinki.fi/~fkarlsso/CG-book_1.pdf.
- corpus query language
- A corpus query language is a method for extracting text samples from a corpus of text. Refer to http://www.cl.uzh.ch/studies/theses/lic-master-theses/lizcharlottemerz.pdf.
- curl
- curl is command-line software that uses URL syntax to get or to send files. Refer to http://curl.haxx.se/docs/faq.html#What_is_cURL.
- disambiguation.xml
- disambiguation.xml is the LanguageTool file that contains the rules for a disambiguator. Not all languages have a disambiguation.xml file. Refer to http://wiki.languagetool.org/developing-a-disambiguator.
- disambiguator
- A disambiguator is software that tries to determine the part of speech that a word has in a particular context. (In many languages, a word can have more than one part of speech. For example, in English, the word help is both a noun and a verb. In the sentence, "Give me help," the word help is a noun.) In LanguageTool, if a language has a rule-based disambiguator, then the rules are in the disambiguation.xml file.
- Eclipse
- Eclipse is software that helps software developers to write programs in the Java programming language. Refer to http://www.eclipse.org.
- element
- An element is part of an XML document. Each XML document contains a hierarchy of elements. Each element represents the structure of some information. For example, in the grammar.xml file, each rule element contains information about a particular grammar rule. An element can have one or more attributes.
- false friend
- False friends are a pair of words or phrases in 2 different languages that look or sound almost the same, but which have different meanings. For example, the English word gift means poison in German.
- filter (verb)
- In the context of disambiguation, to filter means to remove all but one specified POS tag. Refer to http://wiki.languagetool.org/developing-a-disambiguator#toc4.
- finite-state automaton (FSA)
- A finite-state automaton is a device that, at any time, is in exactly one of a finite number of states. In certain conditions, the FSA can change to a different state. Refer to http://en.wikipedia.org/wiki/Finite-state_machine and to http://csunplugged.org/finite-state-automata.
- finite-state machine (FSM)
- Refer to finite-state automaton.
- FSA
- Refer to finite-state automaton.
- GitHub
- GitHub is a website that helps software developers to collaborate, to review software code, and to manage software code. The software code for LanguageTool is on https://github.com/languagetool-org/languagetool/.
- grammar.xml
- grammar.xml is the LanguageTool file that contains the error detection rules that LanguageTool uses in its evaluation of text. Refer to http://wiki.languagetool.org/development-overview.
- GUI
- A GUI is a 'graphical user interface'. Typically, a screen shows small images, boxes in which to enter text, and buttons to click. Refer to http://en.wikipedia.org/wiki/Graphical_user_interface.
- Hunspell
- Hunspell is a spell checker and a morphological analyzer. Hunspell is designed for languages that have rich morphology, complex compounds, and complex character encoding. Refer to http://sourceforge.net/projects/hunspell/ and to http://hunspell.sourceforge.net.
- immunize
- To immunize means to prevent LanguageTool from matching one or more words. Refer to http://wiki.languagetool.org/developing-a-disambiguator#toc9.
- inflected (adj)
- Refer to inflection.
- inflection
- An inflection is a form of a particular lemma. For example, in English, the lemma break has the inflections break, breaks, broke, broken, and breaking.
- Internationalization Tag Set (ITS)
- The Internationalization Tag Set (ITS) is a technology that helps people to create XML that is internationalized and that can be localized effectively. Refer to http://www.w3.org/TR/2007/REC-its-20070403/ and to http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html.
- ITS
- Refer to Internationalization Tag Set (ITS).
- Java
- Java is a programming language. LanguageTool is implemented in Java. Refer to http://www.java.com.
- Javadoc
- Javadoc is software from Oracle that uses comments in the source code to create API documentation in HTML format. Refer to http://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html.
- language code
- A language code is a code that represents a language, and possibly, a variant of a language. For example, the language code for German is de. Different standards for language codes exist. LanguageTool uses IETF language tags.
- language maintainer
- A language maintainer is a committer who maintains the rules in LanguageTool for one or more languages and who translates the LanguageTool GUI.
- lemma
- A lemma is the 'base form' of a word or of a group of words that is one lexical unit (lexeme). English examples: help, laser printer, put up with.
- LGPL
- LGPL (GNU Lesser General Public License) is a licence to use software. Refer to http://www.gnu.org/licenses/lgpl.html.
- Lucene
- Lucene is a search engine library that LanguageTool uses internally for some of the Wikipedia-related features. Refer to http://lucene.apache.org.
- markup language
- A markup language is a computer language that uses special text to specify the structure or the style of the content of a document. Refer to element; XML.
- Maven
- Maven is a 'build automation tool' for software developers. Maven compiles and packages software, and manages dependencies. Refer to http://maven.apache.org.
- Morfologik
- Morfologik is a stemming library that LanguageTool uses for dictionary lookups. Morfologik compresses large text dictionaries into small binary files that allow fast word lookup. Refer to https://github.com/morfologik/morfologik-stemming.
- noun chunk (noun phrase)
- A noun chunk is one or more words that act as a noun. LanguageTool identifies noun chunks using a chunker. Refer to http://wiki.languagetool.org/using-chunks and to http://www.inf.ed.ac.uk/teaching/courses/icl/lectures/chunk.2up.pdf.
- noun phrase
- Refer to noun chunk.
- part-of-speech tag
- Refer to POS tag.
- part-of-speech tagger
- Refer to POS tagger.
- POS
- POS is an abbreviation for part of speech.
- POS tag (part-of-speech tag)
- A POS tag is a tag that identifies the possible parts of speech that a word has. Usually, in LanguageTool, a part-of-speech tag also includes other information, such as whether a noun is singular or plural.
- POS tagger
- A POS tagger is software that annotates (tags) words with part-of-speech information. A POS tagger gives each word one or more POS tags. Refer to http://wiki.languagetool.org/developing-a-tagger-dictionary.
- property file
- [?]
- reading
- A reading is one possible interpretation of a word, such as one of its possible parts of speech. Many English words can be, for example, a verb or a noun, depending on their context, so they have two readings. Example: "walk" can be a verb ("I walk home") or a noun ("I took a walk"). Refer to http://wiki.languagetool.org/developing-a-disambiguator.
- regular expression
- A regular expression is text that specifies a search pattern. Refer to http://www.regular-expressions.info.
- rulegroup
- A rulegroup is an element in the grammar.xml file. If more than one rule is necessary to find an error, you can put all the applicable rules into a rulegroup. Refer to http://wiki.languagetool.org/development-overview#toc9.
- shallow parser
- The term shallow parser is a synonym for the term chunker.
- skip (verb)
- To skip means optionally to ignore one or more tokens in a sequence of tokens. For example, possibly you want to ignore a punctuation mark if it exists in a sequence of text. Refer to http://wiki.languagetool.org/development-overview#toc13.
- SRX
- SRX (Segmentation Rules eXchange) is a method of specifying the segmentation rules that software uses to split text into segments. (Typically, a segment is equivalent to a sentence.) Refer to http://www.gala-global.org/oscarStandards/srx/srx20.html.
- tagger
- Refer to POS tagger.
- testrules.sh/testrules.bat
- testrules is software that you can use to make sure that the rules that you write in LanguageTool are correct. Refer to http://wiki.languagetool.org/development-overview.
- Tika
- Tika™ is software that detects and extracts metadata (data about data) and structured text content from documents. Tika is supplied by Apache. Refer to http://tika.apache.org.
- token
- A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html). Typically, in LanguageTool, a token is equivalent to a word. However, punctuation marks are also tokens.
- Transifex
- Transifex is an online translation-management system. The translations for LanguageTool are on https://www.transifex.com/projects/p/languagetool/.
- unification
- Unification is the matching of sequences of tokens that have some specified features. Refer to http://wiki.languagetool.org/using-unification.
- unify
- Refer to unification.
- verb chunk
- [In LT, what is a verb chunk? http://en.wikipedia.org/wiki/Verb_phrase]
- verb phrase
- Refer to verb chunk.
- XML
- XML is a markup language. Refer to http://xml.coverpages.org/xml.html. In LanguageTool, the language rules are specified using XML. Refer to disambiguation.xml; grammar.xml.
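
Several terms in this glossary (token, reading, POS tag, POS tagger, disambiguator, filter) describe stages of one pipeline. The Python sketch below shows how they might fit together, using the "walk" example from the reading entry. It is a toy illustration only and not LanguageTool's implementation: LanguageTool is written in Java and takes its rules from disambiguation.xml. The lexicon, the Penn Treebank style tag names, the function names, and the single disambiguation rule are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicon: each word maps to all of its possible POS tags.
# Each tag is one reading. Tag names are illustrative (NN = noun, VB = verb,
# DT = determiner, PRP = pronoun, VBD = past-tense verb, RB = adverb).
LEXICON = {
    "i": ["PRP"],
    "took": ["VBD"],
    "a": ["DT"],
    "walk": ["NN", "VB"],
    "home": ["NN", "RB"],
}


def tokenize(text):
    """Split text into tokens; punctuation marks are tokens too."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Toy POS tagger: give each token all of its readings.

    Tokens that are not in the lexicon get an empty list of tags.
    """
    return [(tok, list(LEXICON.get(tok.lower(), []))) for tok in tokens]


def disambiguate(tagged):
    """Toy disambiguator with one rule.

    If the previous token is unambiguously a determiner and the current
    token has a noun reading, filter the token down to that noun reading.
    """
    result = []
    prev_tags = []
    for tok, tags in tagged:
        if prev_tags == ["DT"] and "NN" in tags:
            tags = ["NN"]  # filter: remove all but the specified POS tag
        result.append((tok, tags))
        prev_tags = tags
    return result


if __name__ == "__main__":
    for tok, tags in disambiguate(tag(tokenize("I took a walk."))):
        print(tok, tags)
```

In this sketch, "walk" has two readings after tagging and keeps only its noun reading after disambiguation, because it follows the determiner "a". A real disambiguator needs many more rules; for example, this one rule leaves "walk" in "I walk home" ambiguous.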
page revision: 9, last edited: 24 Sep 2020 07:43