Glossary
After the Deadline
After the Deadline is open-source language-checking software. Refer to http://open.afterthedeadline.com.
API
An API (application programming interface) is a specification that describes how different software components interact. Refer to http://foldoc.org/API.
AtD
AtD is an abbreviation of After the Deadline.
attribute
In an XML document, an attribute is a part of an element. An attribute gives additional information about an element or the content of an element.
automaton
Refer to finite-state automaton.
AWK
AWK is a programming language that is designed for text processing. Refer to http://en.wikipedia.org/wiki/AWK.
base form
The term base form is a synonym of the term lemma.
bilingual text
A bilingual text is a text that contains 2 languages. Segments (usually sentences) are 'aligned'. For example, in a spreadsheet, one column contains source sentences and another column contains translated sentences. Refer to http://wiki.languagetool.org/checking-translations-bilingual-texts.
bitext
Refer to bilingual text.
Brill tagger
A Brill tagger is a type of POS tagger. A Brill tagger uses a small set of language rules to assign a part of speech to a word. Refer to http://en.wikipedia.org/wiki/Brill_tagger and http://acl.ldc.upenn.edu/H/H92/H92-1022.pdf.
category
In the grammar.xml file, a category is an element that is used to put rules into groups. Refer to http://wiki.languagetool.org/development-overview.
chunk
A chunk (or phrase) is a group of one or more words that has a particular part of speech. Refer to http://wiki.languagetool.org/using-chunks. Refer also to noun chunk; verb chunk.
chunk tag
A chunk tag is a tag that specifies the part of speech that a chunk has. A chunk tag also gives other information such as whether a word in a chunk is at the start of the chunk. Refer to http://wiki.languagetool.org/using-chunks#toc1.
chunker
A chunker is software that partitions plain text into sequences of semantically related words (http://cogcomp.cs.illinois.edu/page/software_view/13). While a POS tagger only works on single words, the chunker uses the results of the POS tagger and covers sequences longer than one word.
committer
A committer is a person who can put changes to LanguageTool on the LanguageTool repository in GitHub. The changes can be to the software, the rules, or to both. Refer also to language maintainer.
configuration file
A configuration file is a file that stores a user's settings or preferences. In LanguageTool, the configuration file is .languagetool.cfg in the user's home directory.
Constraint Grammar
Constraint Grammar is a language-independent method of parsing text. Refer to http://www.ling.helsinki.fi/~fkarlsso/CG-book_1.pdf.
corpus query language
A corpus query language is a method for extracting text samples from a corpus of text. Refer to http://www.cl.uzh.ch/studies/theses/lic-master-theses/lizcharlottemerz.pdf.
curl
curl is command-line software that uses URL syntax to get or to send files. Refer to http://curl.haxx.se/docs/faq.html#What_is_cURL.
disambiguation.xml
disambiguation.xml is the LanguageTool file that contains the rules for a disambiguator. Not all languages have a disambiguation.xml file. Refer to http://wiki.languagetool.org/developing-a-disambiguator.
disambiguator
A disambiguator is software that tries to determine the part of speech that a word has in a particular context. (In many languages, a word can have more than one part of speech. For example, in English, the word help is both a noun and a verb. In the sentence, "Give me help," the word help is a noun.) In LanguageTool, if a language has a rule-based disambiguator, then the rules are in the disambiguation.xml file.
Eclipse
Eclipse is software that helps software developers to write programs in the Java programming language. Refer to http://www.eclipse.org.
element
An element is part of an XML document. Each XML document contains a hierarchy of elements. Each element represents the structure of some information. For example, in the grammar.xml file, each rule element contains information about a particular grammar rule. An element can have one or more attributes.
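As a minimal sketch, the following Python code reads an element and its attributes from a small XML fragment. The fragment is simplified for illustration and is not a complete grammar.xml rule. The id, name, and message values are invented.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the attribute values and the message are invented.
FRAGMENT = """
<category name="Example category">
  <rule id="EXAMPLE_RULE" name="Example rule">
    <message>Example message.</message>
  </rule>
</category>
"""

root = ET.fromstring(FRAGMENT)        # the category element is the root
rule = root.find("rule")              # rule is a child element of category

element_name = rule.tag               # the name of the element
rule_attributes = dict(rule.attrib)   # the attributes of the element
message_text = rule.find("message").text

print(element_name, rule_attributes, message_text)
```

The category element contains the rule element, and the rule element has the attributes id and name.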
false friend
False friends are a pair of words or phrases in 2 different languages that look or sound almost the same, but which have different meanings. For example, the English word gift means poison in German.
filter (verb)
In the context of disambiguation, to filter means to remove all but one specified POS tag. Refer to http://wiki.languagetool.org/developing-a-disambiguator#toc4.
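The following Python sketch shows the idea of filtering. It is not the LanguageTool implementation. The tags NN and VB, the function name, and the choice to keep all readings if no reading has the specified tag are assumptions for this example.

```python
def filter_readings(readings, keep_tag):
    """Remove all readings except those that have the POS tag keep_tag.

    Assumption for this sketch: if no reading has keep_tag,
    the readings are returned without changes.
    """
    kept = [reading for reading in readings if reading[1] == keep_tag]
    return kept if kept else list(readings)


# Each reading is a (lemma, POS tag) pair. The tags are hypothetical.
help_readings = [("help", "NN"), ("help", "VB")]

# In "Give me help", the word help is a noun, so only the NN reading stays.
print(filter_readings(help_readings, "NN"))
```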
finite-state automaton (FSA)
A finite-state automaton is a device that can be in one of a finite number of states. In certain conditions, the FSA can change to a different state. Refer to http://en.wikipedia.org/wiki/Finite-state_machine and to http://csunplugged.org/finite-state-automata.
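As an illustration, the following Python sketch is a small finite-state automaton that accepts the strings ab, abab, ababab, and so on. The state names and the input alphabet are invented for this example and are not related to the automata that LanguageTool uses.

```python
# Each entry maps (current state, input symbol) to the next state.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "accept",
    ("accept", "a"): "seen_a",
}
ACCEPTING_STATES = {"accept"}


def accepts(text):
    """Return True if the automaton ends in an accepting state."""
    state = "start"
    for symbol in text:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False          # no transition exists for this input
        state = TRANSITIONS[key]
    return state in ACCEPTING_STATES


print(accepts("abab"), accepts("aba"))
```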
finite-state machine (FSM)
Refer to finite-state automaton.
FSA
Refer to finite-state automaton.
GitHub
GitHub is a website that helps software developers to collaborate, to review software code, and to manage software code. The software code for LanguageTool is on https://github.com/languagetool-org/languagetool/.
grammar.xml
grammar.xml is the LanguageTool file that contains the error detection rules that LanguageTool uses in its evaluation of text. Refer to http://wiki.languagetool.org/development-overview.
GUI
A GUI is a 'graphical user interface'. Typically, a screen shows small images, boxes in which to enter text, and buttons to click. Refer to http://en.wikipedia.org/wiki/Graphical_user_interface.
Hunspell
Hunspell is a spell checker and a morphological analyzer. Hunspell is designed for languages that have rich morphology, complex compounds, and complex character encoding. Refer to http://sourceforge.net/projects/hunspell/ and to http://hunspell.sourceforge.net.
immunize
To immunize means to prevent LanguageTool from matching one or more words. Refer to http://wiki.languagetool.org/developing-a-disambiguator#toc9.
inflected (adj)
Refer to inflection.
inflection
An inflection is a form of a particular lemma. For example, in English, the lemma break has the inflections break, breaks, broke, broken, and breaking.
Internationalization Tag Set (ITS)
The Internationalization Tag Set (ITS) is a technology that helps people to create XML that is internationalized and that can be localized effectively. Refer to http://www.w3.org/TR/2007/REC-its-20070403/ and to http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html.
ITS
Refer to Internationalization Tag Set (ITS).
Java
Java is a programming language. LanguageTool is written in Java. Refer to http://www.java.com.
Javadoc
Javadoc is software from Oracle that uses comments in the source code to create API documentation in HTML format. Refer to http://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html.
language code
A language code is a code that represents a language, and possibly, a variant of a language. For example, the language code for German is de. Different standards for language codes exist. LanguageTool uses IETF language tags.
language maintainer
A language maintainer is a committer who maintains the rules in LanguageTool for one or more languages and who translates the LanguageTool GUI.
lemma
A lemma is the 'base form' of a word or of a group of words that is one lexical unit (lexeme). English examples: help, laser printer, put up with.
LGPL
LGPL (GNU Lesser General Public License) is a licence to use software. Refer to http://www.gnu.org/licenses/lgpl.html.
Lucene
Lucene is a search engine library that LanguageTool uses internally for some of the Wikipedia-related features. Refer to http://lucene.apache.org.
markup language
A markup language is a computer language that uses special text to specify the structure or the style of the content of a document. Refer to element; XML.
Maven
Maven is a 'build automation tool' for software developers. Maven compiles and packages software, and manages dependencies. Refer to http://maven.apache.org.
Morfologik
Morfologik is a stemming library that LanguageTool uses for dictionary lookups. Morfologik compresses large text dictionaries into small binary files that give fast word lookup. Refer to https://github.com/morfologik/morfologik-stemming.
noun chunk (noun phrase)
A noun chunk is one or more words that act as a noun. Refer to http://wiki.languagetool.org/using-chunks. LanguageTool identifies noun chunks using a chunker. Refer to http://www.inf.ed.ac.uk/teaching/courses/icl/lectures/chunk.2up.pdf.
noun phrase
Refer to noun chunk.
part-of-speech tag
Refer to POS tag.
part-of-speech tagger
Refer to POS tagger.
POS
POS is an abbreviation for part of speech.
POS tag (part-of-speech tag)
A POS tag is a tag that identifies the possible parts of speech that a word has. Usually, in LanguageTool, a part-of-speech tag also includes other information, such as whether a noun is singular or plural.
POS tagger
A POS tagger is software that annotates (tags) words with part-of-speech information. A POS tagger gives each word one or more POS tags. Refer to http://wiki.languagetool.org/developing-a-tagger-dictionary.
property file
[?]
reading
A reading is one possible interpretation of a word, for example, one possible part of speech. Many English words can be, for example, a verb or a noun, depending on their context, so they have two readings. Example: "walk" can be a verb ("I walk home") or a noun ("I took a walk"). Refer to http://wiki.languagetool.org/developing-a-disambiguator.
regular expression
A regular expression is text that specifies a search pattern. Refer to http://www.regular-expressions.info.
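For example, the following Python sketch uses a regular expression to find a word that is immediately repeated. The pattern and the function name are chosen for this illustration and do not come from a LanguageTool rule.

```python
import re

# \b(\w+) matches a whole word, and \1 matches the same word again after a space.
REPEATED_WORD = re.compile(r"\b(\w+) \1\b")


def find_repeated_words(text):
    """Return each word that is immediately followed by itself."""
    return REPEATED_WORD.findall(text)


print(find_repeated_words("This is is a test"))
```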
rulegroup
A rulegroup is an element in the grammar.xml file. If more than one rule is necessary to find an error, you can put all the applicable rules into a rulegroup. Refer to http://wiki.languagetool.org/development-overview#toc9.
shallow parser
The term shallow parser is a synonym for the term chunker.
skip (verb)
To skip means optionally to ignore one or more tokens in a sequence of tokens. For example, possibly you want to ignore a punctuation mark if it exists in a sequence of text. Refer to http://wiki.languagetool.org/development-overview#toc13.
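The following Python sketch shows one way to think about skipping. It is not the LanguageTool implementation. The function name and the pattern format, a list of (word, maximum number of tokens to skip after the word) pairs, are assumptions for this example.

```python
def matches_at(tokens, start, pattern):
    """Return True if pattern matches tokens, beginning at index start.

    pattern is a list of (word, max_skip) pairs. max_skip is the number
    of tokens that can be ignored between this word and the next word.
    """
    def match_from(position, index):
        if index == len(pattern):
            return True                      # an empty pattern always matches
        if position >= len(tokens):
            return False
        word, max_skip = pattern[index]
        if tokens[position] != word:
            return False
        if index == len(pattern) - 1:
            return True                      # the last pattern word matched
        # Try to ignore 0, 1, ..., max_skip tokens before the next pattern word.
        return any(
            match_from(position + 1 + gap, index + 1)
            for gap in range(max_skip + 1)
        )

    return match_from(start, 0)


tokens = ["however", ",", "we", "agree"]
print(matches_at(tokens, 0, [("however", 1), ("we", 0)]))  # the comma is skipped
print(matches_at(tokens, 0, [("however", 0), ("we", 0)]))  # no skip is permitted
```

Because the skip is optional, the first pattern also matches the tokens however, we without the comma.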
SRX
SRX (Segmentation Rules eXchange) is a method of specifying the segmentation rules that software uses to split text into segments. (Typically, a segment is equivalent to a sentence.) Refer to http://www.gala-global.org/oscarStandards/srx/srx20.html.
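The following Python sketch shows the general idea of segmentation rules: each rule has a pattern for the text before a position, a pattern for the text after that position, and a flag that says whether to break there. The first rule that matches at a position is applied. The rules, the tuple format, and the function name are simplified assumptions. The sketch does not read an SRX file.

```python
import re

# Each rule: (break here?, pattern before the position, pattern after the position).
# The rules are applied in order. The first matching rule decides.
RULES = [
    (False, r"\b(?:Mr|Dr)\.", r"\s"),   # do not break after these abbreviations
    (True, r"[.?!]", r"\s"),            # break after sentence-final punctuation
]


def segment(text):
    """Split text into segments according to RULES."""
    segments = []
    start = 0
    for position in range(1, len(text)):
        for is_break, before, after in RULES:
            before_matches = re.search("(?:" + before + ")$", text[:position])
            after_matches = re.match(after, text[position:])
            if before_matches and after_matches:
                if is_break:
                    segments.append(text[start:position])
                    start = position
                break                    # only the first matching rule applies
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


print(segment("Dr. Smith arrived. He sat down."))
```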
tagger
Refer to POS tagger.
testrules.sh/testrules.bat
testrules is software that you can use to make sure that the rules that you write in LanguageTool are correct. Refer to http://wiki.languagetool.org/development-overview.
Tika
Tika™ is software that detects and extracts metadata (data about data) and structured text content from documents. Tika is supplied by Apache. Refer to http://tika.apache.org.
token
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html). Typically, in LanguageTool, a token is equivalent to a word. However, punctuation marks are also tokens.
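As a simplified illustration, the following Python sketch splits text into word tokens and punctuation tokens. The pattern and the function name are assumptions for this example. The LanguageTool tokenizers are language-specific and behave differently.

```python
import re

# \w+ matches a word. [^\w\s] matches one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return the words and the punctuation marks in text as separate tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Give me help, please."))
```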
Transifex
Transifex is an online translation-management system. The translations for LanguageTool are on https://www.transifex.com/projects/p/languagetool/.
unification
Unification is the matching of sequences of tokens that have some specified features. Refer to http://wiki.languagetool.org/using-unification.
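The following Python sketch shows one way to check that a sequence of tokens agrees on a specified feature. It is not the LanguageTool implementation. The feature name number, the data format (a list of readings for each token), and the function name are assumptions for this example.

```python
def unify(tokens, feature):
    """Return the values of feature that every token in the sequence can have.

    Each token is a list of readings. Each reading is a dict of features.
    An empty result means that the tokens do not agree on the feature.
    """
    common = None
    for readings in tokens:
        values = {reading[feature] for reading in readings if feature in reading}
        common = values if common is None else common & values
    return common or set()


these = [{"number": "plural"}]
this = [{"number": "singular"}]
sheep = [{"number": "singular"}, {"number": "plural"}]   # two readings
books = [{"number": "plural"}]

print(unify([these, sheep], "number"))   # the tokens agree: plural
print(unify([this, books], "number"))    # the tokens do not agree
```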
unify
Refer to unification.
verb chunk
[In LT, what is a verb chunk? http://en.wikipedia.org/wiki/Verb_phrase]
verb phrase
Refer to verb chunk.
XML
XML is a markup language. Refer to http://xml.coverpages.org/xml.html. In LanguageTool, the language rules are specified using XML. Refer to disambiguation.xml; grammar.xml.