Note: this is an archived page
We'd like to change the way LT handles part-of-speech (POS) tags. Why? As a reminder, this is what the POS tags look like for English:
NN Noun, singular or mass
NN:U Mass noun
Because the tags are short codes with optional suffixes, you need regular expressions to express a pattern like "a determiner or a noun". It looks like this: DT|NN.* This is a problem for several reasons:
- rule contributors need to know the tag names
- rule contributors need to know regular expressions
- we're using regular expressions where they're not really needed
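To make the current situation concrete, here is a minimal Python sketch of regex-based tag matching. The helper name `tag_matches` is hypothetical, and requiring the pattern to match the whole tag is an assumption for illustration; it is not LT's actual matching code.

```python
import re

def tag_matches(pattern: str, pos_tag: str) -> bool:
    """Return True if the regular expression matches the entire POS tag."""
    return re.fullmatch(pattern, pos_tag) is not None

# "a determiner or a noun", written the way a rule contributor has to write it today
pattern = "DT|NN.*"

for tag in ["DT", "NN", "NN:U", "NNS", "VBP"]:
    print(tag, tag_matches(pattern, tag))
```

The contributor has to know both that nouns start with NN and that `.*` is needed to catch NN:U and NNS.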
The idea is to use more verbose names, maybe like this:
- instead of DT, use: pos=determiner
- instead of NNS use: pos=noun, number=plural
This way we don't need regular expressions (except a way to express 'or'), and these tag names could be used in XML as well as in the new online rule editor.
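A rough Python sketch of what matching on verbose tags could look like. The function `token_matches`, the dictionary representation, and the use of `|` as the 'or' separator are assumptions for illustration only, not a settled design.

```python
def token_matches(pattern: dict, features: dict) -> bool:
    """Check a token's features against a pattern.

    Every attribute named in the pattern must be present on the token,
    and its value must be one of the '|'-separated alternatives.
    """
    for name, allowed in pattern.items():
        if features.get(name) not in allowed.split("|"):
            return False
    return True

# Patterns a rule contributor might write (hypothetical syntax)
plural_noun = {"pos": "noun", "number": "plural"}
det_or_noun = {"pos": "determiner|noun"}

# Features of a token such as "houses"
houses = {"pos": "noun", "number": "plural"}

print(token_matches(plural_noun, houses))
print(token_matches(det_or_noun, houses))
```

Attributes the pattern does not mention are ignored, so `det_or_noun` matches nouns of any number.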
We could either switch to the new POS tags completely, i.e. modify the dictionaries to contain them, or we could introduce a mapping/interpretation so that the dictionary information gets translated to the new tags after lookup. The latter seems more promising because:
- no need to touch the binary dictionaries
- the binary dictionaries use a compact representation instead of a verbose one, which helps keep them small (it's not clear how much of a difference this makes)
- we can migrate slowly, i.e. the old way of addressing tags keeps working (probably forever)
The drawback of a mapping/interpretation is that it requires some processing for each lookup, e.g. a lookup in a hash map. It only needs to be done once per token though, so this shouldn't be a problem.
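The mapping approach could be sketched in Python as follows. The table contents, the `Token` class, and the choice to return an empty dictionary for unknown tags are all assumptions made for illustration; the real mapping would be defined per language.

```python
# Hypothetical mapping from compact dictionary tags to verbose features.
# Only two entries are shown; a real table would cover the whole tagset.
TAG_MAP = {
    "DT": {"pos": "determiner"},
    "NNS": {"pos": "noun", "number": "plural"},
}

class Token:
    def __init__(self, word: str, pos_tag: str):
        self.word = word
        self.pos_tag = pos_tag      # compact tag as stored in the dictionary
        self._features = None       # verbose features, filled on first access

    @property
    def features(self) -> dict:
        # The hash map lookup happens once per token; later accesses reuse the result.
        if self._features is None:
            self._features = TAG_MAP.get(self.pos_tag, {})
        return self._features

token = Token("houses", "NNS")
print(token.pos_tag)    # the old tag stays available for existing rules
print(token.features)   # the new verbose view
```

Because the compact tag stays on the token, existing rules that address tags the old way keep working alongside the new ones.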
- What exactly should the new POS tags look like?
- Answer: it depends on the language, but tags that will be shared by many languages are: pos, person, case, number, gender, tense etc.
- Can and should ISOcat (http://www.isocat.org) be used?
- How do we express tags like VBP, i.e. verb, non-3rd person singular present? Note that this is not the infinitive, although in English it often has the same form.
- Answer: we don't, at least not as a single tag. For VBP, we can use <token pos="verb" person="1|2" number="sg" en:tense="present"/> or <token pos="verb" person="3" number="pl" en:tense="present"/>
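To see which person/number combinations the two token patterns above spell out, here is a small Python sketch that expands the `|` alternatives. The `expand` helper and the dictionary form of the patterns are hypothetical; only the attribute names and values come from the XML above.

```python
from itertools import product

def expand(pattern: dict) -> list:
    """Expand '|'-separated alternatives into one dict per concrete combination."""
    names = list(pattern)
    choices = [pattern[name].split("|") for name in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]

# The two <token> patterns for VBP, written as attribute dictionaries
vbp_patterns = [
    {"pos": "verb", "person": "1|2", "number": "sg", "en:tense": "present"},
    {"pos": "verb", "person": "3", "number": "pl", "en:tense": "present"},
]

readings = [reading for pattern in vbp_patterns for reading in expand(pattern)]
covered = sorted((r["person"], r["number"]) for r in readings)
print(covered)  # [('1', 'sg'), ('2', 'sg'), ('3', 'pl')]
```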