Developing Chinese rules

This wiki has been moved to https://dev.languagetool.org - this page is archived and is not updated anymore

This document describes how the Chinese rules for LanguageTool are developed.

Background

Microsoft Word has always been the first choice of word processor for most Chinese, Japanese and Korean (CJK) users. One reason why open-source alternatives such as OpenOffice and LibreOffice have not attracted them is that they lack grammar-checking components for East Asian languages. As a state-of-the-art proofreading tool, LanguageTool has proved very successful for Western languages. If LanguageTool supported CJK, it would clearly propel the wider adoption of LanguageTool as well as OpenOffice and LibreOffice, given the huge number of users in these countries.

It is no easy job, because there are substantial differences between Eastern and Western languages in grammar and syntax. For example, Chinese is famous for its traditional idioms, and Chinese words are not separated by spaces, so a word-splitting component (perhaps a Java rule) has to run in the Chinese language module before grammar checking; the same holds for the other CJK languages. Lessons on Chinese pattern rule creation learned from this project will benefit the development of other East Asian languages in the future. This work will open the door to the Eastern world and make LanguageTool more attractive to Japanese and Korean developers.

Chinese NLP Tool

ictclas4j was chosen as the Chinese NLP tool for Chinese rule development. It is both a Chinese word splitter and a POS tagger, and it is licensed under the Apache License 2.0.

The Chinese POS tags provided by ictclas4j are listed below; an example of tagged output follows the list:

  • a: Adjective
  • c: Conjunction
  • d: Adverb
  • e: Exclamation
  • f: Position
  • m: Numeral
  • n: Noun
  • nr: Person Name
  • ns: Place Name
  • o: Onomatopoeia
  • p: Preposition
  • q: Quantity
  • r: Pronoun
  • s: Space
  • t: Time
  • u: Auxiliary
  • v: Verb
  • vd: Verb used as Adverb
  • w: Punctuation
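
For illustration, a short sentence such as 我们终于实现了目标。 ("We finally achieved the goal.") would be split and tagged roughly as follows with the tag set above (shown in the conventional word/tag notation; the exact output format of ictclas4j may differ):

我们/r 终于/d 实现/v 了/u 目标/n 。/w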

Reference Books of Chinese Grammar

Modern Chinese grammar has been studied and summarised well by many Chinese language researchers, so it makes sense to reuse their work when studying Chinese grammar errors. Translating the grammar errors they have summarised into LanguageTool rules is the most efficient way of developing Chinese rules.

Two books were chosen as references:

  1. Liu Yuehua, Pan Wenyu, Gu Wei. Practical Modern Chinese Grammar [M]. The Commercial Press, 2001.
  2. Modern Chinese Grammar Analysis [M]. The Commercial Press.

Using "Chinese Grammar System" as clues, we also consult many books published by The Commercial Press, which contain index of Chinese grammar errors in the last 10 years' from "National Matriculation Examination(NME)". We keep most of rules from NME, because NME is one of the most important and official grammar reference in China, which collects the most common grammar mistakes. For other rules, we try to make sure that those mistakes are common. To guarantee this, we draw up two standards. The first standard is The rule belong to the intersection of any two references; The second is that we can find these kinds of mistakes in Chinese wikipedia.

Chinese Rule Groups

According to the references, we identified three groups of the most common Chinese grammar errors (a sketch of how they can be organised in the rule file follows the list):

  • Morphology Content Words (wa):
    • Chinese Part Of Speech (wa1)
    • Chinese Time, Position (wa2)
    • Modal Verb (wa3)
    • Quantifier (wa5)
    • Pronouns (wa6)
    • Overlapping Words (wa7)
    • Other Errors
  • Morphology Empty Words (wb):
    • Adverb (wb1)
    • Preposition (wb2)
    • Conjunction (wb3)
    • Particle (wb4)
  • Syntax (s):
    • Component Redundancy (s1)
    • Mismatches (s3)
    • Ambiguity (s5)

Rule Example Explanation

Example 1: Quantifier

This rule is the first rule in the rulegroup with id WA5. It shows how we deal with quantifiers.

<rule>
  <pattern>
    <token>二三</token>
  </pattern>
 
  <message> 应该为:<suggestion>两三</suggestion>。 </message>
  <short>数量词的运用</short>
  <example type="incorrect"><marker>二三</marker></example>
  <example type="correct">两三个</example>
</rule>

In Modern Chinese, "两三" means "two or three", and it's usually wrongly replaced by "二三". Although, "两" contains the same meaning as "二" in Chinese, it's not correct to use "二" and "三" together in Chinese grammar. With this rule, we match numeral, "二三". And suggest our users replacing it with "两三".

Example 2: Particle

This rule is the second rule in the rulegroup with id WB4. It shows how we deal with particles.

<rule>
  <pattern mark_from="1" mark_to="-1">
    <token postag_regexp="yes" postag="a|d"/>
    <token postag="u" skip="-1"></token>
    <token postag="v|vd" postag_regexp="yes"><exception regexp="yes">是|有</exception></token>
  </pattern>
  <message> 动词的修饰一般为‘形容词(副词)+地+动词’。您的意思是否是:\1<suggestion>地</suggestion>\3 </message>
  <short>‘地’的用法</short>
  <example type="incorrect"> 水渐渐<marker></marker>涨起来。 </example>
  <example type="correct">人慢慢地走过来了。</example>
</rule>

In Modern Chinese, a particle always connects an adverb (or adjective) to the verb it modifies; that particle is "地". Likewise, the particle connecting an adjective to a noun is "的". Many people are confused about this and write "的" where "地" is required. In a way, it resembles the confusion between "des" and "der" in the German genitive. To detect this error, we use three tokens: adjective|adverb + particle + verb, and suggest replacing the particle with "地". The skip="-1" attribute on the second token allows other words to stand between the particle and the verb. The exception on the third token excludes the verbs "是" and "有": without it the pattern would also match "adjective|adverb + 的 + 是|有", but our statistics show that in such sentences the adjective or adverb modifies an elided noun rather than "是" or "有" itself, so those matches would be false alarms.
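
Because of skip="-1" on the second token, the rule also matches when other words stand between the particle and the verb. For instance, in a (hypothetical) sentence such as 水渐渐的慢慢涨起来。 the matcher skips 慢慢 after the particle until it reaches the verb 涨.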

Example 3: Mismatches

This rule is the fourth rule in the rulegroup with id S32. It shows how we deal with mismatches.

<rule>
  <pattern mark_to="-1">
    <token postag="v" skip="-1">完成</token>
    <token regexp="yes">目标|理想</token>
  </pattern>
  <message> 与‘目标’‘理想’搭配的动词不应该是‘完成’,您可以使用‘<suggestion>实现</suggestion>’与之搭配。 </message>
  <short>动宾搭配不当</short>
  <example correction="实现" type="incorrect"> 我们<marker>完成</marker>了目标。 </example>
  <example type="correct">我们终于实现了目标。</example>
</rule>

Mismatches are the most common mistakes in Modern Chinese. This example shows a mismatch between a verb and its object noun. To detect it, we use two tokens: verb + noun; skip="-1" on the first token allows other words to stand between the verb and the noun (such as "了" in the incorrect example). "完成" means "complete" in English, while "目标" and "理想" mean "goal" and "ideal". In Chinese grammar it makes sense to say "实现" + "目标" (i.e. "achieve" a "goal"), not "完成" + "目标" (i.e. "complete" a "goal").
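
Thanks to skip="-1" on the first token, the rule also matches when more words stand between the verb and its object. For instance, in a (hypothetical) sentence such as 我们完成了伟大的理想。 the matcher skips 了, 伟大 and 的 after 完成 until it reaches 理想, which matches the regular expression 目标|理想.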

Rule Test and Evaluation

It is very important to test the Chinese rules and to evaluate how they perform in real-world situations. Fortunately, a fast rule evaluation tool is available: how-to-use-indexer-and-searcher-for-fast-rule-evaluation

We indexed a collection of 200,000 Chinese Wikipedia pages. The index helps us in the following ways:

  • To test rules with the Chinese tagger on Wikipedia pages and check for unexpected word-splitting and tagging results.
  • To discover new grammar error patterns in Wikipedia pages and then refine the grammar rules.
  • To compare real errors with false alarms triggered by the rules. We made sure that every rule has a precision (i.e. (real errors)/(false alarms + real errors)) greater than 90%. For the three examples in the previous section, we have:
    • Example 1: Quantifier, (real errors)/(false alarms + real errors) = 80/80 = 100.00%
    • Example 2: Particle, (real errors)/(false alarms + real errors) = 33/34 = 97.06%
    • Example 3: Mismatches, (real errors)/(false alarms + real errors) = 37/39 = 94.87%