Using the Rule Converter GUI

How to start it

After building the code base with Ant, run:

java -jar RuleConverterGUI.jar

Usage

Loading files

Files can be loaded either by typing the full file path into the "Rule file:" pane near the bottom or by choosing "Open" from the File menu. The program tries to detect the file type automatically; if that doesn't work, select the proper rule type from the combo box. To convert the file, press "Convert" or Alt+c. You can then navigate through the generated rules with Ctrl+Up/Down.

Checking rule coverage

The "Check rule coverage" button generates examples and checks for existing rule coverage for all converted rules. "Check displayed rule coverage" just checks the displayed rule. Checking all the rules sometimes takes a while because the RuleCoverage class occasionally has to generate examples by iterating through the dictionary.

For some of the generated After the Deadline rules, the generated examples aren't long enough to trigger the existing LT rules. For example, an AtD rule "there're .*/NN" generates the pattern sequence:

<token>there</token>
<token>'</token>
<token>re</token>
<token postag="NN|NN:UN?" postag_regexp="yes"></token>

However, there's already an LT rule on file (THERE_RE_MANY) that has an almost equivalent pattern sequence:

<token>there</token>
<token>'</token>
<token>re</token>
<token postag="NN:.*|NN|NNP" postag_regexp="yes"><exception postag="NNS|NNPS|JJ.*|DT" postag_regexp="yes"></exception></token>
<token><exception postag="NNS|NNPS" postag_regexp="yes"></exception></token>

Checking the rule coverage for the converted AtD rule generates a sequence like "there're banana", which won't be caught by the existing LT rule because that rule expects an additional token at the end. The "Toggle extra tokens" button therefore widens the scope of the coverage check by putting blank tokens at the beginning and end of the pattern. In this case, the new converted pattern would be caught by the existing LT rule. This trick can produce false alarms, most often from WORD_REPEAT_RULE. Since checking an individual rule is usually pretty fast, I usually just recheck the rule a few times to see whether it's already covered.
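The effect of the extra tokens can be sketched in a few lines of Python. This is only an illustration, not RuleConverterGUI's actual code: the function name pad_pattern is hypothetical, the snippet assumes exactly one empty token is added on each side, and the tokens are wrapped in a <pattern> element so they can be parsed as one document.

```python
import xml.etree.ElementTree as ET


def pad_pattern(pattern_xml):
    """Return the pattern with one empty <token/> added at each end.

    Hypothetical helper: one blank token per side is an assumption.
    """
    pattern = ET.fromstring(pattern_xml)
    pattern.insert(0, ET.Element("token"))  # blank token before the pattern
    pattern.append(ET.Element("token"))     # blank token after the pattern
    return ET.tostring(pattern, encoding="unicode")


# The converted AtD pattern from the example above.
original = (
    "<pattern>"
    "<token>there</token><token>'</token><token>re</token>"
    '<token postag="NN|NN:UN?" postag_regexp="yes"></token>'
    "</pattern>"
)

padded_tokens = ET.fromstring(pad_pattern(original)).findall("token")
print(len(padded_tokens))  # 6
```

With the padding, generated examples have a word after the noun, so they are long enough to reach the fifth token of THERE_RE_MANY.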

If a rule is flagged as covered only because of a false alarm from an existing LT rule, like WORD_REPEAT_RULE, you can remove the "covered" flag by choosing "Remove covering rules" from the Edit menu. The rule will then be written out with the non-covered rules.

Displaying rules

Rule Converter produces both disambiguation and grammar rules. Which kind is produced depends on the format of the original rule file. For example, constraint grammar files only produce disambiguation rules, while After the Deadline files produce regular rules for everything except false alarm (filter=kill or die) rules. The two kinds can be toggled with the "Show regular/disambiguation rules" check boxes. You can also toggle the display of rules with/without warnings and, after the rules have been checked for existing rule coverage (see above), with/without existing coverage.

Editing rules

The displayed rule in the "Converted rule:" pane can be edited. Changes that are typed into the text pane are not saved until "Save rule" (Ctrl+s or Alt+s) is pressed. Changes that are generated (rule exclusivity, extra tokens, or default="off") are saved automatically. You can also edit the entire file before it is written by checking "Edit rules before writing" and then choosing "Write rules to file".

The original rule file can actually be opened and edited within RuleConverterGUI. Press Ctrl+W or select "Show original rule file" from the File menu. Clicking "Save" saves your changes, so be careful!

After the Deadline

There are a few caveats for converting After the Deadline rules.

Tagging

After the Deadline uses a Brill-like single-tag tagger, i.e. it doesn't keep around all the possible tags for a given word like LT does. Rules like the above example ("there're NN") make much more sense in the AtD framework, where there'll be far fewer false alarms. To avoid the glut of false alarms that can arise from directly converting these rules to LT format, you can make the POS tags exclusive ("Make rule exclusive" button). So

<token postag="NN"/>

would become

<token postag="NN"><exception postag="NN" negate_pos="yes"/></token>

Unfortunately, this could make new rules much less useful. False alarms would be drastically diminished in the "there're" example, but the new rule now wouldn't catch phrases like "There're mouse in the cabinets" because "mouse" can also be a verb. This problem can really only be resolved through hand-tuning.
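The trade-off can be made concrete with a simplified model of how a token with several possible tags is matched. This is a sketch under stated assumptions, not LT's matching code: the function names are hypothetical, the tag sets for "banana" and "mouse" are illustrative rather than taken from a real dictionary, and an exclusive token is modelled as one where every reading carries the required tag.

```python
def matches_plain(readings, postag):
    """Plain <token postag="...">: any one reading with the tag is enough."""
    return postag in readings


def matches_exclusive(readings, postag):
    """Exclusive token: the tag must be present and no other tag may appear.

    Models <exception postag="..." negate_pos="yes"/> as "reject the word
    if any reading has a different tag" (a simplifying assumption).
    """
    return postag in readings and all(tag == postag for tag in readings)


# Illustrative tag sets (assumed, not looked up in a real tagger dictionary).
readings = {
    "banana": {"NN"},
    "mouse": {"NN", "VB"},  # "mouse" can also be a verb
}

for word, tags in readings.items():
    print(word, matches_plain(tags, "NN"), matches_exclusive(tags, "NN"))
```

Under this model "banana" matches both variants, while "mouse" matches only the plain token, which is why the exclusive rule would miss "There're mouse in the cabinets".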

Apostrophes

LT tokenizes apostrophes as separate tokens, while AtD does not. Translating between the two can throw off the numbering of match elements in the suggestions. For example, the AtD rule "shouldn't|couldn't|wouldn't .*/VBN::word=\0 \1:base" produces the suggestion

Did you mean <suggestion><match no="1"/> <match no="2" postag="VB" /></suggestion>?

Though this would probably convey the general point of the correct grammar, the proper suggestion would be

Did you mean <suggestion><match no="1"/><match no="2"/><match no="3"/> <match no="4" postag="VB" /></suggestion>?

since the contractions are split into three separate tokens.
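The renumbering amounts to a running offset over the AtD tokens. The following sketch shows the arithmetic only; the function name is hypothetical, and the number of LT tokens per AtD token is passed in by hand rather than computed by a real tokenizer.

```python
def lt_match_numbers(lt_token_counts):
    """Map each AtD token index (0-based, as in \\0 and \\1) to the
    1-based LT match numbers it occupies after tokenization.

    lt_token_counts[i] is how many LT tokens AtD token i splits into.
    """
    mapping = []
    next_no = 1  # LT match numbers start at 1
    for count in lt_token_counts:
        mapping.append(list(range(next_no, next_no + count)))
        next_no += count
    return mapping


# The contraction (\0) becomes three LT tokens; the VBN word (\1) stays one.
mapping = lt_match_numbers([3, 1])
print(mapping)  # [[1, 2, 3], [4]]
```

So \0 corresponds to match numbers 1 to 3 and \1 to match number 4, as in the corrected suggestion above.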

File types and where to get them

After the Deadline rule files can be found in AtD's source code. I suggest you check out the latest snapshot from the repository. All the rules can be found in data/rules/. Some of the files in this directory (e.g. definitions.txt) don't contain rules, but auxiliary information used by the rules and the general AtD engine. The main rules are in the grammar/ directory.

There are two After the Deadline file types: "default" and "avoid". Default files are the main rule files, as found in the grammar/ directory. Avoid files are simple databases of tab-separated pairs (phrase \t suggestion), like avoiddb.txt or biasdb.txt.
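Reading an avoid file comes down to splitting each line on the tab. The sketch below assumes one "phrase \t suggestion" entry per line and skips lines without a tab; the function name and the two sample lines are made up for illustration and are not taken from avoiddb.txt or biasdb.txt.

```python
def parse_avoid_lines(lines):
    """Turn "phrase<TAB>suggestion" lines into (phrase, suggestion) pairs."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        phrase, tab, suggestion = line.partition("\t")
        if not tab or not phrase.strip():
            continue  # blank line or no tab: not an entry (assumed behaviour)
        entries.append((phrase.strip(), suggestion.strip()))
    return entries


# Invented sample data, in the assumed one-entry-per-line layout.
sample = [
    "in order to\tto\n",
    "\n",
    "at this point in time\tnow\n",
]

entries = parse_avoid_lines(sample)
print(entries)  # [('in order to', 'to'), ('at this point in time', 'now')]
```

For a real file, the same function can be fed an open file object, for example parse_avoid_lines(open(path, encoding="utf-8")), where the encoding is also an assumption.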

Constraint grammar

Constraint grammar rules haven't been fully converted like After the Deadline rules, since we're waiting for a new language that needs disambiguation, like Icelandic, to test this out. However, you can see its capability by checking out some CG files from the translation software Apertium. Full installation of Apertium is pretty involved, but you can check it out here. You should get a specific language pair that has associated constraint grammar files (like Icelandic-English, check out here). The CG file here is apertium-is-en.is-en.rlx. It's a medium-sized CG file, much smaller than the Norwegian Nynorsk CG (52,000 lines) but larger than the Bulgarian CG (73 lines; four rules).

Currently the conversion is pretty buggy. Some CG rules need to be split into multiple LT rules in order to be represented. A good test run is converting the Bulgarian CG file, where you can see that all the rules are converted correctly.
