Sorcery in Indian language spelling

April 28, 2007

Introduction

Spell-checking software has always had the aura of black magic, which is probably why another meaning of "spell" is the enchantment cast by a magician. The best-known and most widely used open-source spell-checking engine is aspell. Thanks to the needs of some folk at an Indian-language search site, we had a chance to delve into the guts of aspell in order to customise it for Hindi.

Relevant aspell features

aspell has many useful features that made it well-suited to our task:

  • Unicode (UTF-8) support: Via translation to an internal 8-bit format.
  • Support for phonetic features: Invaluable for Indian languages, which are spelt phonetically.
  • Other features: Allow tailoring to new languages. Includes items like affix rules, replacement tables, keyboard layout specifications, etc.
  • Good documentation: Was important in trying to figure out the internal details of aspell.

Customisation for Hindi

It turns out that no immediate changes were required to the aspell code. Instead, it was sufficient to customise various aspell features for Hindi. The chief of these were:

  • Phonetic rules: aspell offers several ways of including phonetic information about a language. They range from simple methods, which work less well, to complex techniques, which take a fair amount of time but pay off in the quality of the spell-checking. As phonetic information is expected to be important for Indian languages, we use the comprehensive, table-driven phonetic code mechanism. The rules allow specifying, for example, that क sounds like ख, क like क्, कि like की, कु like कू, etc. This turns out to be the single most important factor in improving the performance for Hindi.
  • Optimisation of internal settings: Various aspell quantities, such as the sounds-like weight, the costs for edit-distance calculations, etc., were presumably optimised for English, and we did an extensive study to re-optimise them for Hindi.
  • Affix rules: aspell allows affixes (prefixes and suffixes) to be applied to words automatically. Thus, for example, storing नदी in the dictionary also covers the plural नदियाँ, provided the appropriate affix rule ( ी -> ि_याँ ) is included.
  • Replacement tables: Besides phonetic rules, a one-to-one replacement table can also be used to handle common mis-spellings.
  • Keyboard layout specifications: A class of mis-spelled words arises from typographical errors caused by the proximity of keys on a keyboard, such as “scsn” instead of “scan”. These errors can be given priority if the layout of the actual keyboard in use is provided to aspell.
  • Run-together words: It is possible to check for words that have been accidentally run together, such as “catbird” in place of “cat bird”.
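To make the phonetic and affix rules above a little more concrete, here is a minimal Python sketch of the two ideas. It is not aspell's rule syntax or its implementation: the rule table, the function names `sounds_like` and `apply_suffix_rule`, and the choice of plain string replacement are all assumptions made for illustration. Devanagari characters are written as Unicode escapes because stand-alone vowel signs often render poorly.

```python
# Illustrative sketch only: NOT aspell's phonetic-table format or algorithm.
# Devanagari code points are written as escapes so that stand-alone
# vowel signs (matras) do not cause rendering problems.
KA, KHA = "\u0915", "\u0916"            # consonants ka, kha
I_SHORT, I_LONG = "\u093f", "\u0940"    # vowel signs i, ii
U_SHORT, U_LONG = "\u0941", "\u0942"    # vowel signs u, uu
VIRAMA = "\u094d"                       # halant (vowel killer)

# Hypothetical equivalence table based on the examples in the text:
# each pair maps a character to the one it is treated as sounding like.
PHONETIC_RULES = [
    (KHA, KA),           # kha is treated like ka
    (I_LONG, I_SHORT),   # long i is treated like short i
    (U_LONG, U_SHORT),   # long u is treated like short u
    (VIRAMA, ""),        # a trailing halant is ignored
]


def sounds_like(word):
    """Reduce a word to a rough phonetic key by applying each rule in turn."""
    for source, target in PHONETIC_RULES:
        word = word.replace(source, target)
    return word


def apply_suffix_rule(word, strip, add):
    """Strip `strip` from the end of `word` and append `add`.

    Returns None when the word does not end with `strip`,
    i.e. when the rule does not apply.
    """
    if not word.endswith(strip):
        return None
    return word[: len(word) - len(strip)] + add


if __name__ == "__main__":
    # Two spellings that differ only in vowel length share one key.
    print(sounds_like(KA + I_LONG) == sounds_like(KA + I_SHORT))

    # The plural rule from the text: strip long i, add short i + "yan".
    nadi = "\u0928\u0926\u0940"
    print(apply_suffix_rule(nadi, I_LONG, "\u093f\u092f\u093e\u0901"))
```

In this sketch, a mis-spelling and a dictionary word are considered phonetically close when their keys match, and a single stored word plus a suffix rule stands in for several surface forms.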

Performance summary

Based on the above work, the Hindi spell-checker now performs on par with or better than the English equivalent, which is quite remarkable, considering that the original developer of aspell has no knowledge of Indian languages. One way to measure the performance of a spell-checker is to take a sample list of mis-spellings, feed them to the spell-checker, and check:

  • Whether the known correct word is suggested.
  • If so, its position in the replacement list.

The table below shows such a comparison for Hindi against the default English engine, and against the best (but much slower) English engine. (CAVEAT: Other factors, such as the extent and quality of the dictionary and the comprehensiveness of the sample word list, also affect the results, and too much should not be read into the actual numbers.)

Position of correct word   Hindi   Default Eng.   Best Eng.
Not found                  5%      6%             2%
1                          71%     59%            60%
1-5                        91%     86%            83%
1-10                       94%     91%            90%
Any                        95%     94%            98%
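
The measurement described above can be sketched in a few lines of Python. This is only an outline of the idea, not the test harness we used: the function name `rank_report`, the `suggest` callback, and the toy data are hypothetical, and any spell-checker that returns an ordered list of suggestions could be plugged in.

```python
def rank_report(samples, suggest, cutoffs=(1, 5, 10)):
    """Summarise where the correct word lands in the suggestion list.

    samples: list of (misspelling, correct_word) pairs.
    suggest: callable taking a misspelling and returning an ordered
             list of suggestions (a stand-in for the real spell-checker).
    Returns a dict of percentages over all samples.
    """
    total = len(samples)
    if total == 0:
        raise ValueError("need at least one sample")

    found = 0
    within = {cutoff: 0 for cutoff in cutoffs}
    for wrong, right in samples:
        suggestions = suggest(wrong)
        if right not in suggestions:
            continue
        found += 1
        position = suggestions.index(right) + 1  # 1-based rank
        for cutoff in cutoffs:
            if position <= cutoff:
                within[cutoff] += 1

    report = {
        "not_found": 100.0 * (total - found) / total,
        "any": 100.0 * found / total,
    }
    for cutoff in cutoffs:
        report["top_%d" % cutoff] = 100.0 * within[cutoff] / total
    return report


if __name__ == "__main__":
    # Toy suggestion lists, invented purely for the demonstration.
    toy = {"scsn": ["scan", "scorn"], "teh": ["ten", "the"]}
    samples = [("scsn", "scan"), ("teh", "the"), ("wrd", "word")]
    print(rank_report(samples, lambda word: toy.get(word, [])))
```

Each row of the table corresponds to one key of the returned dictionary, with "1", "1-5" and "1-10" being cumulative counts of samples whose correct word appears at or above that position.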

Other work

Various other tasks were taken up as part of this project, all of which have been, or will be, released as open-source:

  • Patches for aspell have been submitted that provide hooks to internal quantities to allow for tuning the performance to a new language.
  • aspell has bindings only in C. With the use of SWIG, we have put together bindings in a variety of programming languages. The bindings operate first as low-level wrappers around the C functions. More natural, class-based interfaces are then built around this low-level code in programming languages that support classes. This is currently available for Python, Perl, and C#, and will be released soon after some clean-up. A separate write-up on this work is also being prepared.
  • Testing framework: Based on the SWIG bindings for C#, we built a GUI testing framework that allows easy access to aspell internals. This was done in Mono.NET, and works cross-platform on Linux and Microsoft Windows (it can also be used under Mac OS X, Solaris, etc.) using the Mono runtime. A separate write-up on this work is being prepared.
  • OpenOffice aspell plugin: The OpenOffice office suite currently uses Hunspell as its default spell-checker, but an aspell plugin would be very useful. This is being worked on.
  • A comprehensive Hindi dictionary, including affix rules, is under preparation.
  • We plan to apply the knowledge gleaned from the Hindi work to other Indian languages. In particular, we have started working with Prof. G.S. Lehal of Punjabi University on Punjabi.
  • While aspell functions pretty well for Hindi, a morphological spell-checker that can use contextual information (e.g., the gender of the noun can be used to narrow down the modified spelling of the associated verb) will also be of value. An engine like this could also become a more general-purpose grammar analyser.


4 Responses to “Sorcery in Indian language spelling”

  1. Your effort is commendable. Everyone should contribute to it. A combined effort is needed so that the first version can reach users soon.

  2. In
    http://cmwiki.sarai.net/index.php/PhoneticDetails
    some Devanagari characters (especially maatras) are not visible, as a stand-alone maatra has rendering problems. So the Unicode code numbers may be given there in brackets.

  3. buckycat said

    http://cmwiki.sarai.net/index.php/PhoneticDetails

    How are you viewing the site? Looks fine with Firefox under Linux. It is a lot of work to add Unicode code points, and it will soon be possible to get the entire file from the aspell dictionary distribution. Besides, the above is a Wiki page, and any registered person can edit it.

  4. udai said

    Nice research. Hindi definitely has a lot of potential on the internet as well. A couple of us are working on a Hindi portal (www.hinkhoj.com); have a look at it, it would be good to have your comments.
