Saturday, March 07, 2009

Tamil script at Unicode

Unicode (unicode.org) is an attempt to provide a unique number for every character, no matter what the platform, program, or language. All Indic language scripts in the VIII Schedule of the Constitution of India are supported in the present version of the Unicode chart, except Manipuri.

Tamil script in Unicode was allotted the range U+0B80 to U+0BFF (128 code points, decimal 2944 to 3071) in plane 0, the Basic Multilingual Plane (BMP). The standard is widely implemented using UTF-8. A quick test: read this: "தமிழ்". Those who know Tamil would have read "Tamil" in Tamil.
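For the curious, the numbers above can be checked with a few lines of Python. This is only a sketch: the names `TAMIL_START`, `TAMIL_END` and `in_tamil_block` are mine, not part of any library, and the test word is written with escapes so the file encoding does not matter.

```python
# Bounds of the Tamil block as given above (hexadecimal).
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF

# The test word "தமிழ்", spelled out code point by code point.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

def in_tamil_block(ch):
    """Return True if the character's code point lies in the Tamil block."""
    return TAMIL_START <= ord(ch) <= TAMIL_END

block_size = TAMIL_END - TAMIL_START + 1
code_points = [f"U+{ord(ch):04X}" for ch in word]
utf8_length = len(word.encode("utf-8"))

print("block size:", block_size)
print("decimal range:", TAMIL_START, "to", TAMIL_END)
print("code points:", code_points)
print("UTF-8 bytes:", utf8_length)
```

The word that reads as three letters on screen is five code points, and since every code point in this block takes 3 bytes in UTF-8, it occupies 15 bytes.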

The Unicode chart for Tamil is very usable but has a painful drawback. Tamil has 12 vowels (உயிர்/uyir) and 18 consonants (மெய்/mei), and for these Unicode code points were obtained. Every vowel-consonant combination produces a உயிர்மெய்/uyirmei compound, and for these no unique code points were obtained. This means that each of the 216 uyirmei characters, other than the 18 that carry the inherent vowel அ/a, is stored as a sequence of two or more code points: in UTF-8, 6 or more bytes instead of the 3 bytes a single code point would take. While the ASCII chart has an intelligent order, which Unicode faithfully adopted for Latin-script languages like English, Tamil missed the bus, and how!
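The byte cost is easy to see for one consonant. The sketch below takes க (ka) and measures a few of its uyirmei forms in UTF-8; the helper `utf8_bytes` and the variable names are my own, and the last case uses the canonically decomposed form (NFD) to show where "or more" comes from.

```python
import unicodedata

def utf8_bytes(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

ka = "\u0b95"        # க  : consonant with the inherent vowel a
ki = "\u0b95\u0bbf"  # கி : consonant + vowel sign i
ko = "\u0b95\u0bca"  # கொ : consonant + vowel sign o
# Vowel sign o has a canonical decomposition into two vowel signs,
# so the NFD form of the same syllable is three code points long.
ko_nfd = unicodedata.normalize("NFD", ko)

for label, text in [("ka", ka), ("ki", ki), ("ko", ko), ("ko (NFD)", ko_nfd)]:
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label:9} {points:22} {utf8_bytes(text)} bytes")
```

This should report 3 bytes for ka, 6 for ki and ko, and 9 for the decomposed ko, although each of them is a single letter to a Tamil reader.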

http://www.tunerfc.tn.nic.in/ promotes "New Unicode TUNE", which is based on encodings for Tamil characters in the Private Use Area of the Unicode charts! This is pursuant to the GO at http://www.tn.gov.in/gorders/IT/it_e_13_2006.htm. It looks like the Govt of TN and the Tamil Virtual University sites at http://www.tn.gov.in/tamiltngov/sitemap.htm and http://www.tamilvu.org/ publish pages in TUNE.
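To see why the Private Use Area is a shaky home for a script, compare what the Unicode character database knows about a standard Tamil letter and about a private-use code point. This is a sketch only: U+E000 is simply the first code point of the BMP Private Use Area, picked for illustration, and I am not claiming it is one that TUNE assigns. The helper `describe` is my own.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, standard name or None)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, None),  # None when the standard gives no name
    )

tamil_ka = "\u0b95"     # standard Tamil letter
private_use = "\ue000"  # arbitrary Private Use Area code point (illustrative)

print(describe(tamil_ka))
print(describe(private_use))

# "Co" is the general category for private-use code points.
is_private = unicodedata.category(private_use) == "Co"
print("private use:", is_private)
```

The standard letter carries a name and properties that every conforming program shares. The private-use code point has category "Co" and no name, so its meaning depends entirely on a private agreement and on fonts that follow it.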

We need to work on improving the representation of Tamil in the Unicode chart, but giving the oldest living language a place in the Private Use Area does not help Tamil. An article I wrote last year on some of the issues is available here.

The present Unicode standard for the Tamil script is usable, notwithstanding the defects. Promoting TUNE is wasteful. Instead, the Government of Tamil Nadu should promote the Unicode standard for Tamil as it stands in the public BMP space, research the pros and cons thoroughly, and come up with a better solution, or at least leave the existing Unicode standard in peace.