UTIL: Unified Transliteration of Indic Languages
:NOTE: Download PDF of this article. You need good Unicode fonts to read
the article. I recommend the Noto font family from Google.
IPA symbols are within square brackets, like [ʂ], and transliterated symbols
are within slashes, like /ṭ/.
UTIL is a romanization scheme for Indic languages. It is designed as a pan-Indian transliteration scheme. It covers 20+ languages: Bengali, Dogra, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Lepcha, Limbu, Manipuri (Meitei), Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Sinhala, Tamil, Telugu, Urdu and probably many more.
So, why yet another scheme?
- IAST is insufficient. It serves Sanskrit and Pali but is incomplete for pretty much everything else (e.g. Bengali, Gujarati, etc.).
- ISO 15919 is also insufficient. It ignores Kashmiri and Sindhi, which are integral Indian languages. Plus, it lacks symbols for newly assigned Unicode codepoints (e.g. ॹ or ॺ). Also कृष्ण /Kṛṣṇa/ is typographically more consistent than /Kr̥ṣṇa/.
- ALA-LC is designed as a single-language model, ignoring the inherent similarity of Brahmic scripts. This leads to inconsistencies. For example, Tamil ழ /ḻ/ and Kannada ೞ /l̤/ correspond to the same character and sound (“retroflex approximant”) and yet have different representations. Conversely, the same symbol /ṣ/ represents Hindi ष [ʂ] and Urdu ص [sˤ] even though they’re completely different sounds.
- Other schemes like Hunterian or Gretil are as bad as the above or even worse sometimes.
So, how is UTIL better?
- Covers the entire character set of ISO 15919 plus more (Kashmiri, Sindhi)
- Long vowels always have macron above (ā, ē, …)
- Aspirated consonants always have ‘h’ as second letter (kh, gh, …)
- Minimum number of diacritical marks:
- Only four diacritics are used: “dot above”, “dot below”, “macron above”, “macron below” (or their combination).
- Only three diacritics are needed for Sanskrit, instead of IAST’s five.
- Prefers precomposed characters in the Unicode repertoire, but does not require them.
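The preference for precomposed characters can be checked with Unicode normalization. The snippet below is only a minimal sketch using Python's standard `unicodedata` module; the helper name `to_precomposed` is made up for illustration. It shows that a letter plus "macron above" or "dot below" collapses into a single precomposed code point under NFC, while the ring below used in /r̥/ has no precomposed form and stays as two code points.

```python
import unicodedata

def to_precomposed(text):
    """Return text in NFC, i.e. with precomposed characters where Unicode has them."""
    return unicodedata.normalize("NFC", text)

# a + combining macron (U+0304) composes to ā (U+0101)
print(len(to_precomposed("a\u0304")))        # one code point after NFC

# r + combining dot below (U+0323) composes to ṛ (U+1E5B)
print(len(to_precomposed("r\u0323")))        # one code point after NFC

# r + dot below + macron composes to ṝ (U+1E5D)
print(len(to_precomposed("r\u0323\u0304")))  # one code point after NFC

# r + combining ring below (U+0325) has no precomposed form,
# so the sequence remains two code points even after NFC
print(len(to_precomposed("r\u0325")))
```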
Primary vowels and diphthongs:
Additional ones, all have a dot below:
Consonants with their Sanskrit names:
Affricate glide ॺ (‘JJYA’) is transcribed /j̄/.
|Anusvāra: ṃ||Anunāsika: ̐||Avagraha: ’|
|Visarga: ḥ||Jihvāmūlīya: x̣||Upadhmānīya: ẋ|
|Vedic Udātta: ́||Svarita (independent): ̀||Anudātta: ̱|
|Arabic hamza ء: ʼ||Arabic ain ع: ʽ|
|Rising tone: ˊ||Falling tone: ˋ||Neutral tone: ˙|
Udātta and svarita use the combining acute and grave accents respectively, whereas hamza and ain use the non-combining modifier letters U+02BC and U+02BD respectively. Tone modifiers are used in Maithili, Dogra and other Pahari languages.
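The difference between combining marks and non-combining modifier letters can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch: the dictionary `SIGNS` and the helper `describe` are names made up here, and the accent assignments follow the table above (udātta with acute, svarita with grave).

```python
import unicodedata

# Labels are for illustration only; code points follow the table above.
SIGNS = {
    "udatta": "\u0301",   # combining acute accent
    "svarita": "\u0300",  # combining grave accent
    "hamza": "\u02bc",    # modifier letter apostrophe
    "ain": "\u02bd",      # modifier letter reversed comma
}

def describe(ch):
    """Return the Unicode name and general category of a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

for label, ch in SIGNS.items():
    name, category = describe(ch)
    # "Mn" = nonspacing (combining) mark, "Lm" = modifier letter
    print(f"{label:8} U+{ord(ch):04X} {category} {name}")
```

Combining marks (category `Mn`) attach to the preceding letter, while modifier letters (category `Lm`) occupy their own position in the text.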
- Anunāsika is denoted by a combining candrabindu. Note the difference between हंस (swan) /haṃsa/ and हँस (laugh) /ha̐s/. In a digraph, the diacritic is placed only on the second letter. Example: हैँ /hai̐/
- A colon is used to denote vowel hiatus or resolve ambiguity. Example: बई /ba:i/ (not /bai/)
- ॠ /ṝ/, ऌ /ḷ/ and ॡ /ḹ/ are used only in Sanskrit.
- ऎ /e/ = short ए in Southern scripts (எ, എ, ಎ, ఎ)
- ऒ /o/ = short ओ in Southern scripts (ஒ, ഒ, ಒ, ఒ)
- ऍ /ẹ/ = Gujarati ઍ, Sinhala ඇ, pronounced [æ] as in “bat”
- ऑ /ọ̄/ = Gujarati ઑ, pronounced [ɔː] as in “ball”
- ड़ /ṙa/ = Bengali ড়, Punjabi ੜ, Oriya ଡ଼ (RRA, “retroflex flap”)
- ढ़ /ṙha/ = Bengali ঢ়, Oriya ଢ଼ (RHA, “aspirated retroflex flap”)
- ळ /ḻa/ = used in Marathi, Tamil ள, Malayalam ള, Kannada ಳ, Telugu ళ (LLA, “retroflex lateral approximant”)
- ऴ /ḻ̇a/ = Tamil ழ, Malayalam ഴ, Kannada ೞ, Telugu ఴ (LLLA, “retroflex approximant” = zha)
- ऩ /ṉa/ = Tamil ன , Kannada ನ಼, Malayalam ഩ (NNNA, “alveolar n”)
- ऱ /ṟa/ = Tamil ற, Malayalam റ, Kannada ಱ, Telugu ఱ (RRA, “alveolar r”)
- र् /ṟ/ = repha in Marathi
- य़ /ẏa/ = য in Bengali and Oriya, while য /y/ = য়
- व़ /wa/ = Urdu و, Assamese ৱ, Oriya ୱ
- ख़ /ḵẖa/ = used in Urdu, Punjabi ਖ਼
- ग़ /ġa/ = used in Urdu, Punjabi ਗ਼
- च़ /ċa/ = used in Kashmiri, Telugu ౘ
- ज़ /za/ = Urdu ز, Gurmukhi ਜ਼, Bengali জ়, Kannada ಜ಼, Telugu ౙ [d͡z]
- ॹ, झ़ /zh/ = Urdu ژ, Gujarati ૹ, Avestan uses ॹ
- Kashmiri vowels ạ, ạ̄, ọ, ọ̄, ụ, ụ̄ are pronounced [ə], [əː], [ɔ], [ɔː], [ɨ], [ɨː]. Another vowel form, ॵ, is sometimes used for [ɔ] (and [ɔː] is skipped altogether). These symbols are taken as-is from ALA-LC and follow Wikipedia. Kashmiri consonants: च़ [t͡s], छ़ [t͡sʰ] and ज़ [z]
- Sindhi implosives: ॻ /g̱/, ॼ /j̱/, ॾ /ḏ/ and ॿ /ḇ/
- Sinhalese nasals: ඟ /ṇ̄ga/, ඥ /ṇ̄ja/, ඬ /ṇ̄ḍa/, ඳ /ṇ̄da/ and ඹ /ṃ̄ba/
- Sinhala long vowel ඈ and Devanagari vowel sign candra long E (U+0955), used in Avestan, are transliterated /ẹ̄/
- The six Malayalam chillu characters represent dead consonants (without implicit vowel). As such, they are simply transliterated without adding an a next to the consonant. Hence, ൿ, ൽ, ൾ, ൻ, ൺ and ർ are respectively transliterated as /k-/, /l-/, /ll-/, /n-/, /nn-/ and /rr-/.
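The colon rule for vowel hiatus from the notes above can be sketched in a few lines. This is a rough illustration, not a full transliterator: the function `join_syllables` and the set `DIGRAPHS` are hypothetical, and only /ai/ is attested as a digraph in the notes (हैँ /hai̐/); /au/ is assumed here by analogy.

```python
# Vowel digraphs that would become ambiguous if two separate vowels met.
# "ai" is attested in the notes above; "au" is an assumption for this sketch.
DIGRAPHS = {"ai", "au"}

def join_syllables(syllables):
    """Join already-romanized syllables, inserting ':' where a syllable
    boundary would otherwise be misread as a vowel digraph."""
    result = ""
    for syllable in syllables:
        if not syllable:
            continue  # ignore empty pieces
        if result and (result[-1] + syllable[0]) in DIGRAPHS:
            result += ":"  # mark the hiatus explicitly
        result += syllable
    return result

print(join_syllables(["ba", "i"]))  # बई: two vowels in hiatus
print(join_syllables(["bai"]))      # a true diphthong is left untouched
```

The first call yields /ba:i/, matching the बई example, because the boundary between "ba" and "i" would otherwise read as the diphthong /ai/.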
Perso-Arabic characters are chosen in a non-conflicting way with the Brahmic scripts. Urdu introduces six sounds [f, z, ʒ, q, x, ɣ] on top of Hindi (see Hindustani phonology). Note that [f, z, x, ɣ] are fricatives, just like ष [ʂ], स [s], ह [h]. Excluding these IPA signs, the ones in the below table are indicative only.
Input methods for IMEs
Of course, a transliteration scheme is not very useful if it cannot be entered into a computer, which is what Input Method Editors (IMEs) are for. The key sequences can be thought of as an ASCII transliteration of UTIL.
The Emacs input method can be found in indic-roman-postfix.el, which is a postfix input method (i.e., diacritics are entered after the character).
indic-util.mim is an m17n input method that can be used with many IMEs based on libm17n, such as iBus, uim and fcitx.
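To illustrate what "postfix" means, here is a minimal sketch of the idea. The key assignments in `POSTFIX_MARKS` are invented for this example and are not the actual rules of indic-roman-postfix.el or indic-util.mim; the function name `postfix_to_util` is likewise hypothetical. Each ASCII key typed after a letter adds one of the four UTIL diacritics, and the result is normalized to NFC so that precomposed characters are used where they exist.

```python
import unicodedata

# Hypothetical key assignments, NOT taken from the real input methods.
POSTFIX_MARKS = {
    "-": "\u0304",  # macron above
    ".": "\u0323",  # dot below
    "'": "\u0307",  # dot above
    "_": "\u0331",  # macron below
}

def postfix_to_util(text):
    """Convert ASCII postfix input into UTIL letters with diacritics."""
    units = []  # each unit is a base character plus any marks attached so far
    for ch in text:
        if ch in POSTFIX_MARKS and units and units[-1][0].isalpha():
            units[-1] += POSTFIX_MARKS[ch]  # attach the mark to the previous letter
        else:
            units.append(ch)
    return unicodedata.normalize("NFC", "".join(units))

print(postfix_to_util("Kr.s.n.a"))  # the Kṛṣṇa example from above
print(postfix_to_util("a-"))        # long vowel ā
```

A real input method also needs an escape for literal punctuation; in this sketch a period typed directly after a letter is always treated as "dot below".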