IJCSI International Journal of Computer Science Issues, Vol. 3, 2009 45
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814
Implementation of Rule Based Algorithm for Sandhi-Vicheda Of
Compound Hindi Words
Priyanka Gupta¹, Vishal Goyal²
¹M.Tech. (ICT) Student, ²Lecturer
Department of Computer Science
Punjabi University Patiala
Abstract
Sandhi means joining two or more words to coin a new
word. Literally, sandhi means `putting together' or
combining (of sounds); it denotes all combinatory
sound changes effected (spontaneously) for ease of
pronunciation. Sandhi-vicheda describes [5] the process
by which one letter (whether single or conjoined) is
broken to form two words: part of the broken letter
remains as the last letter of the first word, and part
forms the first letter of the next word. Sandhi-vicheda
is an easy and interesting technique that can add a new
dimension to the traditional approach to Hindi
teaching. In this paper, using a rule-based algorithm, we
report an accuracy of 60-80%, depending on the number
of rules implemented.
Keywords: Rule Based Algorithm, Sandhi-Vicheda,
Compound Hindi Words
I INTRODUCTION
Natural Language Processing (NLP) refers to
methods that attempt to make computers
analyze, understand and generate natural languages,
enabling one to address a computer in the same manner
as one addresses a human being. Natural Language
Processing is both a modern computational technology
and a method of investigating and evaluating claims
about human language itself. It is a subfield of artificial
intelligence and computational linguistics that studies
the problems of automated generation and understanding
of natural human languages.
A word can be defined as a sequence of
characters delimited by spaces, punctuation marks, etc.
in written text. A compound word (also known
as a conjoined word) can be broken up into two or more
independent words. A Sandhi-Vicheda module breaks
each compound word in a sentence into its constituent
words. Sandhi takes place in the presence of a swara,
i.e. a vowel; of a consonant with a halanta; or of a
visarga. Sanskrit has a well-defined set of rules for
Sandhi-vicheda. Hindi has its own rules of
Sandhi-vicheda; they are, however, not as well defined
as the Sanskrit rules, and much fewer in number.
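The idea of splitting a joined letter back into the last letter of the first word and the first letter of the next can be illustrated with a minimal sketch. It is not the algorithm evaluated in this paper: the function names `tokenize` and `sandhi_vicheda`, the single rule entry for the sign ा (covering अ/आ + अ/आ → आ), and the three-word lexicon are all assumptions made for illustration. A candidate split is accepted only when both parts are found in the lexicon.

```python
import re

AA_MATRA = "\u093E"          # dependent vowel sign ा
A, AA = "\u0905", "\u0906"   # independent vowels अ and आ

# Hypothetical rule table: joined sign -> possible
# (ending restored to the first word, beginning restored to the second word).
# "" means the first word ends in a consonant with its inherent vowel.
SPLIT_RULES = {
    AA_MATRA: [("", A), ("", AA), (AA_MATRA, A), (AA_MATRA, AA)],
}

# Hypothetical toy lexicon used to validate candidate splits.
LEXICON = {"विद्या", "आलय", "हिम"}


def tokenize(text):
    """Split text into words on whitespace, the danda and common punctuation."""
    return [w for w in re.split(r"[\s।,.;:!?]+", text) if w]


def sandhi_vicheda(word, lexicon=LEXICON, rules=SPLIT_RULES):
    """Return every (first, second) split licensed by a rule and the lexicon."""
    splits = []
    for i, ch in enumerate(word):
        for tail, head in rules.get(ch, []):
            first = word[:i] + tail        # part of the letter stays with word 1
            second = head + word[i + 1:]   # the other part starts word 2
            if first in lexicon and second in lexicon:
                splits.append((first, second))
    return splits


for w in tokenize("विद्यालय हिमालय"):
    print(w, sandhi_vicheda(w))
# विद्यालय [('विद्या', 'आलय')]
# हिमालय [('हिम', 'आलय')]
```

A word with no matching rule, or whose candidate parts are not in the lexicon, yields an empty list and would be left unsplit. Adding further entries to the rule table is how the number of implemented rules would grow in such a design.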
1.1 The Hindi Language
Hindi is spoken in northern and central India. Linguists
think of Hindi and Urdu as the same language, the
difference being that Hindi [5] is written in the
Devanagari script and draws much of its vocabulary
from Sanskrit, while Urdu is written in the Persian
script and draws a great deal of its vocabulary from
Persian and Arabic. More than 180 million people in
India regard Hindi as their mother tongue, and another
300 million use it as a second language. Hindi is an
official language of India, is spoken by almost half a
billion people in India and throughout the world, and is
the world's second most spoken language. It allows one to
communicate with a far wider variety of people in
India than English, which is spoken by only around five
percent of the population. It is written in an
easy-to-learn phonetic script called “Devanagari”, which
is also used to write Sanskrit, Marathi and Nepali. Hindi
is normally spoken using a combination of 52 sounds: ten
vowels, 40 consonants, nasalisation and a kind of
aspiration. These sounds are represented in the
Devanagari script by 52 symbols: ten vowels, two
modifiers and 40 consonants.
II RELATED WORK
Sandhi (in linguistics) [1] is a cover term for a wide
variety of phonological processes that occur at
morpheme or word boundaries, such as the fusion of
sounds across word boundaries and the alteration of
sounds due to neighboring sounds or due to the
grammatical function of adjacent words. Internal
sandhi features the alteration of sounds within words
at morpheme boundaries, as in sympathy (syn- +
pathy). External sandhi refers to changes found at
word boundaries, such as in the pronunciation [tεm
bʊks] for ten books. This is not true of all dialects of
English. The Linking R of some dialects of English is a
kind of external sandhi, as is the process called liaison
in the French language. While it may be extremely
common in speech, sandhi (especially external) is
typically ignored in spelling, as is the case in English,
with the exception of the distinction between "a" and