IJCSI International Journal of Computer Science Issues, Vol. 3, 2009
46
ISSN (Online): 1694-0784
ISSN (Printed): 1694-0814
"an" (sandhi is, however, reflected in the writing
system of Sanskrit and Hindi). External sandhi effects
can sometimes become morphologized. Most tonal
languages have tone sandhi, in which the tones of
words alter according to predetermined rules. For
example, Mandarin has four tones: a high monotone, a
rising tone, a falling-rising tone, and a falling tone. In
the common greeting nǐ hǎo, both words in isolation
would normally have the falling-rising tone. However,
this is difficult to say, so the tone on nǐ is pronounced
as the rising tone, ní (but still written nǐ in Hanyu Pinyin).
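The Mandarin rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a full phonological model: the function name and the numeric tone encoding (1 = high, 2 = rising, 3 = falling-rising, 4 = falling) are assumptions made for this example, and runs of three or more third tones, whose real behaviour depends on phrasing, are handled by a deliberately simple pairwise rule.

```python
def apply_third_tone_sandhi(tones):
    """Return surface tones for a list of underlying Mandarin tones (1-4).

    Simplifying assumption: every third tone that is directly followed by
    an underlying third tone is pronounced as a second tone.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


# The greeting from the text: both syllables carry tone 3 in isolation.
print(apply_third_tone_sandhi([3, 3]))  # [2, 3]
```

The spelling is left untouched by such a rule; only the pronounced tone sequence changes.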
The Sanskrit Sandhi engine software is not currently
available as a standalone application, since its local use
demands the installation of an HTTP server on the
user's host.
The Sandhi module [1] was developed by RCILTS-
Sanskrit, Japanese, Chinese at Jawaharlal Nehru
University, New Delhi. RCILTS, JNU is a resource
center for the Sanskrit language of DIT, Government of
India. At JNU, work started in three languages, viz.
Sanskrit, Japanese, and Chinese. Using this module, the
user can get information about Sandhi rules and
processes. The sutra number in the Ashtadhyayi and its
description are displayed. The user can learn three types of
Sandhi (Svara Sandhi, Vyanjan Sandhi, and Hal Sandhi) through this
module. Data is in Unicode. Sandhi exceptions
and options are also incorporated. The module takes
two words as input. The first word cannot be null, but
the second word can be. The user inputs the two words
and submits the form to get the result for the given input.
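The input contract described above (two words, the first mandatory, the second optional) can be sketched as follows. This is a hypothetical illustration and not the RCILTS implementation: the function name, the romanized input, and the tiny rule table are assumptions. The table covers only the familiar vowel-lengthening case, in which a or ā meeting a or ā merges into ā; a real module would carry many more rules together with their exceptions and options.

```python
A_LONG = "\u0101"  # the romanized long vowel "ā" as a single code point

# Hypothetical rule table: (last vowel of first word, first vowel of
# second word) -> merged vowel. Only vowel lengthening is covered here.
SVARA_RULES = {
    ("a", "a"): A_LONG,
    ("a", A_LONG): A_LONG,
    (A_LONG, "a"): A_LONG,
    (A_LONG, A_LONG): A_LONG,
}


def join_words(first, second=""):
    """Join two romanized words, applying a Sandhi rule when one matches."""
    if not first:
        raise ValueError("the first word must not be empty")
    if not second:
        return first  # the second word is optional
    merged = SVARA_RULES.get((first[-1], second[0]))
    if merged is None:
        # No rule in this toy table applies; fall back to concatenation.
        return first + second
    return first[:-1] + merged + second[1:]


print(join_words("deva", A_LONG + "laya"))  # devālaya
```

An empty first word is rejected outright, while an empty second word simply returns the first word unchanged, mirroring the behaviour the text describes.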
Chinese tone sandhi [2]: Chin-Chuan Cheng of the
Phonology Laboratory, University of California, Berkeley,
studied how English stresses are
interpreted by Chinese speakers when they speak
Chinese with English words inserted, as Chinese speakers
in the United States usually do.
In Mandarin Chinese, a tone-sandhi
rule changes a third tone preceding another third tone
to a second tone. Using the tone-sandhi rule, he
designed an experiment to find out how English
stresses are interpreted in Chinese sentences. Stress
does not exist in the underlying representations of
English phonology. But in studying bilingual
phenomena, the phonetic level is also important. Fry
(1955) found that when a vowel was long and of high
intensity, listeners agreed that the vowel was strongly
stressed. The results of his experiments indicate that
the duration ratio has a stronger influence on
judgements of stress than the intensity ratio has.
Lehiste and Peterson (1959) also reported experiments
on stress.
English l-sandhi [3] involves an allophonic alternation
in alveolar contact for word-final /l/ in connected
speech [4]. EPG data for five Scottish Standard English
and five Southern Standard British English speakers
show that there is individual and dialectal variation in
contact patterns.
III PROBLEM DEFINITION
Developing programs that understand a natural
language is a difficult task. Natural languages are large.
They contain an infinity of different sentences. No
matter how many sentences a person has heard or seen,
new ones can always be produced. Also, there is much
ambiguity in a natural language. Many words have
several meanings and sentences can have different
meanings in different contexts. Compound words are
created by joining an arbitrary number of existing
words together, which can greatly increase
the vocabulary size and thus lead to sparse-data
problems. Compound words therefore
pose challenges for many NLP applications. The
problem with which this paper is concerned is
breaking up Hindi compound words into their constituent
words. In Hindi, a word is a sequence of characters
built from ‘swar’ (vowels), ‘vyanjan’ (consonants), and
‘matra’ (vowel signs). Hindi has its own rules of Sandhi-vicheda.
They are, however, less well defined than, and much
fewer in number than, the Sanskrit rules. So my
problem is to break a compound word into its
constituent words with the help of the rules of ‘Sandhi-
vicheda’ in Hindi grammar, and to design a
Graphical User Interface that accepts a
Hindi word (source text) from the keyboard
or mouse and breaks it into its constituent words (target
text). The source text is converted into target text in
Unicode format.
Compound Word | Sandhi-Vicheda
ijk/fhu | ij $ v/fhu
HfkOkFf | HfkO $ VFf
Rioiy; | f’ko $ v∣y;
dO∣un | dfO $ bn
xΦτ | ХЧ $ ⅛k
ije?oj | ije $ b?oj
,dd | ,d $ ,d
;Fd | ;Fkk $ ,d
ijkidfj | ij $ midj
ιfU∕∣pNn | lfU∕k $ Nn
fOpNn | fO $ Nn
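The kind of rule-based splitting defined in this section can be sketched in Python on Unicode Devanagari text. This is a minimal illustration under stated assumptions, not the system described in the paper: the rule table, the toy lexicon, the example words, and the function name are all hypothetical. Each rule says that a vowel sign (matra) inside the compound may have arisen from a particular ending of the first word and a particular initial vowel of the second word; a candidate split is accepted only if both parts are found in the lexicon.

```python
# Devanagari vowel signs (matras), written as escapes to avoid ambiguity.
MATRA_AA = "\u093e"  # long a
MATRA_I = "\u093f"   # short i
MATRA_II = "\u0940"  # long i
MATRA_E = "\u0947"   # e
MATRA_O = "\u094b"   # o

# Hypothetical rule table: matra found in the compound ->
# possible (ending of first word, beginning of second word) pairs.
VICHEDA_RULES = {
    MATRA_AA: [("", "अ"), ("", "आ"), (MATRA_AA, "अ"), (MATRA_AA, "आ")],
    MATRA_II: [(MATRA_I, "इ"), (MATRA_I, "ई"), (MATRA_II, "इ"), (MATRA_II, "ई")],
    MATRA_E: [("", "इ"), ("", "ई"), (MATRA_AA, "इ"), (MATRA_AA, "ई")],
    MATRA_O: [("", "उ"), ("", "ऊ"), (MATRA_AA, "उ"), (MATRA_AA, "ऊ")],
}

# Toy lexicon (an assumption) used to reject splits that are not words.
LEXICON = {"पर", "अधीन", "उपकार", "कवि", "इन्द्र", "विद्या", "आलय"}


def sandhi_vicheda(compound, lexicon=LEXICON):
    """Return (first, second) for the first valid split, or None."""
    for i, ch in enumerate(compound):
        for left_end, right_start in VICHEDA_RULES.get(ch, []):
            left = compound[:i] + left_end
            right = right_start + compound[i + 1:]
            if left in lexicon and right in lexicon:
                return left, right
    return None


for word in ["परोपकार", "पराधीन", "कवीन्द्र", "विद्यालय"]:
    print(word, "->", sandhi_vicheda(word))
```

The lexicon check is what keeps the rules from over-generating: almost every Hindi word contains some matra, so without it a rule table alone would propose a split for words that are not compounds at all.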