Instead of relying on symbolic representations, a third
approach consists in (1) analyzing the co-occurrence of
words in large corpora in order to derive semantic similarities
and (2) relying on very simple structures, namely high-
dimensional vectors, to represent meanings. In this
approach, the unit is the word. The meaning of a word is not
defined per se, but rather determined by its relationships
with all other words. For instance, instead of defining the
meaning of bicycle in an absolute manner (by its properties,
function, role, etc.), it is defined by its degree of association
with other words (i.e., very close to bike, close to pedals, ride,
and wheel, but far from duck, eat, etc.). This semantic
information can be established from raw texts, provided that
enough input is available. This is exactly what humans do:
it seems that we learn most of the words we know by reading
(Landauer & Dumais, 1997). The reason is that most words
appear almost exclusively in written form and that direct
instruction seems to play a limited role. We would therefore
learn word meanings mainly from raw texts, constructing
them mentally through repeated exposure to appropriate
contexts.
Relying on direct co-occurrence
One way to mimic this powerful mechanism would be to
rely on direct co-occurrences within a given context unit. A
usual unit is the paragraph, which is both computationally
easy to identify and of reasonable size. We would say that:
R1: words are similar if they occur in the same paragraphs.
Therefore, we would count the number of occurrences of
each word in each paragraph. Suppose we use a 5,000-
paragraph corpus. Each word would be represented by
5,000 values, that is, by a 5,000-dimensional vector. For
instance:
avalanche: (0,1,0,0,0,0,1,0,2,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0...)
snow: (0,2,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2,1,1,0,1,0,0,0,0,0,0...)
This means that the word avalanche appears once in the 2nd
paragraph, once in the 7th, twice in the 9th, etc. One can see
that, given the previous rule, both words are quite similar:
they co-occur quite often. A simple cosine between the two
vectors can measure the degree of similarity. However, this
rule does not work well (Perfetti, 1998; Landauer, 2002):
two words should be considered similar even if they do not
co-occur. French & Labiouse (2002) think that this rule
might still work for synonyms because writers tend not to
repeat words, but use synonyms instead. However, defining
semantic similarity only from direct co-occurrence is
probably a serious restriction.
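To make rule R1 concrete, here is a minimal sketch in Python of how such word-by-paragraph count vectors and their cosine could be computed; the three paragraphs and the resulting values are purely illustrative and are not taken from any actual corpus:

from collections import Counter
import numpy as np

# A tiny invented corpus: each string stands for one paragraph.
paragraphs = [
    "the avalanche buried the road under heavy snow",
    "children played in the snow all afternoon",
    "the duck swam across the pond",
]

vocabulary = sorted({word for p in paragraphs for word in p.split()})

# One count vector per word: its number of occurrences in each paragraph.
counts = {w: np.zeros(len(paragraphs)) for w in vocabulary}
for j, p in enumerate(paragraphs):
    for word, n in Counter(p.split()).items():
        counts[word][j] = n

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(counts["avalanche"], counts["snow"]))  # > 0: the words co-occur
print(cosine(counts["avalanche"], counts["duck"]))  # 0: they never co-occur

The last line illustrates the limitation discussed above: under rule R1, two words that never share a paragraph always get a similarity of zero.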
Relying on higher-order co-occurrence
Therefore, another rule would be:
R1*: words are similar if they occur in similar paragraphs.
This is a much better rule. Consider the following two
paragraphs:
Bicycling is a very pleasant sport. It helps you keep in good
health.
For your fitness, you can ride a bike. It is very nice and
good for your body.
Bicycling and bike appear in similar paragraphs. If this is
repeated over a large corpus, it would be reasonable to
consider them similar, even if they never co-occur within
the same paragraph. Now we need to define paragraph
similarity. We could say that two paragraphs are similar if
they share words, but that would be restrictive: as the
previous example illustrates, two paragraphs can be similar
even though they have almost no words in common (function
words are usually not taken into account). Therefore, the
rule is:
R2: paragraphs are similar if they contain similar words.
Rules R1* and R2 form a circular definition, but this
circularity can be resolved by a specific mathematical
procedure, singular value decomposition, applied to the
occurrence matrix. This is exactly what LSA does. In other
words, LSA relies not only on direct co-occurrence, but,
more importantly, on higher-order co-occurrence.
Kontostathis & Pottenger (2002) have shown that these
higher-order co-occurrences do appear in large corpora.
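As an illustration, the following is a minimal sketch of such a decomposition on a toy word-by-paragraph count matrix, using numpy; a real LSA space is computed from weighted counts over thousands of paragraphs, so the matrix, the word labels, and the value of N below are purely illustrative:

import numpy as np

# Illustrative word-by-paragraph count matrix (rows = words, columns = paragraphs).
# In practice the counts are first weighted (e.g., with a log-entropy scheme).
X = np.array([
    [1, 0, 2, 0],   # e.g., "bicycling"
    [0, 1, 0, 2],   # e.g., "bike"
    [1, 1, 1, 1],   # e.g., "health"
    [0, 0, 1, 0],   # e.g., "duck"
], dtype=float)

# Singular value decomposition: X = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the N largest singular values (N is around 300 for a full corpus;
# N = 2 here only because the toy matrix is tiny).
N = 2
word_vectors = U[:, :N] * S[:N]            # one N-dimensional vector per word
paragraph_vectors = Vt[:N, :].T * S[:N]    # one N-dimensional vector per paragraph

The truncation to N dimensions is what resolves the circularity: words end up close to each other when they occur in similar paragraphs, and paragraphs end up close when they contain similar words.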
LSA consists in reducing the huge dimensional space of
direct word co-occurrences to its N best dimensions. All
words are then represented as N-dimensional vectors.
Empirical tests have shown that performance is maximal for
N around 300 for general English as a whole
(Landauer et al., 1998; Bellegarda, 2000), but this value can
be smaller for specific domains (Dumais, 2003). We will
not describe the mathematical procedure, which is presented
in detail elsewhere (Deerwester et al., 1990; Landauer et
al., 1998). The fact that word meanings are represented as
vectors leads to two consequences. First, it is straight-
forward to compute the semantic similarity between words,
which is usually the cosine between the corresponding
vectors, although other similarity measures can be used.
Examples of semantic similarities between words from a
12.6 million word corpus are (Landauer, 2002):
cosine(doctor, physician) = .61
cosine(red, orange) = .64
Second, sentences or texts can be assigned a vector by a
simple weighted linear combination of their word vectors.
Being able to go easily from words to texts in this way is a
powerful feature of the semantic representation. An example of
semantic similarity between sentences is:
cosine(the cat was lost in the forest, my little feline
disappeared in the trees) = .66
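The following minimal sketch illustrates both consequences in Python; the three-dimensional vectors, the word list, and the uniform weights are invented stand-ins chosen only so that related words point in similar directions, not values from an actual LSA space:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional stand-ins for LSA word vectors (a real space
# would have around 300 dimensions obtained from the SVD sketched above).
space = {
    "cat":         np.array([0.9, 0.1, 0.0]),
    "feline":      np.array([0.8, 0.2, 0.1]),
    "lost":        np.array([0.1, 0.9, 0.2]),
    "disappeared": np.array([0.2, 0.8, 0.1]),
    "forest":      np.array([0.1, 0.2, 0.9]),
    "trees":       np.array([0.0, 0.3, 0.8]),
}

def text_vector(words, weights=None):
    # A text is assigned the weighted linear combination of its word vectors;
    # uniform weights are used here for simplicity.
    weights = weights or [1.0] * len(words)
    return sum(w * space[word] for w, word in zip(weights, words))

print(cosine(space["cat"], space["feline"]))                    # word-to-word
print(cosine(text_vector(["cat", "lost", "forest"]),
             text_vector(["feline", "disappeared", "trees"])))  # text-to-text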
Modeling children's semantic memory
Semantic space
As we mentioned before, our goal was to rely on LSA to
define a reasonable approximation of children's semantic
memory. This is a necessary step for simulating a variety of
children's cognitive processes.
LSA itself obviously cannot form such a model: it needs
to be applied to a corpus. We gathered French texts that
approximately correspond to what a child is exposed to: