I came across this interesting article today on The Japan Times Online. It demonstrates how difficult Japanese really is. According to this story, not even computers can navigate the three writing systems used in Japanese! Here's a telling excerpt:
"OK, it's tough for us humans. But is it any easier from the computer's point of view?
Unfortunately, no. Experts say far more computing power is required to search Japanese text than English.
One reason is that Japanese doesn't use spaces. With
no spaces to separate words, search engines attempting to index a
document must work out for themselves where words begin and end.
Imagine having to figure out that "amanaplanacanalpanama" means, "A
man, a plan, a canal — Panama," and you get the idea. Another problem
is that Japanese commonly write words in several different ways."
Full article:
Tuesday, Oct. 2, 2007
FYI
SEARCH ENGINES
Kanji, kana trip search engines
Staff writer
Like the rest of the world, people in Japan rely on
search engines every day to tap the ocean of information that is the
World Wide Web.
[Image caption: Search engines' ability to locate and index Web sites written in Japan's three primary writing systems is nothing short of a technical miracle. (JAPAN TIMES)]
But despite the familiarity of Google, Yahoo and other popular search engines, what goes on behind the scenes once we enter our search terms is less understood by the general public.
Even more befuddling is the question of just how
these online services manage to locate and index Web sites written in
the three scripts used in Japanese.
Following are some questions and answers about search engines and how they parse one of the world's most complex writing systems:
How does a search engine work?
A search engine is an information-retrieval system
that searches documents on the World Wide Web based on specific
keywords, producing a list of documents containing those keywords.
Search engines use small programs known as Web
crawlers, spiders or bots to go out across the Internet and copy
documents they find for later processing.
Words in target documents are stored in long data
lists called indexes. Also, copies of the original documents themselves
are stored on server computers for retrieval should the original Web
pages be updated or deleted. This is known as caching.
Search engines are able to quickly locate Web pages
containing a user's search terms by scanning the indexes rather than
exhaustively looking at every word of every document stored in their
vast archives of cached documents.
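The index lookup described above can be illustrated with a toy inverted index. This is only a minimal sketch, not how any particular search engine is built: the page names, the sample texts and the functions `build_index` and `search` are all made up for illustration, and splitting on whitespace only works for languages that put spaces between words.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Whitespace splitting is an assumption that holds for English only.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's documents, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical cached pages, keyed by an invented page ID.
documents = {
    "page1": "ramen shops in Tokyo",
    "page2": "Tokyo travel guide",
    "page3": "how to cook ramen at home",
}

index = build_index(documents)
print(sorted(search(index, "ramen")))
print(sorted(search(index, "tokyo ramen")))
```

The point of the structure is that a query touches only the entries for its own words, never the full text of every stored page.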
The search engine ranks the documents it finds
according to proprietary methods of Web-site analysis. Google, for
example, says on its Web site that it looks at how a document is linked
to the rest of the Web, using the "collective intelligence of the Web
to determine a page's importance." Looking inside a document, it also
"factors in fonts, subdivisions and the precise location of each word,"
as well as content of neighboring pages to turn up relevant pages.
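Google's actual ranking method is proprietary, so the following is only a rough sketch of the general idea of scoring a page by how other pages link to it. It uses a simplified PageRank-style iteration; the function `link_rank`, the damping value and the four-page link graph are assumptions chosen for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links with a simplified PageRank-style iteration."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score to all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: each key links to the pages in its list.
links = {
    "blog": ["news", "wiki"],
    "news": ["wiki"],
    "wiki": ["news"],
    "orphan": ["wiki"],
}

scores = link_rank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the pages that nobody links to end up at the bottom, which is the "collective intelligence" effect in miniature.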
Why can searches in Japanese be frustrating?
Anyone looking for a document containing a keyword in
Japanese must know beforehand in which of the language's three writing
systems (the hiragana and katakana phonetic syllabaries, or kanji) it
is written.
For example, a basic Google search for "ramen" in
hiragana finds some 3 million documents, but misses tens of millions of
Web pages where the word is written in katakana only. Meanwhile, a
literature buff who is unsure of the archaic kanji for "wagahai" (an
old way of saying "I") is likely to miss out on many pages discussing
Natsume Soseki's classic novel "Wagahai wa Neko de Aru" ("I am a Cat").
When the user is looking for a phrase, rather than
individual words, entering all the possible combinations of hiragana,
katakana and new and old kanji can be prohibitively time-consuming.
Making matters worse, the three scripts are sometimes used in
unconventional form for humorous effect, throwing search engines off
the trail of many a quirky blog.
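One way a search tool could soften the hiragana-versus-katakana problem is to fold both scripts into one before comparing terms. The sketch below is an illustration, not what any search engine mentioned here is known to do. It relies on the fact that in Unicode each hiragana character sits 0x60 code points below its katakana counterpart; the function names are made up, and kanji spellings are not handled at all.

```python
HIRAGANA_START = 0x3041  # ぁ, first character of the hiragana block used here
HIRAGANA_END = 0x3096    # ゖ, last hiragana character with a katakana twin
KANA_OFFSET = 0x60       # distance from a hiragana to the matching katakana


def to_katakana(text):
    """Convert hiragana characters to katakana and leave everything else alone."""
    converted = []
    for ch in text:
        code = ord(ch)
        if HIRAGANA_START <= code <= HIRAGANA_END:
            converted.append(chr(code + KANA_OFFSET))
        else:
            # Katakana, kanji, Latin letters and the long-vowel mark pass through.
            converted.append(ch)
    return "".join(converted)


def same_word(query, indexed_term):
    """Compare two terms while ignoring the hiragana/katakana distinction."""
    return to_katakana(query) == to_katakana(indexed_term)


print(to_katakana("らーめん"))
print(same_word("らーめん", "ラーメン"))
```

With this kind of folding, a query typed in hiragana and a page written in katakana would be treated as the same term.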
OK, it's tough for us humans. But is it any easier from the computer's point of view?
Unfortunately, no. Experts say far more computing power is required to search Japanese text than English.
One reason is that Japanese doesn't use spaces. With
no spaces to separate words, search engines attempting to index a
document must work out for themselves where words begin and end.
Imagine having to figure out that "amanaplanacanalpanama" means, "A
man, a plan, a canal — Panama," and you get the idea. Another problem
is that Japanese commonly write words in several different ways.
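The "amanaplanacanalpanama" puzzle can be tried out with a small dictionary-based segmenter. This is only a sketch under stated assumptions: the seven-word dictionary is invented for the example, real Japanese segmenters rely on much larger lexicons plus statistical or grammatical scoring, and the function `segment` simply lists every split that the dictionary allows.

```python
from functools import lru_cache

# Tiny made-up dictionary; "am" and "an" are included to show ambiguity.
DICTIONARY = frozenset({"a", "am", "an", "man", "plan", "canal", "panama"})
MAX_WORD_LENGTH = max(len(word) for word in DICTIONARY)


def segment(text):
    """Return every way to split text into words from DICTIONARY."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one valid (empty) remainder.
        if start == len(text):
            return [[]]
        results = []
        last_end = min(len(text), start + MAX_WORD_LENGTH)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in DICTIONARY:
                # Keep this word and try to split whatever follows it.
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return [" ".join(words) for words in split_from(0)]


for candidate in segment("amanaplanacanalpanama"):
    print(candidate)
```

Even with this tiny dictionary the string has more than one valid reading, so a segmenter needs some further rule to pick among them, and that is where the extra computing effort goes.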
So the programmers at Yahoo Japan designed their
search engine to assume a user wants to find the term "hikkoshi"
(moving, or relocation) whether the query is in the correct kanji or
kana, or in the many common — but syntactically incorrect — variations.
Yahoo Japan's search engine also tries to figure out when someone is
using an archaic kanji or an uncommon katakana construction, but the
company is quick to acknowledge the coverage is only partial.
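The article does not say how Yahoo Japan implements this, so the following is purely a guess at the simplest possible approach: a lookup table that maps known spellings of a word to one canonical form. The table contents, `normalize_term` and `expand_query` are hypothetical, and a real system would need far more than a hand-written list.

```python
# Hypothetical variant table: each known spelling of "hikkoshi" points to
# one canonical form. The entries are illustrative, not taken from any
# real search engine.
CANONICAL_FORMS = {
    "引っ越し": "引っ越し",
    "引越し": "引っ越し",
    "引越": "引っ越し",
    "ひっこし": "引っ越し",
    "ヒッコシ": "引っ越し",
}


def normalize_term(term):
    """Return the canonical spelling, or the term itself if it is unknown."""
    return CANONICAL_FORMS.get(term, term)


def expand_query(term):
    """Return every known spelling that shares the term's canonical form."""
    canonical = normalize_term(term)
    variants = sorted(
        variant
        for variant, form in CANONICAL_FORMS.items()
        if form == canonical
    )
    # Unknown terms have no variants, so they are searched as typed.
    return variants or [term]


print(normalize_term("引越し"))
print(expand_query("ひっこし"))
```

A table like this covers only the spellings someone thought to list, which fits the company's own admission that its coverage is partial.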
Also, broadband-service provider NTT Resonant Inc.,
which operates the well-known Japanese portal Goo, is trying to improve
indexing by building giant databases of names for people, places and
organizations.
Perhaps more impressive from a computing standpoint,
the company is also programming its search engine to determine the
grammatical role played by each word in each document scanned. This
also improves indexing, and thus the quality of search results,
according to search services manager Masayuki Sugizaki.
At this point, how does an English-language search
compare with one in Japanese in terms of the reliability and ranking of
documents it finds?
Sugizaki said that because English is such a widely
spoken language worldwide, Web pages in that language are far more
interlinked than those in Japanese.
Because ranking is based so much on interlinkage on
the Web, Japanese-language search engines still have less information
to go on when trying to guide users to documents judged by others as
worthy. Sugizaki said Japanese search results are more likely to turn
up blog pages than English searches.
Is Japan, a country of great technological innovation, trying in any way to redefine the Web search?
Yes. This year, the government set out to take the
lead in next-generation search engines with a project to collect
consumer behavior data.
The idea is to take the Web search beyond entering
keywords or phrases into a blank and pushing the enter key. In the new
concept, computers will keep track of people's behavior and act on that
data, for example by notifying a wine buff with a GPS-equipped mobile
phone that bottles of Beaujolais Nouveau are on sale around the corner.
This year, the government allocated ¥4.6 billion for
the project, which consists of 10 partnerships with corporations and
government-affiliated organizations with expertise in related search
engine technologies.
Participants hope the move gives Japan an edge over
South Korea and Taiwan, whose high-tech proficiency is starting to
catch up with Japan's. However, as exciting as it may sound to retailers,
the newfangled robo-search engines raise obvious privacy concerns.
So a government committee has been tasked with
determining what kind of personal data, and how much, should be made
available to the next generation of search technology.