TRANSCRIPT
THE INFORMATION AGE
BY BRIAN HAYES
Eureka!
could take a million steps; the logarithmic method no more than twenty.
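The halving procedure described here is binary search; it can be sketched in a few lines of Python (my own illustration, not code from the article):

```python
def binary_search(sorted_list, target):
    """Locate target in a sorted list by repeated halving.

    Returns the index of target, or -1 if absent. The number of
    probes grows with the logarithm of the list length.
    """
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1      # discard the lower half
        else:
            hi = mid - 1      # discard the upper half
    return -1
```

For a list of a million sorted entries, the loop runs at most about twenty times.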
The logarithmic advantage is so dramatic that for most of human history searching a large archive was simply not feasible without some kind of sorted index or concordance. But times have changed. That "find" or "search" command in the menu bar will zip through megabytes of unsorted text so fast that you can't measure the time between clicking the mouse and seeing the result. This newfound facility for searching without sorting raises the question of whether the abc sequence, which has been the hallmark of literacy for millennia - the first academic accomplishment of almost every child - can endure much longer as a universal cultural artifact. The fact is, you needn't know anything about the sequence of letters in order to read and write. That sequence is useful only when you look up a name in a phone book or a word in a dictionary - skills that may rarely be exercised in the coming computopia.
So will we keep drumming the alphabet into the minds of preschoolers? I expect we will, at least for a few more generations. Children who are compelled to master addition and subtraction in case they are ever marooned on a desert island and have to make change for a dollar after the batteries in their calculators run down - those children are not going to escape learning their abc's.
AS FAR AS I KNOW, THE FIRST MECHANIZED devices that could sort and search were the punch-card machines invented in the 1880s. The cards with the famous warning not to fold, spindle or mutilate were designed by Herman Hollerith, who founded the little company that grew up to be IBM. The Hollerith card represented information by the presence or absence of holes punched in an array of columns and rows.
I have a sentimental fondness for anoth-
Forget the needle in the haystack. Modern searchers often don't know what they're looking for until they have found it.
I FIND IT MILDLY CURIOUS THAT TECHNOLOGICAL aids to searching have come along only in the modern era. For other intellectual labors - measuring, counting, reckoning time - people have relied on mechanical crutches from the outset. But searchers have mostly been on their own, with no instruments comparable to the ruler, the abacus or the clock. The sole exception - the one ancient invention that unquestionably had a profound effect on searching - was the alphabet.
It's not the letterforms themselves that aid the search but, rather, the idea of assigning the letters a canonical sequence. Strictly speaking, alphabetizing is a device to aid sorting, not searching; it's a way of putting things in order, from first to last. But sorting is a crucial adjunct to searching. Dictionaries and telephone directories would not be of much use without the alphabetic principle.
Suppose you're trying to find a specific entry in a long list of words or numbers. If the list is unordered, the best you can do is scan from top to bottom. On average you'll have to look at half the entries before you come to the target, and if you're unlucky, all of them. By contrast, a list that is sorted offers a much quicker method. Instead of plodding through, item by item, you start at the midpoint and check to see whether the item you are looking for is above or below. Discard the half that doesn't include the target and divide the remainder of the list in half again. Continuing by halves, you soon zero in on the target. The number of comparisons needed is proportional to the logarithm of the list length - a number much smaller than the length itself. For a list of a million words, the direct and unordered search
same. When the next generation of epic poets writes of the great quest, the heroes will surely begin their adventures by consulting the oracle of AltaVista or Google.
Catherine Widgery, Word Ball with Sea Shell, 1996
EVERYBODY'S LOOKING FOR SOMETHING - the Holy Grail, the Promised Land, the Northwest Passage,
the source of the Nile, the great white whale, temps perdu, Shangri-la, the end of the rainbow. The detective searches for clues, the shopper for bargains, the plumber for leaks, the programmer for bugs. On the back page of the local newspaper, M's and F's of various description are in search of one another, in various permutations. Much of science is framed as a search: for new particles, new planets, new theorems. As for me, I spend a lot of time searching for my keys and eyeglasses.
Searching, it seems, is just one of those things we do; it's a part of human culture, if not human nature. We're the species that walks, talks and can't remember where the car is parked at the airport. How intriguing, then, that after so many centuries of unaided groping and rummaging we finally have some power tools for searching. The Internet has brought us "search engines." Machines for finding things - what a concept! A search engine may not help you retrieve that earring lost in the sofa cushions, but when it comes to anything represented as bits and bytes, searching will never be the
8 THE SCIENCES · November/December 2000
er style of punch card, which had holes only near the perimeter. To record data you converted specific holes into notches by snipping away the bit of paper that separated the hole from the edge of the card. I can remember experimenting with such a scheme in my nerdy boyhood, using three-by-five cards filched from my mother's recipe box. Of course I didn't invent the idea. Hollerith himself considered an edge-notched design before settling on his column-and-row format. Moreover, I have recently learned - through the wonders of search engines - that several companies manufactured edge-notched cards from the 1920s into the 1970s. One brand, called the McBee card, was popular in libraries for keeping track of books and borrowers.
WHAT IS MOST CHARMING ABOUT edge-notched cards is that they require none of the motorized machinery of the Hollerith system. The only equipment needed for sorting and searching is a knitting needle. Consider a deck of thirty-two cards numbered from 0 through 31. Along one edge, each card has five holes, which are assigned the values 1, 2, 4, 8 and 16. Notches are cut at the holes whose values add up to the card number. For example, card five has notches at holes 1 and 4, whereas card thirteen is notched at 1, 4 and 8.
Now suppose you want to retrieve card thirteen from an unsorted stack. You begin by inserting the needle into hole 1 and letting the cards with a notch at that position fall away; card thirteen should be among the dropped cards, so you gather them up for further processing, putting the other cards aside. Now, with the reduced deck, move the needle to hole 2, but this time keep the cards that remain on the skewer and set aside the ones that drop off. At holes 4 and 8 you select the cards that fall, and finally at hole 16 you retain the un-notched card that stays on the needle - there should be only one, and it should be card thirteen.
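The five skewering passes can be modeled in a few lines of Python, representing a card's notch pattern by the bits of its number (my own sketch, not part of the article):

```python
def select_card(deck, target):
    """Simulate the knitting-needle search on edge-notched cards.

    A card numbered n is notched at hole v (v in 1, 2, 4, 8, 16)
    exactly when bit v is set in n. At each hole we keep the dropped
    cards if the target is notched there, otherwise the cards that
    stay on the needle.
    """
    for hole in (1, 2, 4, 8, 16):
        if target & hole:                         # target is notched here
            deck = [c for c in deck if c & hole]       # keep cards that fall
        else:
            deck = [c for c in deck if not c & hole]   # keep cards on the skewer
    return deck
```

After the five passes, `select_card(list(range(32)), 13)` leaves only card thirteen.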
Finding one McBee card out of thirty-two takes five skewering operations. It's no coincidence that 5 is the logarithm to the base 2 of 32. (In other words, 2^5 = 32.) As that relation suggests, the card-search process is a logarithmic one. Admittedly, with only thirty-two cards all this rigmarole hardly seems worth the bother, but the system begins to pay off with larger decks. Logarithms grow slowly. To find one card in a thousand would take just ten operations; one in a million, just twenty. That performance is impressive by any standard.
And there's more. If you're adroit enough to knit with several needles at once, it's possible to search an edge-notched deck in a single step. To make the trick work, each card has to have a second set of holes and notches complementary to the first set; that is, wherever a hole appears in the first set, the second set has a notch, and vice versa. With such a double-punched deck, you merely insert needles at the right positions, give a good shake, and the unique selected card falls out. The performance is even better than logarithmic; it is a "constant-time" algorithm, which runs just as fast no matter how large the deck is.
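The one-shake trick can be simulated as well (again my own sketch): a card drops only if it is notched at every needled position, so needling the target's own notch pattern - primary holes where its bits are set, complementary holes where they are clear - drops the target alone.

```python
def shake_out(deck, target, bits=(1, 2, 4, 8, 16)):
    """One-step selection with a double-punched deck.

    Each card carries ten positions: a primary hole per bit, plus a
    complementary hole notched exactly where the primary one is not.
    """
    def is_notched(card, bit, primary_set):
        # notched at the primary position iff the bit is set;
        # notched at the complementary position iff the bit is clear
        return bool(card & bit) == primary_set

    needles = [(bit, bool(target & bit)) for bit in bits]
    return [card for card in deck
            if all(is_notched(card, bit, s) for bit, s in needles)]
```

Any other card differs from the target in some bit, has a hole rather than a notch at that needled position, and stays on its needle - constant time regardless of deck size.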
If edge-notched cards are such hot stuff, why haven't they replaced all those clunky old computers on our desks? Well, for one thing, it would take a knitting needle 800 feet long to handle a million cards. Then there's the job of punching all those notches. And, finally, the simple search described above - picking one number out of a finite sequence of numbers - is probably not the best benchmark for evaluating the merits of cardboard versus silicon.
THE CLASSIC TESTING ground for search algorithms is a procedure known as exact string matching. In computer science a string is a sequence of characters - letters, numbers, spaces, punctuation - with no higher-level structures such as paragraphs or pages. Think of a string as a novel written on ticker tape. The search task is to find all the occurrences of one string (called the pattern) inside another string (the text). Usually the pattern string ("great white whale") is much smaller than the text (Moby Dick).
A moment's thought suggests an algorithm. Write the pattern and the text on separate strips of paper, and line them up on their leftmost characters. Now, working from left to right, compare each character in the pattern with the corresponding character in the text; if they are all identical, note that you have found a match. Now slide the pattern one place to the right and compare all the characters again. Continue until the rightmost character in the pattern slides off the end of the text. For a pattern of p characters and a text of t characters, the number of comparisons needed is approximately p × t.
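The sliding comparison translates directly into Python (a sketch of the naive method, assuming plain strings):

```python
def naive_match(pattern, text):
    """Compare every pattern character at every alignment: ~p * t work."""
    p = len(pattern)
    hits = []
    for i in range(len(text) - p + 1):
        matched = True
        for j in range(p):
            if pattern[j] != text[i + j]:
                matched = False     # no early exit: always p comparisons
        if matched:
            hits.append(i)
    return hits
```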
Another moment's thought suggests an easy speedup: you don't always have to compare all the characters. Since a match has to be exact, you can stop comparing and slide the pattern along to its next position as soon as you come to the first mismatch. To what extent does this shortcut improve performance? If you happen to be searching for the pattern xxx in the text xxxxxxxxxx, the shortcut will never be taken, so it's not helpful at all. But that's worst-case behavior. More typically, the shortcut can reduce the number of comparisons from p × t to something closer to p + t.
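Instrumenting the shortcut makes the savings visible; this variant counts comparisons as it goes (an illustrative sketch):

```python
def match_with_early_exit(pattern, text):
    """Stop comparing at the first mismatch before sliding along.

    Returns (hit positions, number of character comparisons made);
    on typical text the count is closer to p + t than to p * t.
    """
    p, comparisons, hits = len(pattern), 0, []
    for i in range(len(text) - p + 1):
        for j in range(p):
            comparisons += 1
            if pattern[j] != text[i + j]:
                break            # first mismatch: slide to the next position
        else:
            hits.append(i)
    return hits, comparisons
```

Searching for xxx in xxxxxxxxxx still costs the full 24 comparisons (three at each of eight positions), but on ordinary text most positions fail after a single comparison.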
AFTER ALMOST TWENTY-FIVE YEARS, the Boyer-Moore algorithm remains the fastest known method for many kinds of searches, and yet it is not quite the last word. One major drawback is that it works only for exact string matches. In many a quest, you don't really know what you're looking for, though you hope you'll recognize it when you see it. Such approximate pattern matching turns out to be a much harder problem than exact searching.
It's possible to leave some slack in the search process by including "wild cards" in the pattern - a practice known variously as globbing or grepping, depending on how it's done. Under glob rules, typing in fo* will bring up fo, foo, fog, fork and much else; according to grep syntax, the matches for fo* would be f, fo, foo, fooo and so on. The two processes look much the same, differing only enough to ensure the occasional disastrous mistake.
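Python's standard library happens to implement both conventions, which makes the divergence easy to demonstrate; `fnmatch` follows glob rules and `re` follows grep-style regular-expression rules:

```python
import fnmatch
import re

words = ["f", "fo", "foo", "fog", "fork", "fooo"]

# Glob rules: * stands for any run of characters after the literal "fo".
glob_hits = [w for w in words if fnmatch.fnmatch(w, "fo*")]
# → ['fo', 'foo', 'fog', 'fork', 'fooo']

# Grep (regex) rules: * means zero or more repeats of the preceding "o".
grep_hits = [w for w in words if re.fullmatch("fo*", w)]
# → ['f', 'fo', 'foo', 'fooo']
```

The same five characters, two different languages - exactly the setup for the occasional disastrous mistake.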
In recent years the strongest impetus for the development of approximate search algorithms has come from molecular biologists. The decoding of genomes has given rise to a corpus of text that runs to several billion characters, all written in a four-letter alphabet that ribosomes read fluently but that people find opaque. Approximate matching is the only way to find anything interesting in this archive. For example, Homo sapiens has thousands of genes in common with the fruit fly Drosophila melanogaster, but because of subtle variations none of those shared genes would be recognized as the same by an exact string search.
Thus genome searches attempt to measure the similarity between sequences, usually in terms of an "edit distance": the minimum number of changes needed to convert one string into the other. Three kinds of editorial change must be taken into account: substitutions, insertions and deletions. Of the three, substitutions are easy enough to cope with because they are strictly local, affecting only one letter at a time. Insertions and deletions complicate matters because they alter the alignment of the string over a large region. Yet no adequate genetic search can be
Continued on Page 46
and efficiency, but writing a computer program that puts the algorithm into action is not as easy as I have made it seem. Note that my chosen pattern word, algorithm, has no repeated letters. Figuring out how far ahead you can safely skip is trickier when you're searching for banana or Mississippi.
algorithm
algebraic algorithms
are a waste of time, and that no alignment of the first four characters can possibly work. Although computers are not as good at glancing, with a little preliminary work - by building a table of the character positions within the pattern - the computer will recognize that it can safely ratchet the pattern along to the fifth position.
That idea underlies a search procedure worked out in the 1970s by James H. Morris of Carnegie Mellon University in Pittsburgh, Pennsylvania, and Donald E. Knuth and Vaughan R. Pratt, both of Stanford University. The Knuth-Morris-Pratt algorithm never examines any character in the text more than once, so it is guaranteed to require no more than p + t comparisons.
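The preliminary table and the never-look-back scan fit in a short routine; this is a textbook rendering of Knuth-Morris-Pratt, not the authors' own code:

```python
def kmp_search(pattern, text):
    """Knuth-Morris-Pratt: never re-examine a text character.

    The failure table records, for each prefix of the pattern, the
    length of the longest proper prefix that is also a suffix - i.e.,
    how far the pattern can safely fall back after a mismatch.
    """
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]        # fall back without rereading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```

Each text character is consumed once, so the work is bounded by roughly p + t comparisons, as the article states.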
Another technique can often do even better. Again, line up the pattern at the start of the text:
As in the earlier algorithms, the pattern moves from left to right along the text. But in this algorithm, you read the pattern backward, beginning with the last letter, as you compare it with the text. That change of direction might seem insignificant - after all, the letters are the same no matter which way you read them - but, in fact, swimming "upstream" has an important advantage. In this example, the advantage is apparent as soon as you find the first mismatch, between m in the pattern and c in the text. Because the letter c does not appear anywhere in the pattern, you can safely shift the pattern forward by its entire length, skipping nine positions to the right in the text. Through such leapfrog tactics, the algorithm searches a text without ever looking at most of its characters. Although the worst-case performance is still about p + t comparisons, typical results are much better.
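The leapfrog idea can be sketched with the simplified "bad character" rule (Horspool's later streamlining of Boyer-Moore); this is my own illustration, not the original authors' program:

```python
def horspool_search(pattern, text):
    """Skip ahead using the text character under the pattern's last slot.

    If that character occurs nowhere in the pattern, the whole pattern
    length is skipped; most text characters are never examined at all.
    """
    p = len(pattern)
    # shift distance for each character that can appear under the last slot
    skip = {ch: p - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits, i = [], 0
    while i <= len(text) - p:
        if text[i:i + p] == pattern:    # the right-to-left comparison
            hits.append(i)
        i += skip.get(text[i + p - 1], p)
    return hits
```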
The upstream algorithm was invented around 1975 by Robert S. Boyer and J Strother Moore, both of the University of Texas at Austin, and independently by R. William Gosper. It is widely admired for its sheer cleverness
MAKING FURTHER IMPROVEMENTS requires more than a moment's thought. Indeed, the quest to improve exact string search algorithms occupied an entire community of computer scientists, off and on, throughout the 1960s and 1970s. The techniques that came out of that effort are ingenious, but simple and obvious they are not.
And yet the basic principles are not so mysterious. Suppose you are searching for the pattern algorithm in the text algebraic algorithms. Starting from the left, the first three characters are confirmed to match, but at the fourth position you find an o opposite an e. At this point the naive algorithm would slide the pattern one place to the right and start the character-by-character comparison again. Certainly, though, a human searcher could see at a glance that all those comparisons
James Crable, Telephone Boxes, London, England, 1986
THE INFORMATION AGE
Continued from Page 11
conducted without considering them; as genes evolve, bits of DNA are constantly being snipped out or spliced in.
Approximate genome searching is computationally expensive. Given two strings of length n, finding the edit distance between them takes roughly n^2 steps - which makes the process far slower than checking for an exact match. The job turns out to be so laborious that it simply cannot be done routinely when screening a new DNA sequence against a large genome database. Instead, the screening programs do an approximate approximate string match: they estimate the edit distance without actually finding the optimum sequence of editing operations.
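The n^2 cost comes from a dynamic program that fills a table of prefix-to-prefix costs; here is the textbook version (a sketch, not the screening programs' actual code):

```python
def edit_distance(a, b):
    """Minimum substitutions, insertions and deletions to turn a into b.

    Cell d[i][j] holds the cost of converting a[:i] into b[:j];
    filling the whole (m+1) x (n+1) table is the n^2 work.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete everything
    for j in range(n + 1):
        d[0][j] = j                 # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]
```

Insertions and deletions are what force the full table: a single indel shifts the alignment of everything downstream, so local comparison is not enough.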
THE INTERNET WOULD APPEAR TO BE the ideal application for search algorithms - the perfect test case for methods that have continued to be refined over the years. Ironically, though, when searching the Net one encounters difficulties that make classic search algorithms ill-suited to the task. Part of the problem is the monstrous size of the Internet, which would overwhelm any algorithm that attempted to search sequentially, character by character. Another complication is that people searching the Net are seldom trying to retrieve a specific document. Instead, their searches are a form of exploration - an attempt to find "what's out there" on some topic.
The need for Internet search tools had already been recognized a decade ago, before the World Wide Web made the Internet a mass medium. In that quaint, pre-Web era, which now seems as remote as the age of Morse-code telegraphy, people searched the Net with Archie or WAIS, the Wide-Area Information Service, and what resulted from the search was an unadorned list of file names. When the Web was born, the first search aids were manually compiled directories, the ancestors of such services as Yahoo! and the volunteer Open Directory Project. Such directories have the important virtue of organizing the content of the Web by subject, providing a taxonomy of knowledge that is similar to a library's subject catalogue. Unfortunately, though, maintaining the hierarchical structure requires so much effort that such handcrafted directories take in only a small fraction of the total Web.
Search engines, which emerged in the early 1990s, have broader coverage; some claim to index more than a thousand million pages. But because of that vast scale, the search process bears little resemblance to what happens when you click the "find" command in a word-processing program. Instead, search engines rely on a three-phase process. The first phase is carried out long before you log on and type in your search term. A program called the crawler, or spider, surveys the Web, systematically visiting pages and retrieving text and other content. Crawling the Web takes weeks or months, which is why a search sometimes sends you to pages that no longer exist.
The crawler brings back gigabytes or terabytes of data - too much for any sequential search algorithm to handle. Thus to make fast searches possible, the collection of retrieved documents has to be organized in some way, and that is the second phase of the engine's operation. In principle, the text could be sorted alphabetically to enable a logarithmic search, but there are better schemes. Most search engines rely on a method called the vector-space model. At the heart of the model is a gigantic matrix with a column for every document and a row for every anticipated search term. Thus the matrix could have a million
OFTEN YOU DON'T know what you're looking for, though you hope you'll recognize it when you see it.
columns and 100,000 rows. The entries in the matrix record how many times each term appears in each document.
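In miniature, such a term-document matrix might be built like this (an illustrative sketch with a toy vocabulary, using bare word counts):

```python
def term_document_matrix(docs, terms):
    """One row per search term, one column per document.

    Entry [r][c] counts occurrences of terms[r] in docs[c].
    """
    return [[doc.lower().split().count(term) for doc in docs]
            for term in terms]
```

A real engine's matrix is mostly zeros, so it is stored in compressed (inverted-index-like) form rather than as literal rows and columns.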
The giant matrix is put to use in the third and final phase of a search engine's cycle - the response to your query. The process is straightforward: for each word in the query, the engine goes to the appropriate row of the matrix and chooses all columns that have a non-zero entry. The process takes only milliseconds - and it's a good thing, too. Large search engines get millions of queries a day, which works out to some hundreds a second; they've got to pump out answers at the same rate or the input hopper will overflow.
AS IT HAPPENS, FINDING DOCUMENTS that match a given query is the easiest part of the search engine's job. The real challenge is figuring out the best way to dump the majority of those matches. With at least a thousand million Web pages out there, queries routinely yield 10,000 or 100,000 hits. The search engine must somehow divine which of those documents best meet the user's needs and list them first.
The enormous matrix built into the vector-space model is the key to the winnowing process. The matrix can be given a geometric interpretation. The rows (corresponding to the potential search terms) define the dimensions of an imaginary space, and the entries in each column (representing a document) give the coordinates of a point in that space. For a small matrix - say, three rows and five columns - it's easy to see how this plan works. The rows correspond to length, width and height dimensions, and the columns define five points that lie somewhere inside this volume. Adding another 99,997 spatial dimensions makes the model harder to visualize, but mathematically it works the same way.
How can such a bristling, multidimensional abstraction help you find your way around the Web? It's remarkably straightforward. The search engine simply treats the query as if it were another Web document brought back by the crawler. It tallies all the terms in the query and plots the position of the corresponding point in the vector-space model. The best answers then become the documents that lie nearest the query point.
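"Nearest" in this space is conventionally measured by the angle between vectors (cosine similarity), so that long documents don't automatically win; a minimal sketch of ranking documents against a query point:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-count vectors; 1.0 means
    they point the same way, 0.0 means no terms in common."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest_documents(query_vec, doc_vecs):
    """Return document indices ordered from nearest to farthest."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)
```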
THE ALGORITHMS THAT RANK THE hits are becoming more sophisticated. The innovation that I find most interesting was pioneered by the Google search engine, which was developed by Sergey Brin and Lawrence Page when they were still doctoral students at Stanford University. In its rankings, Google considers not only the content of a page but also the number of other pages that have links to it. Presumably, people who create such links think the page is worth visiting, so an abundance of links can be taken as a strong recommendation.
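A crude version of the link heuristic just counts how many pages point at each page; this toy sketch is a stand-in for the real computation (which also weights a link by the importance of the page casting it):

```python
from collections import Counter

def rank_by_inlinks(links):
    """links maps each page to the pages it links to.

    Rank pages by their number of incoming links, treating each
    link as a recommendation.
    """
    inbound = Counter(dst for dsts in links.values() for dst in dsts)
    return sorted(links, key=lambda page: inbound[page], reverse=True)
```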
Not only do Web search engines have to find and rank pages, but they also have to cope with pages designed to manipulate or deceive; for example, some Web designers rig the system by adding invisible extraneous keywords. Such tactics would be a bizarre aberration in other contexts - you don't expect the books in a library catalogue to disguise their content and vie for the notice of readers - but on the Web, the searcher and the searchee are often adversaries. Moreover, the search engines themselves can turn traitor: some
accept payment from the owners of Web pages for preferential placement.
NEW TECHNOLOGIES THAT TRANSFORM daily life tend to dazzle and annoy us in equal measure. My cellular telephone is a wonderful convenience when I'm caught in traffic, but yours is an abomination when it rings in the theater. The automated teller machine is a marvel of the age when it hands me money in a foreign city, but not when it refuses my card at the ATM around the corner. Search engines seem to fall into the same category, innovations we love to hate. They are windows into a world of
that always the moral of the quest genre - that the search for meaning ends by finding meaning in the search itself?
Some weeks ago, curious about those edge-notched cards that fascinated me in childhood, I launched a Web search. Soon I was led to the Dead Media Project, which turned out to be a celebration of bygone information technologies. Among the dead media was the Indecks Information Retrieval System, a set of edge-notched cards that Stewart Brand reviewed in the Last Whole Earth Catalog in 1971. There was also a reference to McBee cards, which was how I first learned of their existence.
ration, in the period after the punch-card company merged with the Royal Typewriter Company. I learned that the Royal-McBee LGP-30 was the computer on which Edward Lorenz discovered deterministic chaos. I even found the Web site of the company in its current incarnation, McBee Systems, Inc., which is now a purveyor of office supplies.
THIS LONG CHAIN OF LINKS WAS leading me in fascinating directions - but not where I wanted to go. And, despite several hours of diligent and diverting searches, I never did find anything more about McBee cards on the
Patrick Hughes, BookLook, 1997
riches, except when the windows are broken or closed or dirty or distorted.
Indeed, searching for the term "windows" is all it takes to illustrate the frustrations of Web searching. Google returns "about 20,700,000" documents; needless to say, I didn't look at all of them, but the first few dozen had nothing to do with panes of glass - and quite a lot to do with a certain software manufacturer.
For now, though, I remain far more enchanted than frustrated. Access to those thousand million documents is priceless, even if the path to them is sometimes indirect. Consulting Google has become my first reflex for answering every question. What was the first Internet search engine? Who said, "Cherchez la femme"? What is the Echelon program? And it's not just the answers that delight, if you ever find them. The very process of searching has charms of its own, and more often than not the best outcome is finding what you weren't looking for. Isn't
Searching for "indecks" turned up dozens of Web pages, but they were a mixed and mysterious lot. One bore the title "And once devoured, through eternal night one lollipop glows steady like a ..." Not much help there. It took a while to deduce the unifying thread: nearly all of the hits were pages in which the file name "index.html" had been misspelled "indecks.html."
Searching for "McBee card" took me first to the crash of a B-25 bomber in 1943 (I still don't understand why). More useful was a work of fiction by Charles Brownson titled "The Altair Lizard," in which a character explains the edge-notched mechanism, speculates on the fate of the McBee Company and complains of difficulty in ordering supplies. Following further leads, I discovered a dozen scattered copies of a nerd-humor posting called "The Legend of Mel, the Real Programmer." Mel worked for the Royal-McBee Corpo-
Web. Instead, I found what I was seeking in the library, where I searched by running my finger slowly along the spines of books until I came to a fifty-year-old volume by Howard F. McGaw titled Marginal Punched Cards in College and Research Libraries. Thumbing through its pages, I learned much of what I wanted to know about McBee cards - for instance, that the first patents were issued in 1896, and that the main users were libraries and chemists. So here was a case in which gigabytes and terabytes and 100,000-dimensional vector spaces were outdone by the Dewey decimal system. Evidently, there are still places the Net can't take us.
But don't gloat, Luddites. The next person who asks a Web search engine to find McBee cards will very likely be referred to this document. •
BRIAN HAYES is a freelance writer and a former editor of AMERICAN SCIENTIST.