eureka!

5
THE INFORMATION AGE BY BRIAN HAYES Eureka! could take a million steps; the logarith- mic method no more than twenty. The logarithmic advantage is so dra- matic that for most of human history searching a large archive was simply not feasible without some kind of sorted index or concordance. But times have changed. That "find" or "search" com- mand in the menu bar will zip through megabytes of unsorted text so fastthat you can't measure the time between clicking the mouse and seeing the result. This newfound facility for searching without sorting raisesthe question of whether the abc sequence, which hasbeen the hallmark ofliteracy for millennia-the first acade- mic accomplishment of almost every child-can endure much longer as a uni- versal cultural artifact. The fact is, you needn't know anything about the seql/ence of letters in order to read and write. That sequence is useful only when you look up a name in a phone book or a word in a dictionary-skills that may rarely be exercised in the coming computopia. So will we keep drumming the alpha- bet into the minds of preschoolers? I expect we will, at least for a few more generations. Children who are compelled to master addition and subtraction in case they are ever marooned on a desert island and have to make change for a dollar after the batteries in their calculators run down-those children are not going to escape learning their abc's. A s FAR AS I KNOW, THE FIRST MECH- anized devices that could sort and search were the punch-card machines invented in the 1880s. The cards with the famous warning not to fold, spindle or mutilate were designed by Herman Hollerith, who founded the little com- pany that grew up to be IBM. The Hol- lerith card represented information by the presence or absence of holes punched in an array of columns and rows. I have a sentimental fondness for anoth- Forget the needle in the haystack. Modern searchers often don't know what they're lookingfor until they have found it I FIND IT MILDLY CURIOUS THAT TECH- nological aids to searching have come along only in the modern era. For oth- er intellectual labors-measuring, count- ing, reckoning time-people have relied on mechanical crutches from the outset. But searchers have mostly been on their own, with no instruments comparable to the ruler, the abacus or the clock. The sole exception-the one ancient inven- tion that unquestionably had a profound effect on searching-was the alphabet. It's not the letterforms themselves that aid the search but, rather, the idea of assigning the letters a canonical sequence. Strictly speaking, alphabetizing is a device to aid sorting, not searching; it's a way of putting things in order, from first to last. But sorting is a crucial adjunct to searching. Dictionaries and telephone directories would not be of much use without the alphabetic principle. Suppose you're trying to find a spe- cific entry in a long list of words or num- bers. If the list is unordered, the best you can do is scan from top to bottom. On average you'll have to look at half the entries before you come to the target, and if you're unlucky, all of them. By contrast, a list that is sorted offers a much quicker method. Instead of plodding through, item by item, you start at the midpoint and check to see whether the item you are looking for is above or below. Discard the half that doesn't include the target and divide the remain- der of the list in half again. Continuing by halves, you soon zero in on the tar- get. The number of comparisons need- ed isproportional to the logarithm of the list length-a number much smaller than the length itself For a list of a million words, the direct and unordered search same. When the next generation of epic poets writes of the great quest, the heroes will surely begin their adventures by con- sulting the oracle of AltaVista or Google. Catherine Widgery, Word Ball with Sea Shell, 1996 E VER YI30DY'S LOOKING FOR SOME- thing-the Holy Grail, the Prom- ised Land, the Northwest Passage, the source of the Nile, the great white whale, temps perdu, Shangri-la, the end of the rainbow. The detective searches for clues, the shopper for bargains, the plumber for leaks, the programmer for bugs. On the back page of the local newspaper, M's and F's of various de- scription are in search of one another, in various permutations. Much of sci- ence is framed as a search: for new par- ticles, new planets, new theorems. As for me, I spend a lot of time searching for my keys and eyeglasses. Searching, it seems, isjust one of those things we do; it's a part of human cul- ture, if not human nature. We're the species that walks, talks and can't remem- ber where the car is parked at the airport. How intriguing, then, that after so many centuries of unaided groping and nl111- maging we finally have some power tools for searching. The Internet has brought us "search engines." Machines for find- ing things-what a concept! A search engine may not help you retrieve that earring lost in the sofa cushions, but when it comes to anything represented as bits and bytes, searching will never be the H THE SCIENCES· ,\"lJ'c,"hcr/[)('(cl/lhcr 200()

Upload: brian

Post on 10-Apr-2017

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Eureka!

THE INFORMATION AGE

BY BRIAN HAYES

Eureka!

could take a million steps; the logarith­mic method no more than twenty.

The logarithmic advantage is so dra­matic that for most of human historysearching a large archive was simply notfeasible without some kind of sortedindex or concordance. But times havechanged. That "find" or "search" com­mand in the menu bar will zip throughmegabytes ofunsorted text so fastthat youcan't measure the time between clickingthe mouse and seeing the result. Thisnewfound facility for searching withoutsorting raises the question ofwhether theabc sequence, which hasbeen the hallmarkofliteracy for millennia-the first acade­mic accomplishment of almost everychild-can endure much longer as a uni­versal cultural artifact. The fact is, youneedn't know anything about the seql/enceof letters in order to read and write. Thatsequence is useful only when you lookup a name in a phone book or a word ina dictionary-skills that may rarely beexercised in the coming computopia.

So will we keep drumming the alpha­bet into the minds of preschoolers? Iexpect we will, at least for a few moregenerations. Children who are compelledto master addition and subtraction in casethey are ever marooned on a desert islandand have to make change for a dollarafter the batteries in their calculators rundown-those children are not going toescape learning their abc's.

A s FAR AS I KNOW, THE FIRST MECH­

anized devices that could sort andsearch were the punch-card machinesinvented in the 1880s. The cards withthe famous warning not to fold, spindleor mutilate were designed by HermanHollerith, who founded the little com­pany that grew up to be IBM. The Hol­lerith card represented information bythe presence or absence of holes punchedin an array of columns and rows.

I have a sentimental fondness for anoth-

Forget the needle in the haystack. Modern searchersoften don't know what they're lookingfor

until they havefound it

IFIND IT MILDLY CURIOUS THAT TECH­

nological aids to searching have comealong only in the modern era. For oth­er intellectual labors-measuring, count­ing, reckoning time-people have reliedon mechanical crutches from the outset.But searchers have mostly been on theirown, with no instruments comparableto the ruler, the abacus or the clock. Thesole exception-the one ancient inven­tion that unquestionably had a profoundeffect on searching-was the alphabet.

It's not the letterforms themselves thataid the search but, rather, the idea ofassigning the letters a canonical sequence.Strictly speaking, alphabetizing is a deviceto aid sorting, not searching; it's a wayof putting things in order, from first tolast. But sorting is a crucial adjunct tosearching. Dictionaries and telephonedirectories would not be of much usewithout the alphabetic principle.

Suppose you're trying to find a spe­cific entry in a long list ofwords or num­bers. If the list is unordered, the best youcan do is scan from top to bottom. Onaverage you'll have to look at half theentries before you come to the target,and if you're unlucky, all of them. Bycontrast, a list that is sorted offers a muchquicker method. Instead of ploddingthrough, item by item, you start at themidpoint and check to see whether theitem you are looking for is above orbelow. Discard the half that doesn'tinclude the target and divide the remain­der of the list in half again. Continuingby halves, you soon zero in on the tar­get. The number of comparisons need­ed is proportional to the logarithm ofthelist length-a number much smaller thanthe length itself For a list of a millionwords, the direct and unordered search

same. When the next generation ofepicpoets writes ofthe great quest, the heroeswill surely begin their adventures by con­sulting the oracle ofAltaVista or Google.

Catherine Widgery, Word Ballwith Sea Shell, 1996

EVER YI30DY'S LOOKING FOR SOME­

thing-the Holy Grail, the Prom­ised Land, the Northwest Passage,

the source of the Nile, the great whitewhale, temps perdu, Shangri-la, the endof the rainbow. The detective searchesfor clues, the shopper for bargains, theplumber for leaks, the programmer forbugs. On the back page of the localnewspaper, M's and F's of various de­scription are in search of one another,in various permutations. Much of sci­ence is framed as a search: for new par­ticles, new planets, new theorems. Asfor me, I spend a lot of time searchingfor my keys and eyeglasses.

Searching, it seems, isjust one ofthosethings we do; it's a part of human cul­ture, if not human nature. We're thespecies that walks, talks and can't remem­ber where the car is parked at the airport.How intriguing, then, that after so manycenturies of unaided groping and nl111­

maging we finally have some power toolsfor searching. The Internet has broughtus "search engines." Machines for find­ing things-what a concept! A searchengine may not help you retrieve thatearring lost in the sofa cushions, but whenit comes to anything represented as bitsand bytes, searching will never be the

H THE SCIENCES· ,\"lJ'c,"hcr/[)('(cl/lhcr 200()

Page 2: Eureka!

er style of punch card, which had holesonly near the perimeter. To record datayou converted specific holes into notch­es by snipping away the bit ofpaper thatseparated the hole from the edge of thecard. I can remember experimenting withsuch a scheme in my nerdy boyhood,using three-by-five cards filched from mymother's recipe box. Of course I didn'tinvent the idea. Hollerith himself con­sidered an edge-notched design beforesettling on his column-and-row format,Moreover, I have recently learned­through the wonders ofsearch engines­that several companies manufacturededge-notched cards from the 1920s intothe 1970s. One brand, called the McBeecard, was popular in libraries for keepingtrack ofbooks and borrowers.

WHAT IS MOST CHARMING ABOUT

edge-notched cards is that theyrequire none of the motorized machin­ery of the Hollerith system. The onlyequipment needed for sorting andsearching is a knitting needle. Considera deck of thirty-two cards numberedfrom ()through 31. Along one edge, eachcard has five holes, which are assignedthe values 1,2,4, 8 and 16. Notches arecut at the holes whose values add up tothe card number. For example, card fivehas notches at holes 1 and 4, whereas cardthirteen is notched at 1, 4 and 8.

Now suppose you want to retrievecard thirteen from an unsorted stack. Youbegin by inserting the needle into hole 1and letting the cards with a notch at thatposition fall away; card thirteen shouldbe among the dropped cards, so you gath­er them up for further processing, puttingthe other cards aside. Now, with thereduced deck, move the needle to hole2, but this time keep the cards thatremain on the skewer and set aside theones that drop off. At holes 4 and 8 youselect the cards that fall, and finally at hole16 you retain the un-notched card thatstayson the needle-there should be onlyone, and it should be card thirteen.

Finding one McBee card out of thir­ty-two takes five skewering operations.It's no coincidence that 5 is the logarithmto the base 2 of32. (In other words, 25 =32.) As that relation suggests, the card­search process is a logarithmic one.Admittedly, with only thirty-two cardsall this rigmarole hardly seems worth thebother, but the system begins to payoffwith larger decks. Logarithms grow slow­ly. To find one card in a thousand wouldtake just ten operations; one in a million,just twenty. That performance is impres­sive by any standard.

And there's more. If you're adroit

enough to knit with severalneedles at once, .it's possibleto search an edge-notcheddeck in a single step. Tomake the trick work, eachcard has to have a second setof holes and notches com­plementary to the first set;that is, wherever a holeappears in the first set, thesecond set has a notch, andvice versa. With such a dou­ble-punched deck, youmerely insert needles at theright positions, give a goodshake, and the unique select­ed card fallsout. The perfor­mance is even better thanlogarithmic; it is a "constant­time" algorithm, which runsjust as fast no matter howlarge the deck is.

If edge-notched cards aresuch hot stuff, why haven'tthey replaced all those clunkyold computers on our desks?W ell, for one thing, it wouldtake a knitting needle 800feet long to handle a millioncards. Then there's the job ofpunching all those notches.And, finally, the simplesearch described above­picking one number out ofa finite sequence of num­bers-is probably not thebest benchmark for evaluat­ing the merits of cardboardversus silicon.

TH E CLASSIC TESTING

ground for search algo­rithms is a procedure knownas exact string matching. Incomputer science a string is a sequence ofcharacters-letters, numbers, spaces,punctuation-with no higher-level struc­tures such as paragraphs or pages. Thinkof a string as a novel written on tickertape. The search task is to find all theoccurrences of one string (called the pat­tern) inside another string (the text). Usu­ally the pattern string ("great whitewhale") is much smaller than the text(Moby Dick).

A moment's thought suggests an algo­rithm. Write the pattern and the text onseparate strips of paper, and line themup on their leftmost characters. Now,working from left to right, compare eachcharacter in the pattern with the corre­sponding character in the text; if theyare all identical, note that you havefound a match. Now slide the patternone place to the right and compare all

the characters again. Continue until therightmost character in the pattern slidesoff the end of the text. For a pattern ofp characters and a text of t characters,the number of comparisons needed isapproximately p X t.

Another moment's thought suggestsan easy speedup: you don't always haveto compare all the characters. Since amatch has to be exact, you can stop com­paring and slide the pattern along to itsnext position as soon as you come to thefirst mismatch. To what extent does thisshortcut improve performance? If youhappen to be searching for the patternxxx in the text xxxxxxxxxx, the short­cut will never be taken, so it's not help­ful at all. But that's worst-case behavior.More typically, the shortcut can reducethe number of comparisons from p X tto something closer to p + t.

10 THE SCIENCES· ,\'",'('II,[,('rIO('(('III[,er 2000

Page 3: Eureka!

A FTER ALMOST TWENTY-FIVE YEARS,

the Boyer-Moore algorithm re­mains the fastest known method formany kinds of searches, and yet it is notquite the last word. One major draw­back is that it works only for exactstringmatches. In many a quest, you don't real­ly know what you're looking for, thoughyou hope you'll recognize it when yousee it. Such approximate pattern match­ing turns out to be a much harder prob­lem than exact searching.

It's possible to leave some slack in thesearch process by including "wild cards"in the pattern-a practice known vari­ously as glob bing or grepping, depend­ing on how it's done. Under glob rules,typing in fo* will bring up [o, [oo, fog,fork and much else; according to grepsyntax, the matches for fo* would be Jj(J, [oo, j(JOO, [oooo and so on. The twoprocesses look much the same, differingonly enough to ensure the occasionaldisastrous mistake.

In recent years the strongest impetusfor the development of approximatesearch algorithms has come from mol­ecular biologists. The decoding ofgenomes has given rise to a corpus oftext that runs to several billion charac­ters, all written in a four-letter alphabetthat ribosomes read fluently but thatpeople find opaque. Approximatematching is the only way to find any­thing interesting in this archive. Forexample, Homo sapiens has thousands ofgenes in common with the fruit flyDrosophila meianogaster, but because ofsubtle variations none of those sharedgenes would be recognized as the sameby an exact string search.

Thus genome searches attempt to mea­sure the similarity between sequences,usually in terms ofan "edit distance": theminimum number of changes needed toconvert one string into the other. Threekinds of editorial change must be takeninto account: substitutions, insertions anddeletions. Of the three, substitutions areeasy enough to cope with because theyare strictly local, affecting only one letterat a time. Insertions and deletions com­plicate matters because they alter thealignment ofthe string over a large region.Yet no adequate genetic search can be

Continued 011 Page 46

and efficiency, but writing a computerprogram that puts the algorithm intoaction is not as easy as I have made itseem. Note that my chosen patternword, algorithm, has no repeated letters.Figuring out how far ahead you can safe­ly skip is trickier when you're search­ing for banana or Mississippi,

algorithmalgebrai c algor i thms

are a waste of time, and thatno alignment ofthe first fourcharacters can possibly work.Although computers are notas good at glancing, with alittle preliminary work-bybuilding a table of the char­acter positions within thepattern-the computer willrecognize that it can safelyratchet the pattern along tothe fifth position.

That idea underlies asearch procedure workedout in the 1970s by JamesH. Morris ofCarnegie Mel­lon University in Pittsburgh,Pennsylvania, and DonaldE. Knuth and Vaughan R.Pratt, both of Stanford Uni­versity. The Knuth-Morris­Pratt algorithm never exam­ines any character in the textmore than once, so it isguar­anteed to require no morethan p + t comparisons.

Another technique canoften do even better. Again,line up the pattern at the startof the text:

As in the earlier algorithms,the pattern moves from leftto right along the text. Butin this algorithm, you readthe pattern backward, begin­ning with the last letter, asyou compare it with the text.That change of directionmight seem insignificant­after all, the letters are thesame no matter which way

you read them-but, in fact, swimming"upstream" has an important advantage.In this example, the advantage is appar­ent assoon asyou find the first mismatch,between m in the pattern and c in thetext. Because the letter cdoes not appearanywhere in the pattern, you can safelyshift the pattern forward by its entirelength, skipping nine positions to theright in the text. Through such leapfrogtactics, the algorithm searches a textwithout ever looking at most ofits char­acters. Although the worst-case perfor­mance is still about p + t comparisons,typical results are much better.

The upstream algorithm was invent­ed around 1975 by Robert S. Boyer andJ Strother Moore, both of the Univer­sity of Texas at Austin, and indepen­dently by R. William Gosper. It iswidely admired for its sheer cleverness

MAKIN G FURTHER IMPROVEMENTS

requires more than a moment'sthought. Indeed, the quest to improveexact string search algorithms occupiedan entire community of computer sci­entists, offand on, throughout the 1960sand 1970s. The techniques that came outof that effort are ingenious, but simpleand obvious they are not.

And yet the basic principles are not somysterious. Suppose you are searchingfor the pattern algorithm in the text alge­braic algorithms. Starting from the left, thefirst three characters are confirmed tomatch, but at the fourth position you findan 0 opposite an e. At this point the naivealgorithm would slide the pattern oneplace to the right and start the charac­ter-by-character comparison again. Cer­tainly, though, a human searcher couldsee at a glance that all those comparisons

James Crable, Telephone Boxes, London, England, 1986

Novcmb crr Dcccnibcr 2000 • THE SCIENCES 11

Page 4: Eureka!

THE INFORMATION AGE

Continued from Page 11

conducted without considering them; asgenes evolve, bits of DNA are constant­ly being snipped out or spliced in.

Approximate genome searching iscomputationally expensive. Given twostrings of length n, finding the edit dis­tance between them takes roughly n2

steps-which makes the process farslower than checking for an exactmatch. The job turns out to be so labo­rious that it simply cannot be done rou­tinely when screening a new DNAsequence against a large genome data­base. Instead, the screening programsdo an approximate approximate stringmatch: they estimate the edit distancewithout actually finding the optimumsequence of editing operations.

THE INTERNET WOULD APPEAR TO BE

the ideal application for search al­gorithms-the perfect test case for meth­ods that have continued to be refinedover the years. Ironically, though, whensearching the Net one encounters diffi­culties that make classic search algorithmsill-suited to the task. Part of the prob­lem is the monstrous size of the Inter­net, which would overwhelm any algo­rithm that attempted to searchsequentially, character by character.Another complication is that peoplesearching the Net are seldom trying toretrieve a specific document. Instead,their searches are a form ofexploration­an attempt to find "what's out there" onsome topic. .

The need for Internet search tools hadalready been recognized a decade ago,before the W orld-Wide Web made theInternet a mass medium. In that quaint,pre-Web era, which now seems asremote as the age of Morse-code teleg­raphy, people searched the Net withArchie or WAIS, the Wide-Area Infor­mation Service, and what resulted fromthe search was an unadorned list of filenames. When the Web wasborn, the firstsearch aids were manually compileddirectories, the ancestors ofsuch servicesasYahoo! and the volunteer Open Direc­tory Project. Such directories have theimportant virtue of organizing the con­tent of the Web by subject, providing ataxonomy ofknowledge that issimilar toa library's subject catalogue. Unfortu­nately, though, maintaining the hierar­chical structure requires so much effortthat such handcrafted directories take inonly a small fraction of the total Web.

Search engines, which emerged inthe early 1990s, have broader coverage;

some claim to index more than a thou­sand million pages. But because of thatvast scale, the search process bears lit­tle resemblance to what happens whenyou click the "find" command in aword-processing program. Instead,search engines rely on a three-phaseprocess. The first phase is carried outlong before you log on and type in yoursearch term. A program called thecrawler, or spider, surveys the Web, sys­tematically visiting pages and retrievingtext and other content. Crawling theWeb takes weeks or months, which iswhy a search sometimes sends you topages that no longer exist.

The crawler brings back gigabytes orterabytes of data-too much for anysequential search algorithm to handle.Thus to make fast searches possible, thecollection ofretrieved documents has tobe organized in some way, and that isthe second phase of the engine's opera­tion. In principle, the text could be sort­ed alphabetically to enable a logarithmicsearch, but there are better schemes.Most search engines rely on a methodcalled the vector-space model. At theheart of the model is a gigantic matrixwith a column for every document anda row for every anticipated search term.Thus the matrix could have a million

OFTEN YOU DON'T

know what you'relooking for, though youhope you'll recognize

it when you see it.

columns and 100,000 rows. The entriesin the matrix record how many timeseach term appears in each document.

The giant matrix is put to use in thethird and final phase ofa search engine'scycle-s-the response to your query. Theprocess is straightforward: for each wordin the query, the engine goes to the appro­priate row of the matrix and chooses allcolumns that have a non-zero entry. Theprocess takes only milliseconds-and it'sa good thing, too. Large search enginesget millions ofqueries a day, which worksout to some hundreds a second; they'vegot to pump out answers at the same rateor the input hopper will overflow.

A s IT HAPPENS, FINDING DOCU­

ments that match a given query isthe easiest part of the search engine'sjob. The real challenge is figuring outthe best way to dump the majority of

those matches. With at least a thousandmillion Web pages out there, queriesroutinely yield 10,000 or 100,000 hits.The search engine must somehowdivine which of those documents bestmeet the user's needs and list them first.

The enormous matrix built into thevector-space model is the key to thewinnowing process. The matrix can begiven a geometric interpretation. Therows (corresponding to the potentialsearch terms) define the dimensions ofan imaginary space, and the entries ineach column (representing a document)give the coordinates of a point in thatspace. For a small matrix-say, threerows and five columns-it's easy to seehow this plan works. The rows corre­spond to length, width and heightdimensions, and the columns define fivepoints that lie somewhere inside thisvolume. Adding another 99,997 spatialdimensions makes the model harder tovisualize, but mathematically it worksthe same way.

How can such a bristling, multidi­mensional abstraction help you find yourway around the Web? It's remarkablystraightforward. The search engine sim­ply treats the query as if it were anotherWeb document brought back by thecrawler. It talliesall the terms in the queryand plots the position ofthe correspond­ing point in the vector-space model. Thebest answers then become the documentsthat lie nearest the query point.

THE ALGORITHMS THAT RANK THE

hits are becoming more sophisti­cated. The innovation that I find mostinteresting was pioneered by the Googlesearch engine, which was developed bySergey Brin and Lawrence Page whenthey were still doctoral students at Stan­ford University. In its rankings, Googleconsiders not only the content ofa pagebut also the number ofother pages thathave links to it. Presumably, people whocreate such links think the page is worthvisiting, so an abundance oflinks can betaken as a strong recommendation.

Not only do Web search engineshave to find and rank pages, but theyalso have to cope with pages designedto manipulate or deceive; for example,some Web designers rig the system byadding invisible extraneous keywords.Such tactics would be a bizarre aberra­tion in other contexts-you don'texpect the books in a library catalogueto disguise their content and vie for thenotice ofreaders-but on the Web, thesearcher and the searchee are oftenadversaries. Moreover, the search en­gines themselves can turn traitor: some

46 THE SCIENCES • November/December 2000

Page 5: Eureka!

accept payment from the owners ofWeb pages for preferential placement.

NEW TECHNOLOGIES THAT TRANS­

form daily life tend to dazzle andannoy us in equal measure. My cellulartelephone is a wonderful conveniencewhen I'm caught in traffic, but yours isan abomination when it rings in the the­ater. The automated teller machine is amarvel ofthe age when it hands me mon­ey in a foreign city, but not when it re­fusesmy card at the ATM around the cor­ner. Search engines seem to fall into thesame category, innovations we love tohate. They are windows into a world of

that alwaysthe moral ofthe quest genre-­that the search for meaning ends by find­ing meaning in the search itself?

Some weeks ago, curious about thoseedge-notched cards that fascinated mein childhood, I launched a Web search.Soon I was led to the Dead Media Pro­ject, which turned out to be a celebra­tion of bygone information technolo­gies. Among the dead media was theIndecks Information Retrieval System,a set ofedge-notched cards that StewartBrand reviewed in the Last Whole EarthCatalog in 1971. There was also a refer­ence to McBee cards, which was how Ifirst learned of their existence.

ration, in the period after the punch-cardcompany merged with the Royal Type­writer Company. I learned that the Roy­al-McBee LGP-30 was the computer onwhich Edward Lorenz discovered deter­ministic chaos. I even found the Website of the company in its current incar­nation, McBee Systems, Inc., which isnow a purveyor of office supplies.

THIS LONG CHAIN OF LINKS WAS

leading me in fascinating direc­tions-but not where I wanted to go.And, despite several hours ofdiligent anddiverting searches, I never did find any­thing more about McBee cards on the

Patrick Hughes, BookLook, 1997

riches, except when the windows arebroken or closed or dirty or distorted.

Indeed, searching for the term "win­dows" is all it takes to illustrate thefrustrations of Web searching. Googlereturns "about 20,700,000" documents;needless to say, I didn't look at allofthem,but the first few dozen had nothing to dowith panes ofglass-and quite a lot to dowith a certain software manufacturer.

For now, though, I remain far moreenchanted than frustrated. Accessto thosethousand million documents is priceless,even if the path to them is sometimesindirect. Consulting Google has becomemy first reflex for answering every ques­tion. What was the first Internet searchengine? Who said, "Cherchez lafemme"?What is the Echelon program? And it'snot just the answers that delight, if youever find them. The very process ofsearching has charms ofitsown, and moreoften than not the best outcome is find­ing what you weren't looking for. Isn't

Searching for "indecks" turned updozens of Web pages, but they were amixed and mysterious lot. One bore thetitle "And once devoured, through eter­nal night one lollipop glows steady likea ..." Not much help there. It took awhile to deduce the unifying thread:nearly all ofthe hits were pages in whichthe file name "index.html" had beenmisspelled "indecks.html."

Searching for "McBee card" took mefirst to the crash of a B-25 bomber in1943 (I still don't understand why).More useful was a work of fiction byCharles Brownson titled "The AltairLizard," in which a character explains theedge-notched mechanism, speculates onthe fate of the McBee Company andcomplains of difficulty in ordering sup­plies. Following further leads, I discov­ered a dozen scattered copies of a nerd­humor posting called "The Legend ofMel, the Real Programmer." Melworked for the Royal-McBee Corpo-

Web. Instead, I found what I was seek­ing in the library, where I searched byrunning my finger slowly along thespines of books until I came to a fifty­year-old volume by Howard F. McGawtitled Marginal Punched Cards in CollegeandResearch Libraties. Thumbing throughits pages, I learned much ofwhat I want­ed to know about McBee cards-forinstance, that the first patents were issuedin 1896, and that the main users werelibraries and chemists. So here was a casein which gigabytes and terabytes and1OO,OOO-dimensional vector spaceswereoutdone by the Dewey decimal system.Evidently, there are still places the Netcan't take us.

But don't gloat, Luddites. The nextperson who asksa Web search engine tofind McBee cards will very likely bereferred to thisdocument.•

BRIAN HAYES isafreelance writer andafor­mereditor ifAMERICAN SCIENTIST.

November/December 2000· THE SCIENCES 47