Higher Education in the Information Age

Digital Libraries: The Revolution in Scholarly Information

Michael Lesk


Paul Mosher: What we're trying to do in the spirit of what we're talking about today is identify the academic domain on the Internet. Since clearly both advertising and entertainment control vast economic resources, if somebody doesn't worry about management of the content of the Internet--if we don't do it, nobody will.

I'd like to welcome you to the session on Digital Libraries: The Revolution in Scholarly Information. "Information" is the information in the "Information Age". It's the subject of scholarly information. Information is both the food and the product of colleges and universities that we all represent. It's the only, so far as I know, absolutely inexhaustible human resource. As long as Homo sapiens exists, there will be information. Information, however, is a very Silly Putty word, I think, for those of you who think about it. Now it is, and I will use it in the broadest sense here to mean really words, data, knowledge, objects, sounds, information properly so called--you name it, it's included I think today.

Now, the amount of information available in the world has increased exponentially each year--this is published information--since about the middle of the 18th century, the Age of Enlightenment, spurred by the growth and spread of popular learning, the popular press, and the invention of scientific information. The rate of information growth has further accelerated as a result of the introduction of digital information. Earlier you talked about junk. Even the non-junk is becoming too much.

During the last several hundred years, societies have looked to libraries to provide repositories for information and to make it available to readers, lookers, listeners, and scholars. The beginning of the Information Age, which I believe is more accurately termed "The Age of the Small Machine", has radically transformed the nature of libraries. For two generations, technologists predicted the death of libraries, the function of which would simply be replaced by machine functions. Not only has that disappearance failed to take place, but the role of libraries in the Age of the Small Machine is becoming clearer, if not truly clear. In fact, all kinds of assemblies or stores of digital information or data exist. They are often called, another approximate term, "digital libraries".

What are digital libraries? Will they end up replacing libraries as we have known them? What does it all mean? What will be the role of libraries in a higher education environment, transformed by the uses of technology? To answer these and certainly to introduce others, we have two very different kinds of librarians with us today who are going to talk about digital libraries, different kinds of digital libraries, and what they may mean for us.

The first of our two speakers, and I will introduce them seriatim and then ask if you'll hold your comments until both have spoken, is Michael Lesk, who on the one hand I identified as Chief Scientist at Bellcore, on the other hand I identified from his home page as big enchilada of the computer science research department at Bellcore. And any of you who are interested in extremely knowledgeable and entertaining home pages should look up Michael's, if only to see his three photographs: the professional one, the amateur one, and the coin-operated one. Michael worked with the group that built UNIX and wrote UNIX tools for word processing. He's worked on a large chemical information system, the Core Project, with Cornell, OCLC, ACS, and CAS. He's been a visiting professor in computer science at University College, London and has received a number of other honors and awards. I'm looking forward to the appearance of his new promised book, Practical Digital Libraries: Books, Bytes, and Bucks, which is supposed to appear this summer.

Michael Lesk: O.K. What I'm going to say, basically, is that technically, we know how to build digital libraries. What we don't know how to do is make a self-sustaining system that will pay for itself. And so the message I'm going to give you is that libraries need to be expansionist. I don't know how to solve the economic problems within the context of libraries as they're defined today. I see some possibilities if we expand the concept and include some of the other economic flows. This is going to be an eat-or-be-eaten world.

We can build desktop journal delivery today. It's available from people like J-STOR. It provides better services to the readers. They can get things on their desk. They don't have to walk all the way to the library. But it doesn't produce any more money for the libraries. I have yet to see the government department that says, "You've given us American Economic Review on-line on our desktop. We'll vote for more money for the library next go around." Web publishing. We can get professors' papers out into the world faster, but by a system that doesn't involve any payment so that somebody can support it. We can do long distance cooperation. I've heard a lot this morning about long distance cooperation, but it breaks down the individual loyalties within the universities. We can scan the contents of book stacks cheaper than we can build central campus libraries, but there's no university I know where those are exchangeable kinds of money--library operations and a building. We can also imagine, from the competitive side, that publishers will say, "Wait a minute. There's a big market for undergraduate textbooks. I'll deliver those straight to the students, bypass the library." Well, then what's left in the library that needs existence? So, I summarize with this line from Yogi Berra that what we have here is an insurmountable opportunity. What? Pogo? Sorry. I saw it from Yogi Berra. Maybe he was quoting Pogo.

Dean Farrington this morning talked about 15th century people who didn't believe in printing, and as it happens I have some quotes with me by people who said the world has gotten along without printing; printing will never be as good as copying out by hand. And in fact, 2,000 years before these people, Socrates told his people that writing was just a crutch for memory, that, you know, people who wrote things down instead of memorizing them would have the show of wisdom but not the reality. A free reprint of a Scientific American article on the Internet to any of the classicists in the audience if you can tell me which Platonic dialogue.

Audience Member: Phaedrus.

Lesk: Right.

Audience Member: It was written down.

Lesk: Here's another one. This is a Wall Street Journal ad from a British building materials conglomerate named Hanson. It's got a picture of a chip and it says, "This is the latest thing in microprocessors. In 12 months it will be obsolete. This is a brick. In 12 months it will still be the latest thing in bricks. We're Hanson. We invest in things of permanent value." So there are a lot of Luddites. And of course, we've seen some of the promises of new libraries before. But on the other hand we're also making a lot of progress. This is the drawing Life magazine made for Vannevar Bush's memex in 1945, the viewing screens and microfilm reels and the little chemical processing plant to create new microfilm, all inside the desk.

But more seriously about the progress. This is a chart of the number of bytes of text on the Web. It's on a log scale. It's going up a factor of ten every year. It's now at two terabytes. The Library of Congress is normally thought to be about 20 terabytes, which means basically next year there'll be as much text on the Web as there is in the Library of Congress. You may say, "Well, poo poo. I don't believe your numbers." So if I'm off by 12 months, you know, big deal. It will be another year. In terms, of course, of English material in the last two years, the Web is already ahead of the Library of Congress. In terms of evaluated material, well, that's another story.
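The "next year" estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes the tenfold yearly growth stays constant, and the function name `years_to_reach` is invented for illustration. Only the ratio between the two sizes matters, so the unit is irrelevant as long as both figures use the same one.

```python
import math

def years_to_reach(current, target, growth_per_year):
    """Years until `current` reaches `target` at a constant yearly multiplier."""
    return math.log(target / current) / math.log(growth_per_year)

# Figures quoted in the talk (same unit for both).
web_text = 2.0              # text on the Web today
library_of_congress = 20.0  # usual estimate for the Library of Congress
growth = 10.0               # "a factor of ten every year"

years = years_to_reach(web_text, library_of_congress, growth)

# Hypothetical slower case: growth of only sqrt(10) per year (an assumption,
# not a figure from the talk) stretches the wait to about two years.
years_if_slower = years_to_reach(web_text, library_of_congress, math.sqrt(10.0))

print(f"at 10x per year: {years:.1f} years")
print(f"at ~3.2x per year: {years_if_slower:.1f} years")
```

Even a large error in the growth rate moves the crossover by a year or so, which is the point of the "big deal" remark.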

Everybody uses libraries on-line. Let me ask a question. Every one of you, think of the last time you had to look up something. There was something you wanted to know and you didn't remember it. And I want to ask for a show of hands, looking it up on paper versus looking it up on the screen. How many on paper? How many on a screen? O.K., overwhelmingly for the screens. That's the way it is. As I've said, there are a lot of examples of researchers who use, depend critically, on the electronic libraries. The preprint physics server at Los Alamos by Paul Ginsparg. The protein and genome databases. 4,000 books, all US legal decisions. Now the problem is there are still tenure committees who sort of have this attitude of "Well, what's a digital library?" A lot of people know the story of Greg Crane who got denied tenure with no consideration given to the Perseus work. But on the other side, there are undergraduates going around saying, "What's a paper library?" One MIT professor told me that he had to put a stipulation on his students: 10% of all the references in their papers had to be not URLs. And I was visiting Cornell and one of the professors shared that he complained to one of his undergraduates about no paper references, and the student threw his hands up and said, "I don't do libraries."

There's a lot of stuff in digital libraries. I don't have the demos that Silicon Graphics had, but I have my own things. O.K. We're getting into audio, into video, into images. Now you have to understand, images are, indeed, expensive. This is three bytes. This is 1,000 bytes. That's 12,000 bytes. Nevertheless, large image libraries are common. We don't know how to search them, but we're getting them and we're going to be dependent on them. We get sound. For several years I ran a system where I like to listen to the news in my office, and I couldn't listen to it at the right time. So I plugged a radio into my workstation in the US and I digitized National Public Radio all day long, and then I could listen to the news programs whenever I wanted to. I also got a friend in the UK to plug a radio into his computer, and it recorded Radio Four all day long. And I'd drag over the news program every morning, and I could listen to the BBC news if I had the time. And I could speed it up, and I could clip things out and I could save them. What?

Audience Member: Do you ever go back and listen to them?

Lesk: Yes! In fact, I still cite one of the results which I don't know a written source for, which is that somebody on NPR said that 10% of the expenses in the clothing industry were wages and 27% were information costs. And so, you know, better information was worth more than low wage labor. And I've never seen that anywhere else, so I still have to cite it.

There's also a lot of interest in maps. These are four maps of Cranford, NJ. I'm afraid perhaps only the people in the front can see this. This line is the same railroad in each map. This is 1878, the modern USGS quad, SPOT satellite imagery, aerial photographs. This stuff is now all digital and on the Web. This stuff is all digital and will be on NASA's site soon. This admittedly we still have to scan. The old maps--the Library of Congress has just announced they're going to be able to scan all of the Cranford Fire Insurance maps. This is an example of what you might do. There's a little lake here which has disappeared from all the modern representations. Presumably it got filled in. If you wanted to dig a big basement at that point in Cranford, you'd probably like to know about that.

This is one reason why the Congress is willing to support work on the Internet. This is the US balance of trade in data and information services. We run a ten to one positive balance of trade with the rest of the world and--anybody here ever logged into a Japanese information server? They log into ours all the time. That's why. So Congress thinks this is good. Now, in fact, let me try another one. How many of you have looked up something on the Web when you knew that it was available on paper in your local library? Good show of hands. How many of you have gone to your local library to look up something which you knew was on the Web? O.K. One. Not many. So we actually have a lot of enthusiasm for this, but how do we pay for it?

Here's some numbers. Library budgets. Typical in the United States. Library spends something on buildings, services, processing, acquisitions. Suppose it goes electronic. Well, you save some money on building. Everything else goes up. And no university lets you transfer building money to other services, so you're behind the eight ball. What about the publishers? These are numbers from the American Economic Association, courtesy of Malcolm Getz. They get 38% of their revenue from individual membership subscriptions. They only spend 23% on printing. So they look at these numbers and say, "If we switched to all electronic distribution, and the library copies serve the purpose within the university, we probably lose most of our membership subscriptions. We wouldn't save enough to compensate. We'd have to double the bill for libraries." That's not very attractive. So we don't know how to make that one work.
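The publisher's side of that argument can be sketched numerically. This is an illustration only, under two loudly stated assumptions that are not in the talk: that the 38% and 23% figures are fractions of roughly the same total (revenue about equal to expenses), and that going electronic would eliminate the entire printing cost. The names below are invented for the example.

```python
# Figures quoted in the talk (American Economic Association).
MEMBERSHIP_SHARE = 0.38  # share of revenue from individual memberships
PRINTING_SHARE = 0.23    # share of spending that goes to printing

def net_shortfall(fraction_of_members_lost):
    """Revenue lost minus printing saved, as a fraction of the total budget.

    Assumes all printing cost disappears (an optimistic assumption) and that
    both shares are measured against the same total.
    """
    return MEMBERSHIP_SHARE * fraction_of_members_lost - PRINTING_SHARE

# If every individual member dropped out, the gap would be about 15% of budget.
worst_case = net_shortfall(1.0)

# Fraction of members the association could lose before it is worse off.
break_even_loss = PRINTING_SHARE / MEMBERSHIP_SHARE

print(f"gap if all members leave: {worst_case:.0%} of budget")
print(f"break-even membership loss: {break_even_loss:.0%}")
```

Under these assumptions the association comes out behind once it loses more than about three fifths of its individual members, which is why "most of our membership subscriptions" is the worrying case.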

Here's another one we don't know how to make work. Here's the comparable cost of scanning books versus building libraries. J-STOR is paying 20 cents a page to scan books. That would be about $60 a book. The Cornell "CLASS" project paid about $30 per book scanned. The Core project was about seven cents a page. That would be about $21 per book. The Making of America has just put out a lot of scanning at eight and a half cents a page. I think the price could come down to $3 a book if we really did this on a large scale. J-STOR may show whether that's true. But we're talking about numbers from $20 to $30 a book.
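The per-book figures follow from the per-page prices once a book length is fixed. The sketch below assumes 300 pages per book, which is not stated in the talk but is the length implied by both quoted conversions (20 cents a page to $60, seven cents a page to $21). The names are invented for the example.

```python
PAGES_PER_BOOK = 300  # assumption: implied by 20 cents/page ~ $60 and 7 cents/page ~ $21

# Per-page scanning prices quoted in the talk, in cents.
cents_per_page = {
    "J-STOR": 20.0,
    "Core project": 7.0,
    "Making of America": 8.5,
}

def cost_per_book(cents, pages=PAGES_PER_BOOK):
    """Scanning cost in dollars for one book of `pages` pages."""
    return cents * pages / 100.0

costs = {name: cost_per_book(c) for name, c in cents_per_page.items()}

for name, dollars in costs.items():
    print(f"{name}: ${dollars:.2f} per book")
```

On the same assumption the Making of America price works out to roughly $25 a book, inside the $20 to $30 range quoted for scanning.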

What about building? Cornell built a new stack recently at $30 a book--or $20. Berkeley built one at $30. Now, admittedly, the Berkeley one, here it's yours, Peter, it's, you know it's built to withstand a Force A earthquake, but it is more expensive than it would have been to scan the books. I'm sorry what?

Audience Member: You scan it once and everybody can have it.

Lesk: Yes! That's the J-STOR principle. Share the scanning. J-STOR negotiated deals with its things. I mean, you know, each of these libraries, the British Library and the Bibliothèque de France, cost far more to build than it would have cost to scan their content. In fact, somebody once told me back in the days of microfilm that if, instead of building the British Library, they had microfilmed every book in London, then taken all the books and put them in warehouses in North Wales, built a smaller building at the same price per square foot in London to hold the film, and then to deal with the complaints of the readers that the books were out in North Wales, given each reader a lifetime season ticket on British Rail to North Wales, they would still have saved 100 million pounds over what actually went on.

Now, there are a lot of problems. One of the things libraries worry about is permanence and preservation. And in fact, Don recently wrote, Don Waters, recently wrote a report on preservation, but let's see. How many people know the first telephone call? "Watson, come here, I need you!" We know that one. We know the first telegram. And these both are over a hundred years old. Anybody know the first e-mail? We do not know the first e-mail. We don't even know in which city it was sent. So we really have problems on preservation. We're going to have to learn that one. Now, what are some of the--

Audience Member: Just going to have to make it up!

Lesk: I mean, one of my friends in the early '80s went looking for the first e-mail message, and even though it was less than 20 years, or about 20 years after the date, she could not pin it down. Nobody had accurate enough notes. All right.

What are some of the problems? Well, some of these have been already--quality. Gene Spafford says that Usenet is a herd of elephants with diarrhea. There was a comment this morning about did anyone teach these students how to evaluate things they found on the Web. It's not just students that have the problem. A number of American reporters have been writing stories about some town in Northern Ireland based on information they got off a Sinn Fein Web site without realizing that Sinn Fein had generated it.

Loyalty. I asked this question before. I went through one random journal, the only journal for which Bellcore still had 30 years on paper in its library. I counted the number of papers which had been co-authored where all of the co-authors were from the same institution, and it ran about 30%. And now it's started to drop off. And I say this is the influence of fax machines and e-mail. People can collaborate with anyone in the world.

Do we get shared experiences? Universities have been introducing things like core curriculum so that all the students will have something in common, and we heard about that from Morgan Friedman this afternoon. Well, what happens if everybody is out on the Web doing their own thing? The other side of that is can we preserve diversity? Is there a danger that if everybody is looking for the greatest multi-media presentation, that it won't be affordable to make many of them? Arn deWhiter, the head of research at Elsevier, once said to me, "Look," he said, "we publish a dozen elementary college physics textbooks. No problem. Do it all the time. We may be able to fund one really good CD ROM with animation and everything else. Now maybe you don't care if we only have one college physics CD ROM. Do you really want there to be only one American history CD ROM? One philosophy CD ROM? Maybe that's dangerous."

Equality of access. How do we get the Web out everywhere? You know, a trivial matter: eight percent of the men in this country are color blind. There are all sorts of things you'd like to make work, and a big thing is recognition. How do we reward people for what they do on the Web? Any of you work at places where the tenure committees actually value on-line publication? I guess the man from Glasgow, Professor Davies, can say that you do because the research assessment exercise requires you to. But it's not that common.

Finally, I listened to Doug Van Houweling talking about distance learning and we're going to help out these small universities, and I worry about the small university president who says, "Oh, I can't afford faculty members to teach Russian, and I don't have enough students who want it. I'll buy it from the University of Michigan." Well, next year it'll be math. The year after that it'll be--you know, why have a history--why have a library or a faculty at all? Let's just buy everything in from big universities. And that points--there was a comment by Susan Fuhrman that there wasn't much work on educational research. That's a very serious problem. There's very little good development of courseware, and we're not studying it. In theory, self-paced instruction should be a big win. Everybody believes that self-paced instruction should produce an enormous saving. But the reality is we haven't achieved it. We can do math and language drill, and other than that we're looking at a lot of failures with programmed instruction and research, and then the balance of research--

And of course getting out to diversity. Actually, speaking of diversity, there's a whole lot of discussion about gender access to computers. Anybody know what's the best selling CD ROM in the country? The best non-software CD ROM: Barbie Fashion Designer. First one. And all your universities have problems. "There prevails today an extensive and wasteful competitive duplication of plant and personnel among American universities, particularly in the graduate schools." Thorstein Veblen in 1918. So anyway.

I want to wind up with telling you what universities should do. All right? O.K. One comment is, you want to use the Web. You all have university presses. Well, some of you have probably abolished them recently. The Web is the alternative. You have to get out there and say, "Look. We're going to have--we're going to make the University of Pennsylvania Web page a place that people are proud to publish. Maybe we'll have all sorts of student Web pages with people giving SEPTA schedules or something like that, but in addition we're going to have some area which will be as prestigious as the university press and can economically survive when the university press can't." And that means giving awards, encouraging people to develop tools, and encouraging bonding through the--we've heard some wonderful stories from Myra Lotto about students bonding through local coursework. We need more stories like that. We're going to find that students won't care where they were physically located. In fact, maybe professors won't care; if you can get the whole Stanford library on-line, maybe you can do all the research without being at Stanford. And it's much easier to get jobs at other universities.

We need to teach people how to find and evaluate things. The Web does have this huge pile of junk. You do need some skills. We have a tradition in libraries of teaching people about imprints and binding and, you know, where did something come from. We need the Web equivalent. So how do you look at something and decide whether it's any good? And you're not going to be able to rely on whether it's on decent paper or newsprint.

And another thing I think we don't do enough of is we don't support the recognition of new forms of creativity. We still have an academic system in which the written English essay is all important, and people in art and music have been complaining for generations that their paintings, their compositions didn't get equal attention. Well, this is now going to come with a vengeance on software, on what do we do to say to students, "If you do really exciting art or music--if you develop a good tool that lets people who are not professional musicians do something useful with music--" this is something that should be heavily rewarded.

Also collaboration techniques. We don't have a large supply of people who are good at both writing and art and music. You know, as I think back through history, well we've got Blake and we've got Rossetti and maybe you can come up with one or two more but it's not common. But how do we see that we can encourage people to collaborate, so that a student who knows art and a student who--a student who can draw and a student who can write can work together? And I think we need to know how to do this and how to reward it, and the universities have to work in this area or you won't make it. Other people will come in.

O.K. That's my argument as to what should be done.



Contact: heia@pobox.upenn.edu
URL: http://www.upenn.edu/heia/proceed/present/lesktrans.html
Last modified: 26 January 1998