I'd been mulling over what to write in response to the talk Eric Miller gave to a small crowd here at UMW -- it was quite a treat. Fortunately, Gardner beat me to the punch with this post in response, so I'll go with responding to Gardner.
Gardner and I have spent many happy hours talking about the semantic web. Someday, I hope to win him over ;). I think that lurking in some of the hesitation about the semantic web is a perceived distinction between things social and serendipitous and things structured and formal. I think that I'm guilty of either overemphasizing the idea of structure or of badly misrepresenting how I think about structure. Probably both. Gardner's question is a good place for me to give it another stab. He writes:
I do continue to have questions about the idea of the “semantic web,” particularly as it seems to me to downplay the semantic energies of the document in favor of the clearer and more specifiable semantic energies of data. My training in the humanities locates meaning in documents, at least in the sense that documents are the things that make the case for meaning, and invite a response to meaning. Data, by contrast, is measurement and observation.
Documents and data are the heart of it:
I asked Dr. Miller to specify the distinction between “document” and “data,” and he replied “in the eye of the beholder.”
It was a particularly timely question--the same day there was a discussion on the DBpedia email list about what, exactly, is a 'document' anyway. By one (philosophical?) definition, people count as documents: "anything which can convey information to an observer" (Buckland 1997) (see here).
I'll start by saying that "data" is being used with a broader meaning than "measurement and observation." At the risk of pushing toward the philosophical, "data" means, I think, a collection of statements made about something. (The veracity of those statements is stickier, and at the heart of trust issues on the web. That's an issue for another time). It's that broader definition that makes the distinction between 'document' and 'data' a bit blurrier. A book is doubtless a collection of statements. It also is doubtless much more than that. The writer, reader, context, etc. bring meaning out of that collection of statements. Humans make meaning, computers don't, and so the semantic web (at least as I envision it) isn't about meaning-making. Humans will always be the heart and soul of that.
The semantic web will, however, help humans in their meaning-making mission. One way to see how is to start with data about documents that we, as scholars, are not only familiar with but depend on heavily. A good, scholarly book typically includes two collections of metadata about it: an index and a bibliography. Broadly speaking, these can both be called an "index" ☞ a pointer ☜ to resources. They both help scholars by guiding them toward particular items that they want and need in their work. They are lists of statements about the book: data. It'd be hard to deny the value of these collections of data (or are they subdocuments within the document of the book?).
Now, moving on into the semantic web, and especially the idea of gleaning semwebby data from the existing documents, imagine all the other potentially useful statements lurking implicitly in the book, just waiting to be made explicit. Say it's a book about literature, maybe about "Paradise Lost." What passages are quoted, and on what page? What secondary sources are quoted most frequently? What secondary sources are quoted most extensively (i.e., quotations from throughout the source, rather than just one chapter of it)? What sources roughly contemporary with "Paradise Lost" are cited, and on what pages? If everything is properly cited, indexed, and included in the bibliography, a computer can figure out what it needs to provide answers to these questions, but there's no way a human could provide this much indexing.
Just like the index and the bibliography, the answers to those questions basically come down to a collection of statements: data. It's data that's clearly a part of the document, but could conceivably be separated out: a symbiosis between data and document.
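To make "a collection of statements" a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Quotation` record, the function names, and the sample data ("Source A" and so on) are invented, and a real book would need far richer statements. It only shows that once each quotation is recorded as a statement (this page quotes that chapter of that source), the "most frequently" and "most extensively" questions reduce to simple counting.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Quotation(NamedTuple):
    """One statement: 'page P of the book quotes chapter C of source S'."""
    page: int
    source: str
    source_chapter: int


# Invented sample data, for illustration only.
QUOTATIONS = [
    Quotation(12, "Source A", 1),
    Quotation(15, "Source A", 1),
    Quotation(40, "Source A", 1),
    Quotation(22, "Source B", 2),
    Quotation(57, "Source B", 5),
    Quotation(90, "Source B", 9),
    Quotation(33, "Source C", 4),
]


def quotation_counts(quotations):
    """How often each source is quoted (frequency)."""
    return Counter(q.source for q in quotations)


def chapter_spread(quotations):
    """How many distinct chapters of each source are quoted (extensiveness)."""
    chapters = defaultdict(set)
    for q in quotations:
        chapters[q.source].add(q.source_chapter)
    return {source: len(seen) for source, seen in chapters.items()}


def pages_quoting(quotations, source):
    """An index entry for a source: the sorted pages on which it is quoted."""
    return sorted({q.page for q in quotations if q.source == source})
```

In this toy data, "Source A" and "Source B" are quoted equally often, but only "Source B" is quoted from several different chapters, which is the frequent-versus-extensive distinction drawn above.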
Why would this be useful? Start again with the familiar. The bibliography is an essential research tool. The first thing my dissertation director had me do was "go read everything; read the bibliographies first; report back in six months." Read enough bibliographies, and you've got a collection of who's studying what. ("Keeping current with the scholarship in the field," I think it used to be called when it was still a realistic expectation.)
Now, moving again into the possibilities for the future. Imagine that publications automatically get scraped so that questions like the ones above can be answered. (That is, imagine that all that implicit data gets rdf-ized.) Imagine further that web 2.0 content is similarly being exposed as rdf, and that it's all available through semantic web queries. With that, as a scholar I could do the following sorts of searches in addition to what we now have available through document-centered approaches:
- What books quote Paradise Lost Book V, lines 132-150?
- What blogs quote Paradise Lost Book V, lines 132-150?
- Who's quoting my scholarship (basically, trackbacks for print documents)?
- What sources are most frequently and extensively cited in scholarship about Paradise Lost?
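As a rough sketch of how searches like these might work, the Python below stores statements as (subject, predicate, object) triples, which is the shape RDF data takes, and answers the first three questions by pattern matching. It is only a toy under stated assumptions: the identifiers (`book:1`, `article:mine`), the predicate names (`type`, `quotes`, `cites`), and the helper functions are all hypothetical, and a real system would use actual RDF vocabularies and a query language such as SPARQL rather than this hand-rolled matcher.

```python
# Hypothetical identifier for the passage being searched for.
PASSAGE = "ParadiseLost/BookV/132-150"

# Invented statements, for illustration only.
TRIPLES = {
    ("book:1", "type", "Book"),
    ("book:2", "type", "Book"),
    ("blog:1", "type", "Blog"),
    ("book:1", "quotes", PASSAGE),
    ("blog:1", "quotes", PASSAGE),
    ("book:2", "quotes", "ParadiseLost/BookI/1-26"),
    ("book:2", "cites", "article:mine"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def quoting(triples, passage, kind):
    """Things of the given kind (e.g. 'Book', 'Blog') that quote a passage."""
    quoters = {s for s, _, _ in match(triples, p="quotes", o=passage)}
    of_kind = {s for s, _, _ in match(triples, p="type", o=kind)}
    return sorted(quoters & of_kind)


def citing(triples, work):
    """Everything that cites a given work: trackbacks for print."""
    return sorted({s for s, _, _ in match(triples, p="cites", o=work)})


books = quoting(TRIPLES, PASSAGE, "Book")
blogs = quoting(TRIPLES, PASSAGE, "Blog")
my_citations = citing(TRIPLES, "article:mine")
```

The point of the design is that books and blogs live in the same pool of statements, so the same question can be asked of both just by changing the kind.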
So, what's this all about? Every document has lots of data in it. The data can be gleaned, and the document remains a document as it always has been, no loss of meaning or anything. But with that data rdf-ized out, more complex, varied, powerful, and intricate links between documents can be exposed. And that helps researchers do their (human) job: make meaning from many sources.
nice blog - makes for a good recap of Semantic Web ideas to read over before my presentation. :o)