New UMW Data up

I've spent the last several weeks doing what amounts to a complete rewrite of the scripts that scrape data from UMWBlogs. The code is now much more modular, which I hope will make it much easier to expand to new data sets. The first priority is grabbing feeds from a wider variety of sources. After that, I'll start tapping into linked open data sources.

The upshot of the rewrite is a chain of classes, each focused on bringing in data from a particular source. It's all still built on ARC, but wrapped in classes that each pay attention to one particular kind of data. There's a generalized class for ingesting data into the data store (the language of 'ingesting' is borrowed from repository apps like Fedora), but in practice the more specific classes do the real work. For example, there's a class focused on scraping data out of the content of a post, which gets loaded into a big DOMDocument. It pulls out the <a> nodes and hands them to a class designed to deal with those. The same goes for <img> nodes, <embed> nodes, and the tags associated with the post. I'm using SimplePie as the starting point for parsing the feed, so all of that falls into place pretty quickly.
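To make the node-scraping step concrete, here is a minimal sketch in Python rather than the PHP that the actual scripts use. It is not the real code: the class name `PostContentScraper` and the function `scrape_post` are hypothetical, and Python's standard `html.parser` stands in for DOMDocument. It only shows the general idea of pulling <a>, <img>, and <embed> URLs out of a post's content so each kind can be handed to its own handler.

```python
from html.parser import HTMLParser


class PostContentScraper(HTMLParser):
    """Hypothetical helper: collects link, image, and embed URLs from post HTML."""

    # Map each tag of interest to the attribute that carries its URL.
    URL_ATTRS = {"a": "href", "img": "src", "embed": "src"}

    def __init__(self):
        super().__init__()
        # One list of URLs per tag type, in document order.
        self.found = {tag: [] for tag in self.URL_ATTRS}

    def handle_starttag(self, tag, attrs):
        attr_name = self.URL_ATTRS.get(tag)
        if attr_name is None:
            return  # not a tag we care about
        url = dict(attrs).get(attr_name)
        if url:
            self.found[tag].append(url)


def scrape_post(html):
    """Parse a post's HTML content and return the URLs grouped by tag name."""
    scraper = PostContentScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.found
```

Calling `scrape_post` on a post's content string returns a dictionary keyed by tag name, which a more specific class for each node type could then consume.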

Following the hierarchy of the feed, the chain runs through sub-ingestors. So an instance of the class that deals with a SimplePie FeedItem instantiates the classes for <a> nodes, <img> nodes, etc. as needed. Then everything gets dumped into the triplestore from the top down.
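The chain of sub-ingestors can be sketched roughly as follows, again in Python rather than the PHP of the real scripts. Everything here is an assumption made for illustration: the class names, the dictionary standing in for a feed item, the predicates (`dc:title`, `ex:linksTo`, `ex:hasImage`), and the plain list standing in for the ARC triplestore. The point is only the shape: the item-level ingestor creates child ingestors as needed, and triples are collected from the top down.

```python
class Ingestor:
    """Generic base class: holds triples and any sub-ingestors."""

    def __init__(self):
        self.triples = []
        self.children = []

    def add_triple(self, subject, predicate, obj):
        self.triples.append((subject, predicate, obj))

    def collect(self):
        """Return this ingestor's triples, then each child's, top down."""
        out = list(self.triples)
        for child in self.children:
            out.extend(child.collect())
        return out


class LinkIngestor(Ingestor):
    """Handles one <a> node found in a post (predicate is hypothetical)."""

    def __init__(self, post_uri, href):
        super().__init__()
        self.add_triple(post_uri, "ex:linksTo", href)


class ImageIngestor(Ingestor):
    """Handles one <img> node found in a post (predicate is hypothetical)."""

    def __init__(self, post_uri, src):
        super().__init__()
        self.add_triple(post_uri, "ex:hasImage", src)


class FeedItemIngestor(Ingestor):
    """Handles one feed item and spawns sub-ingestors as needed."""

    def __init__(self, item):
        super().__init__()
        uri = item["uri"]
        self.add_triple(uri, "dc:title", item["title"])
        for href in item.get("links", []):
            self.children.append(LinkIngestor(uri, href))
        for src in item.get("images", []):
            self.children.append(ImageIngestor(uri, src))


def ingest(item, store):
    """Append all triples for one feed item to the store (a plain list here)."""
    store.extend(FeedItemIngestor(item).collect())
    return store
```

Keeping the traversal in the base class means a new node type only needs its own small subclass; the top-level ingestor does not change how triples are gathered.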

I'm now closing in on 70,000 triples, with more being added every day as I scrape more from the feeds.

Good starting points for exploring the data are the list of Directories, Galleries, and Exhibits at my new, non-technical blog, Semantic UMW.
