Friday, August 15, 2008

Yak Shaving

Ok, so back in June I had an idea for a fun web app. I realized that there were lots of articles in Wikipedia tagged as "needing a photo". Also, I realized that there were lots of articles in Wikipedia tagged with geo-tags (example). It seemed like a straightforward and useful thing to mash these two lists together and find out things that are near a particular location that need photos taken of them. That might make a fun way to spend a weekend, or find interesting things when vacationing. Oh, and it would be neat to slap some Fire Eagle on there since it's Teh Hotness.

So, I decided to try and put something together.

The first step is figuring out a programmatic way of getting these two lists of articles, so that we can compare them. This is our first step down the rabbit hole.

Geonames


There isn't a direct way to query Wikipedia for articles near a geographical location, but luckily there is a service called Geonames that allows us to do exactly this. They let us form a URL like this:

http://ws.geonames.org/findNearbyWikipedia?lat=39.10982894895&lng=-84.48848724365&radius=15

that returns articles near a certain latitude and longitude within the given radius. Nice.
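If you'd rather build that URL in code than by hand, here is a minimal Python sketch. The function name and constant are my own inventions for illustration; only the endpoint and the lat, lng, and radius parameters come from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
GEONAMES_ENDPOINT = "http://ws.geonames.org/findNearbyWikipedia"


def nearby_wikipedia_url(lat, lng, radius):
    """Build a findNearbyWikipedia query URL (hypothetical helper name)."""
    query = urlencode({"lat": lat, "lng": lng, "radius": radius})
    return f"{GEONAMES_ENDPOINT}?{query}"


url = nearby_wikipedia_url(39.10982894895, -84.48848724365, 15)
print(url)
```

This only builds the string. Actually fetching it and parsing the response is left out, since the response format isn't covered here.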

A few problems, though:

1) There is more than one way to geo-tag a Wikipedia article and geonames only recognizes one of these kinds of tags. No worries, though, right? We can just go in and convert the ones with the bum tags over to the nice kind of tags. Editing Wikipedia for fun and profit! Sure, you could do that, except...

2) Geonames only crawls Wikipedia every so often and stores the geodata in its own database. So, you're not querying the live Wikipedia. You're just getting whatever geonames has, and that could be months old. Also, there's no predictable schedule for how often geonames will re-crawl Wikipedia.

But it does have quite a lot of articles, so we'll live with it for now.

Wikipedia articles needing photos


Now, we need to get a list of articles that need photos to compare with our list of articles that are near us. Again we discover the unfortunate trend of there being more than one way to do things in Wikipedia.

For starters, there are two major ways to mark that an article needs a photo.

A) Use the 'reqphoto' tag. Using this tag by itself marks the article as needing a photo. Adding the 'in=' parameter specifies that the thing is 'in' a particular location. And here's where it gets crazy: You can put multiple locations with 'in=somewhere', 'in2=anotherplace', 'in3=athirdplace' and so on. For each place that the article is 'in', Wikipedia automatically adds it to a category called "Articles needing photos in place". This can become messy.

B) The other way to mark an article as needing a photo is to use a WikiProject tag from a WikiProject that is based on a place and set the 'needs-photo' tag to 'yes'. For instance the WikiProject Cincinnati tag. This will add the article to a category called "Cincinnati articles needing photos". Some of these will be articles related to Cincinnati, like a person or a company, but not really located in one specific place. There are WikiProjects for many major cities and states, but not all and some don't use this feature.

Some articles use the reqphoto tag and others use the WikiProject tag. Some use both.

One problem with the reqphoto tag using the 'in' parameter is that there are no good guidelines as to what types of values 'in' should be. Many articles have 'in=Ohio'. Some just have 'in=Cincinnati'. Some have 'in=Cincinnati, Ohio'. Better yet, some have 'in=Cincinnati, Ohio, in2=Ohio'. In other states people have organized all reqphoto tags by county. So they'd have 'in=Cook County, Illinois'. I haven't been able to find a consensus on what is the best way to use this tag. Sometimes it is used for things that move, like people. Sometimes it is used for things that cross borders, like roads or mountains.
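To make the 'in', 'in2', 'in3' business concrete, here is a rough Python sketch that pulls the location values out of a reqphoto tag. It assumes the tag shows up in the page source as pipe-separated wikitext like {{reqphoto|in=Cincinnati, Ohio|in2=Ohio}}; that form and the function name are my assumptions, and real pages may well be messier than this handles.

```python
import re

# Assumed wikitext form: {{reqphoto|in=...|in2=...}} with pipe-separated parameters.
REQPHOTO_RE = re.compile(r"\{\{\s*reqphoto\s*((?:\|[^{}]*)?)\}\}", re.IGNORECASE)
# Matches the parameter names 'in', 'in2', 'in3', and so on.
IN_KEY_RE = re.compile(r"^in\d*$")


def reqphoto_locations(wikitext):
    """Return None if there is no reqphoto tag, else the list of 'in' values.

    An empty list means the tag is present but has no 'in' parameter at all.
    """
    match = REQPHOTO_RE.search(wikitext)
    if match is None:
        return None
    locations = []
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep and IN_KEY_RE.match(key.strip()):
            locations.append(value.strip())
    return locations


print(reqphoto_locations("{{reqphoto|in=Cincinnati, Ohio|in2=Ohio}}"))
```

Note that the values come back exactly as the editor typed them, so 'Ohio', 'Cincinnati' and 'Cincinnati, Ohio' all stay distinct. Normalizing them is the hard part, and this sketch doesn't attempt it.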

What that means is if you are in Cincinnati, and you want to check to see what articles need photos, there are potentially 5 categories where the article might appear:

Cincinnati articles needing photographs
Wikipedia requested photographs in Cincinnati
Wikipedia requested photographs in Cincinnati, Ohio
Wikipedia requested photographs in Hamilton County, Ohio
Wikipedia requested photographs in Ohio

And that's not to mention things that got categorized into other municipalities, like say Norwood or Springdale. Those won't appear in any of the above lists. Also there are the articles that have a reqphoto tag but no 'in' parameter at all.
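As a sketch, the five category names above can be generated from a city, county, and state. The function name is made up, and the patterns are just copied from the Cincinnati list, so there's no guarantee every place follows them.

```python
def candidate_categories(city, county, state):
    """List the category names where a photo request for this place might appear.

    The patterns mirror the Cincinnati examples; other places may differ.
    """
    return [
        f"{city} articles needing photographs",
        f"Wikipedia requested photographs in {city}",
        f"Wikipedia requested photographs in {city}, {state}",
        f"Wikipedia requested photographs in {county}, {state}",
        f"Wikipedia requested photographs in {state}",
    ]


for name in candidate_categories("Cincinnati", "Hamilton County", "Ohio"):
    print(name)
```

As noted, this still misses neighboring municipalities and any article whose reqphoto tag has no 'in' parameter.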

There are many ways to plow through all this and I tried to plot them all out here:



It's a mess. But it's a mess I understand at this point. Getting a complete answer is not as simple as finding two lists and getting the intersection. The real problem is with that area labeled "http request x N".

See, I could get all the nearby articles, and then check each one of them individually for the reqphoto tag. But that would require making N additional http requests, where N is the number of nearby articles.
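Here is roughly what that first approach looks like in Python, with the request count made explicit. The fetch_wikitext argument is a stand-in for whatever actually makes the http request and returns the page source where the tag lives; it, the function name, and the fake page data are all hypothetical, and the "{{reqphoto" check assumes the tag appears in that form in the wikitext.

```python
def articles_needing_photos(nearby_titles, fetch_wikitext):
    """Check each nearby article for a reqphoto tag, one request per article.

    fetch_wikitext is a caller-supplied function (hypothetical) that takes a
    title and returns that page's wikitext.
    """
    needing = []
    requests_made = 0
    for title in nearby_titles:
        wikitext = fetch_wikitext(title)  # one http request per nearby article
        requests_made += 1
        if "{{reqphoto" in wikitext.lower():
            needing.append(title)
    return needing, requests_made


# Fake pages standing in for real http responses (made-up data).
fake_pages = {
    "Article A": "{{reqphoto|in=Ohio}}",
    "Article B": "Already has a photo.",
    "Article C": "{{Reqphoto}}",
}
needing, requests_made = articles_needing_photos(list(fake_pages), fake_pages.get)
print(needing, requests_made)
```

The count always equals the number of nearby articles, no matter how few of them actually need a photo.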

Or, I could check all the reqphoto categories using city, county, and state info taken from Fire Eagle (that'd be a fixed number of http requests) and then for each article needing a photo, see if it has a geo-tag (which is a lot of additional http requests). Also, this method leaves out an article that has been geo-tagged and has the reqphoto tag, but no 'in' parameter.

There's no way to make a small, fixed number of requests and have all the information to make this app really useful. That is sucky.

It turns out Fire Eagle is actually the easiest part. The only problem is that it gives you city names like "Cincinnati, OH" which then have to be converted to either "Cincinnati" or "Cincinnati, Ohio" to match up to names of categories in Wikipedia.
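The conversion itself is simple enough. Here's a minimal Python sketch, assuming the name always arrives as "City, ST" with a two-letter state abbreviation. The lookup table is deliberately partial and the function name is made up.

```python
# Partial lookup table for illustration; a real one would cover every state.
STATE_NAMES = {"OH": "Ohio", "KY": "Kentucky", "IN": "Indiana", "IL": "Illinois"}


def wikipedia_place_names(fire_eagle_name):
    """Turn 'Cincinnati, OH' into the forms used in category names.

    Returns just the city when the state abbreviation is not in the table,
    and the input unchanged when there is no ', ' separator.
    """
    city, sep, abbrev = fire_eagle_name.rpartition(", ")
    if not sep:
        return [fire_eagle_name]
    state = STATE_NAMES.get(abbrev.strip().upper())
    if state is None:
        return [city]
    return [city, f"{city}, {state}"]


print(wikipedia_place_names("Cincinnati, OH"))
```

Both forms are returned because there's no way to know ahead of time which one a given category uses.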

If you've got Fire Eagle powers, then you can try this version. If not, try this one or this one. That's about as far as I've gotten.

The last bit, which I leave up to the user to figure out, is obviously: uploading the photo and adding it to Wikipedia. I believe I could eventually set up a bot to take photo submissions and add them to the articles on behalf of the user, but that's a whole 'nother can of worms.

See also: yak shaving