William Beutler on Wikipedia

Archive for April 2012

Disambiguate This!

April 17, 2012 at 1:05 pm

If the Wikipedia article titled “Wikipedia in culture” is to be believed, the free, online encyclopedia’s primary contribution to popular culture is as a humorous reference, particularly in U.S. cable television programming.

Sometimes the joke relates to Wikipedia’s uneasy relationship with education, including T-shirts featuring leaping graduates thanking Wikipedia. More often than not, Wikipedia’s uneven reliability is the joke, as in The Onion’s classic 2006 article: “Wikipedia Celebrates 750 Years Of American Independence”.

If it has had any noticeable linguistic impact (aside from debate over the meaning of “Santorum”) it is probably in the phrase “Citation needed”. But the word that I wish Wikipedia could popularize is:

Disambiguation

It’s a perfectly cromulent word, and can be found in the dictionary (or at least on Dictionary.com), apparently dating to the 1960s, and unsurprisingly means:

to remove the ambiguity from; make unambiguous

And yet it’s not a word that I can recall having seen prior to Wikipedia, even though I have a degree in English and very nearly earned one in journalism. In a world of ambiguity, what more could we want than disambiguation to help us understand what’s real, and what matters? Well, maybe therein lies the problem: there are no easy disambiguations in the real world. But are they so easy, even on Wikipedia?

If you don’t know what disambiguation is, it’s pretty simple. Wikipedia has articles about many people named John Smith, most real and even some fictional. So many, I’m not even going to bother counting. Because no John Smith is considered vastly more famous than the others, none of them gets this URL:

Nope, that’s the disambiguation page, where one can find, among many others:

And, for fans of The A-Team, there is also:

In many cases, a word will have one primary meaning, and then multiple secondary uses. This is when the parenthetical expression “(disambiguation)” comes in. One example:

Typically, articles requiring some form of disambiguation carry a “disambig” note at the top of the page (called a “hatnote”). Frequently, the phrasing is “Not to be confused with…” and here is one example, which I enjoy more than most:

McGraw-Hill disambiguation

As you may expect, there is a lengthy guideline detailing how disambiguation pages are to be governed. But on a website where not everyone knows the rules, nor does everyone agree about the relative importance of similarly named subjects, there can be some glitches. This is especially true when one is being implored by unknown advisers “not to be confused by” a deceptively unrelated topic.

One errant disambiguation comes to mind immediately, because I’m the one who undid it.

First, Bob Dole should be well known to any American over the age of 25, if not for being the Republican presidential nominee in 1996, then perhaps for that one Pepsi ad with Britney Spears. Meanwhile, Robert Dold is a U.S. congressman from Illinois, whom I had never heard of until very recently, although I live in DC and have worked in and around U.S. politics for a decade. (Dold has only been in Washington since 2010, so there’s that.)

Then what explains the admonition not to confuse this:

With this:

Yeah, I didn’t get it either. So I removed the unnecessary disambiguation from Dole’s page, and I seriously doubt anyone has been wondering “What about Bob (Dold)?”

There are other interesting imbalances, though they are often more justified. As I recently tweeted:

Joe Plummer vs. Joe the Plumber on Wikipedia

Indeed, compare this:

With this:

But I’m sure that’s right. Joe the Plumber is far better known, following his stint as the semi-official mascot of John McCain’s 2008 presidential campaign, than is Joe Plummer, who is probably a swell guy and earns bonus points from me for being from Portland. And with Mr. the Plumber now the Republican nominee to challenge Rep. Marcy Kaptur this fall, Plummer’s chances of catching up are looking even dimmer. Sorry, Joe (the Plummer).

But in the world of interesting disambiguations, undoubtedly this one is my favorite:

At least it doesn’t tell you not to be confused.

The Agony and Ecstasy of Wikidata

April 12, 2012 at 8:31 am

Although Wikipedia is by far the best-known of the Wikimedia collaborative projects, it is just one of many. Just this last week, Wikimedia Deutschland announced its latest contribution: Wikidata (also @Wikidata, and see this interview in the Wikipedia Signpost). Still under development, its temporary homepage announces:

Wikidata aims to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all the languages of the Wikimedia projects, and allow for the central access to data in a similar vein as Wikimedia Commons does for multimedia files. Wikidata is proposed as a new Wikimedia hosted and maintained project.

One of a few Wikidata logos under consideration.

Upon its announcement, I tweeted my initial impression, that it sounded like Wikipedia’s answer to Wolfram Alpha, the commercial “answer engine” created by Stephen Wolfram in 2009. It seems to partly be that but also more, and its apparent ambition—not to mention the speculation surrounding it—is causing a stir.

Already touted by TechCrunch as “Wikipedia’s next big thing” (incorrectly identifying Wikipedia as its primary driver, I pedantically note), Wikidata will create a central database for the countless numbers, statistics and figures currently found in Wikipedia’s articles. The centralized collection of data will allow for quick updates and uniformity of statistical information across Wikipedia.

Currently, when new information replaces old, as is the case when census surveys, election results and quarterly reports are published, Wikipedians must manually update the old data in every article in which it appears, across every language. Wikidata would make it possible for a quick, computer-led update to replace all out-of-date information. Additionally, it is expected that Wikidata will allow visitors to search and access information in a less labor-intensive way. As TechCrunch suggests:

Wikidata will also enable users to ask different types of questions, like which of the world’s ten largest cities have a female mayor?, for example. Queries like this are today answered by user-created Wikipedia Lists – that is, manually created structured answers. Wikidata, on the other hand, will be able to create these lists automatically.

Though this project (which is funded by the Allen Institute for Artificial Intelligence, the Gordon and Betty Moore Foundation, and Google) is expected to take about a year to develop, the blogosphere is already buzzing.

It’s probably fair to say that the overall response has been very positive. In a long post summarizing Wikidata’s aims, Yahoo! Labs researcher Nicolas Torzec identifies himself as one who excitedly awaits the changes Wikidata promises:

By providing and integrating Wikipedia with one common source of structured data that anyone can edit and use, Wikidata should enable higher consistency and quality within Wikipedia articles, increase the availability of information in and across Wikipedias, and decrease the maintenance effort for the editors working on Wikipedia. At the same time, it will also enable new types of Wikipedia pages and applications, including dynamically-generated timelines, maps, and charts; automatically-generated lists and aggregates; semantic search; light question & answering; etc. And because all these data will be available as Open Data in a machine-readable form, they will also benefit thrid-party [sic] knowledge-based projects at large Web companies such as Google, Bing, Facebook and Yahoo!, as well as at smaller Web startups…

Asked for comment by CNet, Andrew Lih, author of The Wikipedia Revolution, called it a “logical progression” for Wikipedia, even as he worries that Wikidata will drive away Wikipedians who are less tech-savvy, as it complicates the way in which information is recorded.

Also cautious is SEO blogger Pat Marcello, who warns that human error is still a very real possibility. She writes:

Wikidata is going to be just like Wikipedia in that it will be UGC (user-generated content) in many instances. So, how reliable will it be? I mean, when I write something — anything from a blog post to a book, I want the data I use in that work to be 100% accurate. I fear that just as with Wikipedia, the information you get may not be 100%, and with the volume of data they plan to include, there’s no way to vette [sic] all of the information.

Fair enough, but of course the upside is that corrections can be easily made. If one already uses Wikipedia, this tradeoff is very familiar.

The most critical voice so far is Mark Graham, an English geographer (and a fellow participant in the January 2010 WikiWars conference) who published “The Problem with Wikidata” on The Atlantic’s website this week:

This is a highly significant and hugely important change to the ways that Wikipedia works. Until now, the Wikipedia community has never attempted any sort of consistency across all languages. …

It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).

The reason that Wikidata marks such a significant moment in Wikipedia’s history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is that fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups.

The comments on the article are interesting, with some voices sharing Graham’s concerns, while others argue his concerns are overstated:

While there are exceptions, most of the information (and bias) in Wikipedia articles is contained within the prose and will be unaffected by Wikidata. … It’s quite possible that Wikidata will initially provide a lopsided database with a heavy emphasis on the developed world. But Wikipedia’s increasing focus on globalization and the tremendous potential of the open editing model make it one of the best candidates for mitigating that factor within the Semantic Web.

Wikimedia and Wikipedia’s slant toward the North, the West, and English speakers is well covered in Wikipedia’s own list of its systemic biases, and Wikidata can’t help but face the same challenges. Meanwhile, another commenter argued:

The sky is falling! Or not, take your pick. Other commenters have made more informed posts than this, but does Wikidata’s existence force Wikipedia to use it? Probably not. … But if Wikidata has a graph of the Israel boundary–even multiple graphs–I suppose that the various Wikipedia authors could use one, or several, or none and make their own…which might get edited by someone else.

Under the canny (partial) title of “Who Will Be Mostly Right … ?” on the blog Data Liberate, Richard Wallis writes:

I share some of [Graham's] concerns, but also draw comfort from some of the things Denny said in Berlin – “WikiData will not define the truth, it will collect the references to the data…. WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages.” They obviously intend to capture facts described in different languages, the question is will they also preserve the local differences in assertion. In a world where we still can not totally agree on the height of our tallest mountain, we must be able to take account of and report differences of opinion.

Evidence that those behind Wikidata have anticipated a response similar to Graham’s can be found on the blog Too Big to Know, where technologist David Weinberger shared a snippet of an IRC chat he had with a Wikimedian:

[11:29] hi. I’m very interested in wikidata and am trying to write a brief blog post, and have a n00b question.
[11:29] go ahead!
[11:30] When there’s disagreement about a fact, will there be a discussion page where the differences can be worked through in public?
[11:30] two-fold answer
[11:30] 1. there will be a discussion page, yes
[11:31] 2. every fact can always have references accompanying it. so it is not about “does berlin really have 3.5 mio people” but about “does source X say that berlin has 3.5 mio people”
[11:31] wikidata is not about truth
[11:31] but about referenceable facts

The compiled phrase “Wikidata is not about truth, but about referenceable facts” is an intentional echo of Wikipedia’s oft-debated but longstanding allegiance to “verifiability, not truth”. Unsurprisingly, this familiar debate is playing itself out around Wikidata already.
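For readers who think in code, the distinction drawn in the chat can be sketched roughly as follows. This is only an illustration under my own assumptions, not Wikidata's actual data model: the names `Claim`, `ClaimStore`, `source_says` and `values_for` are hypothetical, and the "source Y" figure is invented purely to show a disagreement. The idea is that the store never decides which population figure is "true"; it only records which source says what.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Claim:
    """One sourced statement: what is claimed, and who claims it."""
    subject: str
    prop: str
    value: int
    source: str  # hypothetical source label, e.g. "source X"


@dataclass
class ClaimStore:
    """Keeps every sourced claim side by side instead of one 'true' value."""
    claims: List[Claim] = field(default_factory=list)

    def add(self, claim: Claim) -> None:
        self.claims.append(claim)

    def source_says(self, source: str, subject: str, prop: str, value: int) -> bool:
        # Answers "does source X say that Berlin has 3.5 mio people",
        # not "does Berlin really have 3.5 mio people".
        return Claim(subject, prop, value, source) in self.claims

    def values_for(self, subject: str, prop: str) -> Dict[str, int]:
        # All recorded values for one property, keyed by source.
        return {
            c.source: c.value
            for c in self.claims
            if c.subject == subject and c.prop == prop
        }


store = ClaimStore()
store.add(Claim("Berlin", "population", 3_500_000, "source X"))
# Invented second figure, only to illustrate two sources disagreeing.
store.add(Claim("Berlin", "population", 3_400_000, "source Y"))
```

An article in any language could then call `store.values_for("Berlin", "population")` and choose which sourced figure, or figures, to display, which is roughly the freedom some of the commenters above were hoping for.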

Thanks for research assistance to Morgan Wehling.

Public Lives: Jim Hawkins and Wikipedia’s Privacy Dilemma

April 6, 2012 at 9:15 am

Editor’s note: The author of this blog post is Rhiannon Ruff (User:Grisette), a friend and colleague, in what I hope is a continuing series. The Wikipedian published a previous guest blog post in December 2011.

Introduction to Jim Hawkins Wikipedia article.

As an occasional Wikipedian, I like to check out Jimmy Wales’ user Talk page every now and again; while user Talk pages are generally where editors leave messages for each other, notes of support, or even warnings, Jimbo Wales’ page is a hotbed of intrigue, gossip and debate. It’s Wikipedia’s water cooler. And it’s the perfect place to go if you’re looking to find an example of the confusion that can result from the occasional collision of hot-headed editors, complex guidelines and individuals who are themselves the subjects of articles. Just today I came across a discussion that mentioned Jim Hawkins, a radio presenter in the UK who has been struggling to deal with Wikipedia editors, and Jimmy himself, over privacy issues raised by his biographical article.

Contrary to what many people believe, the Wikipedia community and Wikimedia Foundation are very keen to protect individuals’ privacy. There’s a common misunderstanding that if you edit Wikipedia, anyone can find out who you are, an idea spread by media coverage of incidents where editors’ IP addresses were traced and companies outed for editing their own articles (or, worse, those of competitors). But there’s actually a simple solution: creating an account on the site hides your IP address when you edit. And as long as you only edit while logged into that account, there’s no way for anyone to find out who or where you are through your IP. There are also very strong rules against “outing” the real-life identities of editors by posting their personal information on the site.

But what if you’re the subject of a Wikipedia article? Getting back to Jim Hawkins, here’s the real dilemma that people in the public eye are faced with: anyone can create an article about them, but how do they go about preventing their personal details from being included in it? Hawkins certainly wasn’t happy about the creation of an article about him, and he was even less impressed that it included details such as the county where he lives and his exact birthdate. He’s been trying to get the article deleted for five years now. Over time, his frustration in dealing with the Wikipedia community has led to increasing antagonism on both sides.

After a recent “edit war” where his birthdate was repeatedly added and removed, the date was removed once and for all after an official request was made on behalf of Hawkins. The edit was made in line with a privacy policy that allows subjects of biographical articles to request the removal of their date of birth from the site. But the county remained and Hawkins continued to rail against the system on the article’s Talk page:

Why should the people who’ve been stalking, bullying and harassing me – and have been doing so again today! – have any say in what happens to the article?
Hooray for policies. Does common human decency come into this anywhere? Or am I going to get the same response I’ve had for five years, the borderline-fundamentalist ‘that’s not how Wikipedia works’?

In a lively discussion on Jimmy Wales’ User Talk page beginning on April 1, editors were divided over two issues:

  1. Should an individual who is on the cusp of notability (i.e. just about eligible for a Wikipedia article, according to guidelines) be allowed to choose whether or not they have an article?
  2. If personal information about a subject has been published in public sources, does it contravene Wikipedia’s privacy rules to include it in the article?

There’s no simple answer to either of these. The first one in particular is really rather tricky. It’s true that if an article about someone hasn’t been created, there’s nothing that says that it has to exist. If an article has been created, though, it isn’t clear whether there should be the option to delete if the subject isn’t very strongly notable. Wikipedians seem to fall into roughly two camps on the issue: those with sympathy towards article subjects and those who are concerned with ensuring that information is available on Wikipedia, if sources exist to support it.

The main question that Hawkins raised was why there had to be an article about him, if he felt that it was unnecessary, inaccurate and infringed upon his privacy. At one point in the discussion he asks:

Can I point out that the whole damn thing is an invasion of privacy?

And an experienced editor replies, summarising the crux of the issue here:

An invasion of privacy is, by definition, the release of private information. This information, however, is not private, but is stated by the subject in the very show he hosts.

So, the issue is: if information exists in the public sphere, why should it not be included in a Wikipedia article? The details are already out there, some editors argue, so adding them to a Wikipedia article can’t infringe on the subject’s privacy, as the information wasn’t private to begin with. The bright line that exists on Wikipedia is its governing principle of verifiability: information included in articles must always be verifiable, that is, it must be supported by reliable sources. So, if personal information about a subject isn’t supported by a reliable source, even if it’s true, it can’t be included. Unfortunately, as Hawkins has discovered, if the information does appear in a reliable source (in this case, in a local magazine and on the BBC website), whether it is included or not comes down largely to editors’ discretion.

In short, the lesson Jim Hawkins has learned the hard way is: if you don’t want something included in your Wikipedia article, make sure it isn’t published in the first place.