Call for testers – NLP for humanities

We are creating a web-based tool that aims to make some state-of-the-art Natural Language Processing tools more accessible to humanities scholars.

We are recruiting a few testers for the project; we can offer modest stipends (exact amount TBD, but in the range of $150-$300) to compensate you for your time. The testers are expected to attend one or two sessions of 1-2 hours each at CU Boulder and also possibly to provide a bit of written feedback. The total time commitment is around 7-10 hours, depending on how much time you want to invest in playing with the tool and providing feedback.

If you are interested, please get in touch by March 13. We hope to hold the first session soon, and definitely before April 15, with the second session before May 3.

The tool and testing it

All you need is a laptop (Windows, Mac, or Linux) with at least 2.5 GB of free space.

The tool runs on your own computer, inside a Docker container, so you will need to install Docker first (we have instructions for that, and it’s not particularly challenging).

The interface runs in a browser. The data processed by the tool is put in a directory on the user’s computer and analyzed from there. We have sample data that you’ll be able to play with, but once you have the tool installed, you can also try it out with your own data.

And what’s the tool for? Well, you can extract named entities, examine what entities appear as doers/done-to or subjects/objects, how that changes over time, what verbs go with what entities, and so on. (Some of these are not implemented yet, but hopefully will be.)


If your question now is, what are named entities, and what do you mean by doers/done-to, and all that, read on for a mini-primer on NLP!

A bit about NLP

Natural Language Processing (NLP) is analyzing language with a computer. It has lots of different areas and applications. The ones particularly relevant for our tool are named entity recognition (NER), the analysis of syntactic roles, semantic role labeling (SRL), and constituency and dependency parsing.

Named entity recognition (NER)

Named entities are simply identifiable entities in text: people, locations, organizations, and so on. Named entity recognition processes a text to identify such entities, which can then be used in various kinds of analyses. Below are a couple of examples from the AllenNLP demo (our tool is built on the AllenNLP library); each entity is shown in parentheses, followed by its label.

If you like (Paul McCartney /PERSON) you should listen to the (first /ORDINAL) (Wings /ORG) album.

When I told (John /PERSON) that I wanted to move to (Alaska /GPE), he warned me that I'd have trouble finding a (Starbucks /ORG) there.

(GPE in the second example stands for geopolitical entity: countries, states, cities, and the like.)
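To give a feel for what "using entities in analyses" can mean, here is a minimal Python sketch that reads the parenthesized notation from the two examples above and tallies the labels. This is not how our tool or AllenNLP works internally; the pattern and the function name `extract_entities` are made up for this illustration, and the annotations are simply copied from the examples.

```python
import re
from collections import Counter

# Matches the "(text /LABEL)" notation used in the examples above.
# This notation is just how the examples are written here, not a standard format.
ENTITY_PATTERN = re.compile(r"\(([^()]+?)\s*/([A-Z]+)\)")

def extract_entities(annotated):
    """Return (text, label) pairs found in one annotated sentence."""
    return [(m.group(1).strip(), m.group(2))
            for m in ENTITY_PATTERN.finditer(annotated)]

sentences = [
    "If you like (Paul McCartney /PERSON) you should listen to "
    "the (first /ORDINAL) (Wings /ORG) album.",
    "When I told (John /PERSON) that I wanted to move to (Alaska /GPE), "
    "he warned me that I'd have trouble finding a (Starbucks /ORG) there.",
]

# Collect every entity, then count how often each label occurs.
entities = [pair for s in sentences for pair in extract_entities(s)]
label_counts = Counter(label for _, label in entities)

for text, label in entities:
    print(f"{label:8} {text}")
print(label_counts)
```

On these two sentences the tally comes out to two PERSON, two ORG, one ORDINAL, and one GPE. With a real corpus, the same kind of count is often the first thing one looks at.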

Having marked the entities, we can start thinking about relationships between them.

Syntactic role labeling

Labeling syntactic roles builds on Part-Of-Speech (POS) tagging, which marks words as “noun,” “verb,” “adjective,” and so on, and on grouping words into phrases such as “noun phrases.” And then, as you may recall from some distant school lesson, some nouns in sentences are labeled “subjects” (the doers of some action) and some are labeled “objects” (the things that something is done to). Objects can be “direct” or “indirect.”

So in the example below, “Mary gave the book to John,” Mary is the subject, the book is the direct object, and John is the indirect object.

[Figure: “Mary gave the book to John,” annotated with syntactic roles]

Semantic Role Labeling (SRL)

Sometimes, though, the syntactic “subject” does not quite capture what we mean when we say that someone is the “doer” of an action, that is, that someone or something is behaving like an “agent.” In the two examples below (‘John broke the window’ and ‘the window broke’), “window” is the direct object in one and the subject in the other, yet the same thing happens to it: by the end of the sentence, the window is broken.

So we have another system that tries to capture that semantic idea better – and, unsurprisingly, it’s called Semantic Role Labeling. What we want is for the window to be the “same thing” in both examples in a way that makes semantic sense.

See? Now “the window” is something called “Arg1” in both cases. Instead of subjects and objects, Semantic Role Labeling identifies something known as Agents (also known as Arg0) and Patients or Themes (also known as Arg1). The Agents are the doers, the Patients the done-to. (I’m cutting corners here – the reason for multiple terms is that there are multiple ways of identifying these, but there are other places where you can read up on the details; this is the quick-and-dirty version.)
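The contrast can also be written down as data. The sketch below is hand-annotated for illustration (it is not output from our tool or from a parser), and the helper name `roles_of` is made up. It stores both kinds of labels for the two window sentences and looks up which roles “the window” fills in each.

```python
# Hand-written annotations for the two window sentences.
# These are illustrative, not produced by a parser.
examples = [
    {
        "sentence": "John broke the window.",
        "syntactic": {"subject": "John", "direct object": "the window"},
        "semantic": {"ARG0": "John", "ARG1": "the window"},
    },
    {
        "sentence": "The window broke.",
        "syntactic": {"subject": "The window"},
        "semantic": {"ARG1": "The window"},
    },
]

def roles_of(phrase, annotation):
    """Return the role names under which `phrase` appears (case-insensitive)."""
    return sorted(role for role, text in annotation.items()
                  if text.lower() == phrase.lower())

for ex in examples:
    print(ex["sentence"])
    print("  syntactic:", roles_of("the window", ex["syntactic"]))
    print("  semantic: ", roles_of("the window", ex["semantic"]))
```

The syntactic lookup gives a different answer for each sentence (direct object, then subject), while the semantic lookup gives ARG1 both times, which is exactly the point of the second labeling scheme.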

Semantic Role Labeling is not a perfect match for how we might intuitively think about “doers” and “done-to.” For example, look at the labeling of the sentences below.

[Figure: semantic role labels for “The U.S. bombed South Vietnam.”]

[Figure: semantic role labels for “South Vietnam suffered because of U.S. bombing.”]

Here too, one feels that “South Vietnam” should be in the same role in both cases, but it’s not. The sentence is marked up correctly according to the system’s own lights; in its view, “suffering” is a case of “experiencing,” and “experiences” are agent-like. A question of semantics, methinks. In any case, in other sentences the match is better:

[Figure: semantic role labels for “South Vietnam underwent suffering because of U.S. bombing.”]

Now “South Vietnam” is Arg1, i.e., the patient, i.e., the done-to. So, some nice portion of the time things work out as expected.

There are lots of different proposals for the “roles” that one could mark sentences with. The roles also depend on the verb and on the sense of the verb. The implementation that we are using relies on the Proposition Bank (PropBank). If you want better insight into how different entities are marked with different roles depending on the verb, you can search for verbs here: https://verbs.colorado.edu/verb-index/search.php and then click on the verb next to PropBank (see the example for “give” below).

[Screenshot: PropBank entry for “give”]

Who cares, anyway?

NLP is used in all kinds of applications. Many of them have to do with things like question answering (which company’s stock went up last week? who bought stock in company X?) and various kinds of information extraction.

For the humanities, it might not necessarily be those kinds of questions that are the most interesting. Instead, a humanities scholar might be interested in seeing what kinds of relationships two entities have in a body of texts, or what kinds of entities or words are related, what connotations different entities/words have, which entities appear in “agent” positions, and so on.

For example, what kinds of words are “immigrants” linked to in newspaper stories, and what verbs link them to those words? It’s one thing if immigrants are consistently linked to “the economy” by verbs like “contribute,” and another thing entirely if they are linked to “jobs” by verbs like “take.” What adjectives modify “immigrant” and related words? In stories about immigrants, who appears in “agent” positions: immigrants, officials, someone else? And so on.
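Once sentences have been labeled, questions like these boil down to counting. Here is a rough Python sketch under loud assumptions: the triples are invented toy data (not results from any real corpus), and the names `verbs_linking` and `agent_counts` are hypothetical helpers, not part of our tool.

```python
from collections import Counter

# Invented toy data: (agent, verb, linked phrase) triples of the kind one
# might collect from semantic role labels. Not results from a real corpus.
triples = [
    ("immigrants", "contribute", "the economy"),
    ("immigrants", "take", "jobs"),
    ("officials", "detain", "immigrants"),
    ("immigrants", "contribute", "the economy"),
]

def verbs_linking(triples, entity):
    """Count (verb, linked phrase) pairs in which `entity` is the agent."""
    links = Counter()
    for agent, verb, linked in triples:
        if agent == entity:
            links[(verb, linked)] += 1
    return links

def agent_counts(triples):
    """Count how often each entity appears in the agent position."""
    return Counter(agent for agent, _, _ in triples)

for (verb, linked), n in verbs_linking(triples, "immigrants").most_common():
    print(f"immigrants --{verb}--> {linked}: {n}")
print(agent_counts(triples))
```

With a real collection of newspaper stories, the interesting part is how these counts differ between sources or shift over time.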
