Reputation: 3460
Basically I am writing a Java module that is supposed to take English text and switch the genders of the pronouns. So for example, if you give it "She put the box on the table" it would give you back "He put the box on the table." If you gave it "His feet hurt" it would give you back "Her feet hurt."
This is mostly easy, except for "his" versus "hers": sometimes "his" maps to "her", and sometimes it maps to "hers".
I've been looking into NLP, which I know almost nothing about, and I tried out OpenNLP, but it's failing me (I can't use Stanford NLP because of the licensing issue). The POS tagger and the chunker get confused by her/hers, and so does the parser. For example:
The box was his.
(TOP (S (NP (DT The) (NN box)) (VP (VBD was) (NP (PRP$ his))) (. .)))
The box was hers.
(TOP (S (NP (DT The) (NN box)) (VP (VBD was) (ADJP (JJ hers))) (. .)))
The box was his box.
(TOP (S (NP (DT The) (NN box)) (VP (VBD was) (NP (PRP$ his) (NN box))) (. .)))
The box was her box.
(TOP (S (NP (DT The) (NN box)) (VP (VBD was) (NP (PRP$ her) (NN box))) (. .)))
It correctly identifies "hers" as an adjective phrase, but when "his" is used in the predicate in exactly the same way, it incorrectly tags it as a possessive pronoun (PRP$), as if it were modifying a noun as in the third and fourth examples.
Is this just an issue with the training set? Would it be possible to create my own training set that handles this better, basically one with lots of his/hers sentences?
Bonus points if you can tell me whether there's any way to use NLP to determine the antecedent of a pronoun. For example:
"Wanda gave a watch to a girl named Lucy. She loved it."
My guess is this is pretty much impossible since this is sometimes even hard for humans.
Upvotes: 1
Views: 342
Reputation: 363817
Judging from the examples, you could try replacing "his" with "hers" instead of "her" whenever it appears as the only child of a node, which to my knowledge of English (not a native speaker) corresponds to the usage of words like "hers", "mine", etc.
I.e.
# NP with one child
(NP (PRP$ his)) ==> (ADJP (JJ hers))
but
# NP with two children, "his" and "box"
(NP (PRP$ his) (NN box)) ==> (NP (PRP$ her) (NN box))
(It's been a long time since I did anything with syntax trees, but in the first example, the NP label seems like a mistake by the parser.)
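As a rough sketch of the only-child rule, the following Java works directly on the bracketed parse string in the single-line form shown in the question. The class name HisSwapper and the method swapHis are made up for illustration, it only covers the "his" to "her"/"hers" direction, and it assumes the parser's output looks exactly like the examples above; a real implementation would more likely walk the parser's tree objects.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of OpenNLP): rewrites "his" in a bracketed parse string.
final class HisSwapper {
    // "his" as the ONLY child of its parent, e.g. (NP (PRP$ his)).
    // Group 1 is the parent label, group 2 the first letter (keeps capitalization).
    private static final Pattern ONLY_CHILD =
            Pattern.compile("\\(([^\\s()]+) \\(PRP\\$ ([Hh])is\\)\\)");

    // Any other possessive "his", i.e. one that has siblings such as (NN box).
    private static final Pattern WITH_SIBLINGS =
            Pattern.compile("\\(PRP\\$ ([Hh])is\\)");

    static String swapHis(String parse) {
        // Step 1: a lone "his" stands in for a noun phrase, so it becomes "hers".
        // The parent node is relabeled ADJP/JJ to mirror the parser's output for "hers".
        String result = ONLY_CHILD.matcher(parse).replaceAll("(ADJP (JJ $2ers))");
        // Step 2: every remaining "his" modifies a following noun, so it becomes "her".
        // "\\$" yields a literal dollar sign in the replacement string.
        return WITH_SIBLINGS.matcher(result).replaceAll("(PRP\\$ $1er)");
    }
}
```

Calling HisSwapper.swapHis on the first parse from the question should give the same tree the parser produced for "The box was hers.", while the third parse keeps its NP and only changes "his" to "her".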
Bonus points if you can tell me whether there's any way to use NLP to determine the antecedent of a pronoun.
This is called pronoun resolution, or more generally anaphora resolution, and there is a large body of literature on the problem. The baseline algorithm for this task is Hobbs' algorithm, which is described somewhere in SLP (Jurafsky and Martin's Speech and Language Processing), or in this question.
Upvotes: 2