
Reputation: 75

Identifying Grammatically Correct Nonsense Sentences

I have two files, file1.csv and file2.csv. Each row of file1.csv contains a pair of sentences, one of which is stupid. file2.csv identifies which column the stupid sentence is in (0 corresponding to type0, 1 corresponding to type1). I want to do an NLP classification task, and I usually know how to do this, but in this situation I am a bit confused and do not know how to arrange and organize my dataset so that I can train on my sentences and labels. I would appreciate a hint on how to progress.

file1.csv is in the following format:

id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.

file2.csv is in the following format:

id,stupid
0,0
1,1
2,0

My purpose is to classify the stupid sentences.
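One way to arrange the two files for training can be sketched in plain Python (the stdlib csv module, with the file contents inlined via io.StringIO purely for illustration; in practice you would open the real files). Every sentence becomes one (text, label) example, where label 1 means "stupid":

```python
import csv
import io

# Inline copies of file1.csv and file2.csv, for illustration only.
file1 = io.StringIO(
    "id,type0,type1\n"
    "0,He married to a dinosaur.,He married to a women.\n"
    "1,She drinks a beer.,She drinks a banana.\n"
)
file2 = io.StringIO("id,stupid\n0,0\n1,1\n")

# Map each id to the index of the stupid column (0 -> type0, 1 -> type1).
stupid_col = {row["id"]: int(row["stupid"]) for row in csv.DictReader(file2)}

# Flatten the pairs: each sentence becomes one (text, label) example.
examples = []
for row in csv.DictReader(file1):
    which = stupid_col[row["id"]]
    examples.append((row["type0"], 1 if which == 0 else 0))
    examples.append((row["type1"], 1 if which == 1 else 0))

for text, label in examples:
    print(label, text)
```

From here the (text, label) pairs can feed any standard text classifier, though (as the answers below discuss) it may be better to keep track of which examples came from the same pair.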

Upvotes: 1

Views: 813

Answers (4)

polm23

Reputation: 15593

Bigrams won't work for this - "a dinosaur" and "married a" are normal bigrams.

The simplest thing you can do is record token collocations. Break your document into sentences, and record how many times "dinosaur" and "married" (or whatever) occur in the same sentence. You should then be able to train a classifier on your labeled sentences to classify them. Intuitively this works the same as bigrams but it captures more long-range relationships.
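Recording sentence-level co-occurrences might be sketched like this (a minimal illustration with toy sentences; note that unlike bigrams, "married" and "dinosaur" are paired even though "a" sits between them):

```python
from collections import Counter
from itertools import combinations

# Toy corpus for illustration.
sentences = [
    "he married a dinosaur",
    "he married a woman",
    "she drinks a beer",
]

# Count how often each unordered word pair occurs in the same sentence,
# regardless of the distance between the two words.
cooc = Counter()
for sent in sentences:
    tokens = set(sent.split())
    for pair in combinations(sorted(tokens), 2):
        cooc[pair] += 1

print(cooc[("dinosaur", "married")])  # pairs are stored in sorted order
```

The co-occurrence counts (or features derived from them) can then serve as input to the classifier.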

A more sophisticated approach would be to classify subject-verb-object sets as reasonable or unreasonable. Use a dependency parse to get the (subject, verb, object) triples, then label them as reasonable or unreasonable, use word vectors as input, and train a classifier. If you do this then your model should be able to tell that "She married a dinosaur" is more strange than "She married a plumber" because "plumber" is closer to "man" in vector space than "dinosaur" is.
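The vector-space intuition can be sketched with toy vectors (the 3-d vectors below are invented purely for illustration; a real pipeline would extract the triples with a dependency parser such as spaCy and use pretrained word vectors like GloVe or fastText):

```python
import math

# Toy 3-d "word vectors", invented for illustration only.
vectors = {
    "man":      [0.9, 0.1, 0.0],
    "plumber":  [0.8, 0.2, 0.1],
    "dinosaur": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The object in "She married a plumber" sits near typical human objects
# in vector space; the object in "She married a dinosaur" does not.
sim_plumber = cosine(vectors["plumber"], vectors["man"])
sim_dinosaur = cosine(vectors["dinosaur"], vectors["man"])
```

A classifier trained on (subject, verb, object) vectors can exploit exactly this kind of proximity.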

I would also avoid classifying your examples individually if they always come in pairs. You can train a binary classifier that works on single instances, but compare the likelihood of the nonsense class between the two and pick the "more nonsensical" one. That way you can easily enforce the constraint that exactly one is nonsense.
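The pairwise decision rule is simple to sketch (the probabilities below are hypothetical classifier outputs, not from any real model):

```python
def pick_nonsense(prob_a: float, prob_b: float) -> int:
    """Return 0 if sentence A is more likely nonsense, else 1.

    Comparing the two scores enforces the constraint that exactly one
    sentence in the pair is nonsense, even when both individual
    probabilities fall on the same side of 0.5.
    """
    return 0 if prob_a >= prob_b else 1

# Both scores are below 0.5, yet we still pick the "more nonsensical" one.
choice = pick_nonsense(0.4, 0.2)
```

This way the binary classifier never has to be calibrated around a fixed threshold; only the relative ordering within a pair matters.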

Sounds like an interesting project, good luck with it! It's not the same but you might be interested in the classical problem of Winograd Schemas, and some of the approaches to solving that might be helpful to you. The concept of "selection" from linguistics is also relevant.

Upvotes: 0

Sanjana Reddy Nagam

Reputation: 11

I think using bigrams would be useful in such cases; that is, considering two words at a time.

Upvotes: 1

GRoutar

Reputation: 1425

Assuming that, 100% of the time, one sentence is semantically correct and the other isn't, you can just split the type0 and type1 sentences into two separate examples and classify them individually, e.g.:

id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.

Becomes:

id,sentence
0,He married to a dinosaur.
1,He married to a women.
2,She drinks a beer.
3,She drinks a banana.
4,He lifted a 500 tons.
5,He lifted a 50kg.

However, this won't work if your data contains records where one sentence is only slightly less stupid than the other, i.e. where there's an actual need to compare the two sentences against each other.
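The wide-to-long reshape shown above can be sketched in plain Python (stdlib csv, with the data inlined for illustration):

```python
import csv
import io

# Inline copy of the paired file, for illustration only.
paired = io.StringIO(
    "id,type0,type1\n"
    "0,He married to a dinosaur.,He married to a women.\n"
    "1,She drinks a beer.,She drinks a banana.\n"
)

# Reshape wide (one pair per row) to long (one sentence per row),
# assigning fresh sequential ids.
rows = []
for rec in csv.DictReader(paired):
    for col in ("type0", "type1"):
        rows.append({"id": len(rows), "sentence": rec[col]})
```

Each long-format row can then be labeled stupid/not-stupid using file2.csv.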

Upvotes: 1

Carbo

Reputation: 916

Maybe you can consider not only unigrams (treating each word individually as a variable) but also bigrams. This can help identify combinations of words that are nonsense. (Clean the text of stop words first.)

So you would have variables such as "married dinosaur" or "drinks banana" instead of each word alone.
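Building bigrams over the content words might look like this (a minimal sketch with a tiny hand-made stop-word list; a real pipeline would use a proper stop-word list from a library such as NLTK):

```python
# Tiny illustrative stop-word list -- not a real one.
STOP_WORDS = {"a", "an", "the", "to", "she", "he"}

def content_bigrams(sentence: str) -> list:
    """Lowercase, strip punctuation, drop stop words, pair adjacent words."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    content = [t for t in tokens if t not in STOP_WORDS]
    return list(zip(content, content[1:]))

bigrams = content_bigrams("He married to a dinosaur.")
# -> [("married", "dinosaur")]
```

After stop-word removal, the bigram directly captures the "married dinosaur" combination even though the words are not adjacent in the raw sentence.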

I'd use tidytext (for R), but if you're looking for something similar in Python, you could check out this:

https://github.com/michelleful/TidyTextMining-Python

Upvotes: 1
