Reputation: 75
I have two files, file1.csv and file2.csv. Each row of file1.csv contains a pair of sentences, one of which is stupid (nonsensical). file2.csv identifies which column the stupid one is in (type0 corresponding to 0, type1 corresponding to 1). I want to do an NLP classification task, and I usually know how to do it, but in this situation I am a bit confused and do not know how to arrange and organize my dataset so that I can train on my sentences and labels. I would appreciate a hint on how to progress.
file1.csv is in the following format:
id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
file2.csv is in the following format:
id,stupid
0,0
1,1
2,0
My goal is to identify the stupid sentence in each pair.
Upvotes: 1
Views: 813
Reputation: 15593
Bigrams won't work for this - "a dinosaur" and "married a" are normal bigrams.
The simplest thing you can do is record token collocations. Break your document into sentences, and record how many times "dinosaur" and "married" (or whatever) occur in the same sentence. You should then be able to train a classifier on your labeled sentences to classify them. Intuitively this works the same as bigrams but it captures more long-range relationships.
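For illustration, a minimal scikit-learn sketch of such co-occurrence features might look like this (the toy data, the pair-feature encoding, and the choice of logistic regression are assumptions on my part, not a fixed recipe):

# Sketch: every unordered pair of tokens that co-occurs in a sentence
# becomes one feature, regardless of how far apart the tokens are.
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_pairs(text):
    tokens = sorted({tok.strip(".,!?") for tok in text.lower().split()})
    return ["|".join(pair) for pair in combinations(tokens, 2)]

sentences = ["He married a dinosaur.", "He married a woman."]  # toy data
labels = [1, 0]  # 1 = nonsense, 0 = sensible

model = make_pipeline(CountVectorizer(analyzer=token_pairs), LogisticRegression())
model.fit(sentences, labels)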
A more sophisticated approach would be to classify subject-verb-object sets as reasonable or unreasonable. Use a dependency parse to get the (subject, verb, object) triples, then label them as reasonable or unreasonable, use word vectors as input, and train a classifier. If you do this then your model should be able to tell that "She married a dinosaur" is more strange than "She married a plumber" because "plumber" is closer to "man" in vector space than "dinosaur" is.
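A rough sketch of the triple extraction with spaCy, assuming the en_core_web_md model is installed (the md/lg models ship with word vectors, the sm model does not):

# Pull a (subject, verb, object) triple out of the dependency parse and
# concatenate the three word vectors as input for a classifier.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def svo_vector(sentence):
    doc = nlp(sentence)
    subj = verb = obj = None
    for tok in doc:
        if tok.dep_ in ("nsubj", "nsubjpass"):
            subj, verb = tok, tok.head
        elif tok.dep_ in ("dobj", "pobj"):
            obj = tok
    if subj is None or verb is None or obj is None:
        return None  # no complete triple found
    return np.concatenate([subj.vector, verb.vector, obj.vector])

features = svo_vector("She married a plumber.")  # feed this to a classifier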
I would also avoid classifying your examples individually if they always come in pairs. You can train a binary classifier that works on single instances, but compare the likelihood of the nonsense class between the two and pick the "more nonsensical" one. That way you can easily enforce the constraint that exactly one is nonsense.
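Concretely, with any probabilistic classifier, that comparison is a few lines (a sketch assuming a fitted scikit-learn model with predict_proba and a 0/1 nonsense label):

def more_nonsensical(model, sent_a, sent_b):
    # probability of the nonsense class (label 1) for each sentence
    p_a, p_b = model.predict_proba([sent_a, sent_b])[:, 1]
    # exactly one of the pair is flagged: 0 -> first sentence, 1 -> second
    return 0 if p_a >= p_b else 1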
Sounds like an interesting project, good luck with it! It's not the same but you might be interested in the classical problem of Winograd Schemas, and some of the approaches to solving that might be helpful to you. The concept of "selection" from linguistics is also relevant.
Upvotes: 0
Reputation: 11
I think using bigrams would be useful in such cases, that is, considering two words at a time.
Upvotes: 1
Reputation: 1425
Assuming that, 100% of the time, one sentence is semantically correct and the other isn't, you can just split the type0 and type1 sentences into two separate examples and classify them individually, e.g.:
id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
Becomes:
id,sentence
0,He married to a dinosaur.
1,He married to a women.
2,She drinks a beer.
3,She drinks a banana.
4,He lifted a 500 tons.
5,He lifted a 50kg.
However, this won't work if your data contains records where one sentence is only slightly less stupid than the other, i.e. where you actually need to compare the two sentences.
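If individual classification does suffice, the reshaping itself is a small pandas exercise. A sketch, assuming the two CSVs exactly as shown in the question:

# Merge the sentence pairs with their labels, then melt into one row per
# sentence; a row's label is 1 iff its column is the one file2.csv flags.
import pandas as pd

pairs = pd.read_csv("file1.csv").merge(pd.read_csv("file2.csv"), on="id")
long = pairs.melt(
    id_vars=["id", "stupid"],
    value_vars=["type0", "type1"],
    var_name="column",
    value_name="sentence",
)
long["label"] = (long["column"] == "type" + long["stupid"].astype(str)).astype(int)
print(long[["sentence", "label"]])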
Upvotes: 1
Reputation: 916
Maybe you can consider not only unigrams (treating each word individually as a variable) but also bigrams; this can help identify combinations of words that are nonsense. (Clean the text of stop words first.)
That way you would have variables such as "married dinosaur" or "drinks banana" instead of each word alone.
I'd use tidytext (for R), but if you're looking for something similar in Python you could check out this:
https://github.com/michelleful/TidyTextMining-Python
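In scikit-learn, for example, this feature setup is a one-liner (a minimal sketch, using the question's sentences as toy input):

# Unigram + bigram counts with English stop words removed, so features
# like "married dinosaur" appear alongside the individual words.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(["He married to a dinosaur.", "She drinks a banana."])
print(vec.get_feature_names_out())
# ['banana' 'dinosaur' 'drinks' 'drinks banana' 'married' 'married dinosaur']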
Upvotes: 1