Reputation: 1713
I'm doing some analysis on GitHub comments. For that, I need to automatically exclude code samples and error messages from a large set of comments.
Put another way, I want to keep only the English prose part of each comment. Although there are a few libraries that detect the language of a sentence, my case poses some challenges: 1) the prose part does not always follow proper English grammar, and 2) the code samples and error messages consist mainly of English words too.
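To make the problem concrete, here is a rough baseline sketch of the kind of filtering I have in mind. It assumes the comments are GitHub-flavored Markdown, so fenced blocks, indented blocks, and inline backtick spans can be dropped directly. The names `strip_code` and `symbol_ratio`, the error-line patterns, and the 0.15 threshold are all hypothetical and untuned. It will miss unfenced code that looks like prose, and it may drop prose lines that happen to contain many symbols.

```python
import re

# Inline code spans such as `foo()`.
INLINE_CODE_RE = re.compile(r"`[^`\n]+`")

# Hypothetical patterns for lines that look like error output or stack traces.
ERROR_LINE_RE = re.compile(
    r"^\s*(?:Traceback \(most recent call last\)"
    r"|File \".*\", line \d+"
    r"|at [\w.$<>]+\(.*\)"
    r"|\w*(?:Error|Exception)\b\s*:)"
)

# Characters that are common in code but rare in English prose (an assumption).
CODE_CHARS = set("{}()[];=<>|&$\\_")


def symbol_ratio(line):
    """Fraction of characters in the line that are code-like symbols."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    return sum(ch in CODE_CHARS for ch in stripped) / len(stripped)


def strip_code(comment, max_symbol_ratio=0.15):
    """Return the comment with likely code and error lines removed."""
    kept = []
    in_fence = False
    for line in comment.splitlines():
        # Toggle on fenced code block delimiters and skip everything inside.
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        # Markdown indented code block.
        if line.startswith("    ") or line.startswith("\t"):
            continue
        if ERROR_LINE_RE.match(line):
            continue
        line = INLINE_CODE_RE.sub(" ", line)
        if symbol_ratio(line) > max_symbol_ratio:
            continue
        kept.append(re.sub(r"\s+", " ", line).strip())
    text = "\n".join(kept)
    # Collapse runs of blank lines left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    sample = (
        "I get this error when I run the script:\n"
        "```\n"
        "Traceback (most recent call last):\n"
        "ValueError: bad input\n"
        "```\n"
        "Any idea how to fix `foo()`?"
    )
    print(strip_code(sample))
```

The weak spot is the last check: the symbol ratio is only a crude signal, which is exactly where I suspect a better approach is needed.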
So what would be the best approach? The results don't need to be 100% accurate; I just want an approach that gives a satisfactory result. Any ideas?
Upvotes: 1
Views: 1191