Irazall

Reputation: 157

Tool for detecting differences between text passages from two different groups

I have text data from two different groups. In total I have around 4000 text passages of around 300 words each.

I am searching for a tool that allows me to analyze the difference between these two groups.

Ideally, this tool could analyze different dimensions, e.g. sentence length, use of superlatives, perspective of the narrator, use of the passive voice, and clear, objective writing vs. hedging, imprecise writing.

Upvotes: 0

Views: 115

Answers (1)

larapsodia

Reputation: 624

In Python, you can use the nltk or spaCy packages to process the texts so that you can analyze them (using pandas, for example). But there's no ready-made software (as far as I know) that will do all of that for you. You're going to have to write your own code.

For example, you would create a pandas dataframe with one row per text, with its group ('A' or 'B' or whatever) as one column and the raw text as another. Then you use nltk to tokenize the text and do whatever other preprocessing you want, storing the clean, tokenized text in another column. Then you can add a column for, say, sentence length (which you can compute using nltk). From there you'll be able to get the mean and standard deviation for each group, the statistical significance of the difference, etc.
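To make that workflow concrete, here is a rough sketch using only the standard library. The regex-based sentence splitting is a crude stand-in for nltk's tokenizers, the list of dicts stands in for the dataframe, and the toy texts and helper names (`mean_sentence_length`, `group_stats`) are made up for illustration:

```python
import re
from statistics import mean, stdev

# Hypothetical toy data: (group, raw_text) pairs standing in for the real corpus.
texts = [
    ("A", "This is short. So is this."),
    ("A", "Tiny one. Another tiny one right here."),
    ("B", "This sentence is quite a bit longer than the others in group A."),
    ("B", "Here is another fairly long sentence that keeps going for a while."),
]

def mean_sentence_length(text):
    """Average number of word tokens per sentence (regex approximation)."""
    # Split after ., ! or ? followed by whitespace; a real tokenizer handles
    # abbreviations and quotes far better than this.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    return mean(lengths)

# One record per text: the group label plus the computed feature.
rows = [{"group": g, "sent_len": mean_sentence_length(t)} for g, t in texts]

def group_stats(rows, group):
    """Count, mean and sample standard deviation of the feature for one group."""
    values = [r["sent_len"] for r in rows if r["group"] == group]
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}

stats_a = group_stats(rows, "A")
stats_b = group_stats(rows, "B")
print(stats_a)
print(stats_b)
```

With a real dataframe the same idea becomes a new column plus a `groupby` on the group column, and a two-sample test such as `scipy.stats.ttest_ind` could be run on the two groups' values to check significance.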

It's straightforward for something like sentence length, but the other features you mention are more difficult. What does it mean for a text to be clear and objective, or hedged and imprecise? That means nothing on its own: you have to decide what exactly you mean by it, and which features characterize it. For example, you could make a list of hedges ('I think', 'may', 'might', 'I'm not sure but', etc.) and then count their frequency in each text.
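A hedge counter along those lines might look like the sketch below. The `HEDGES` list is just the examples from above and would need extending to match your own definition; the function names are made up for illustration:

```python
import re

# Hypothetical hedge list taken from the examples above; extend as needed.
HEDGES = ["i think", "may", "might", "i'm not sure but"]

# Longest phrases first so multi-word hedges win over shorter alternatives.
HEDGE_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(h) for h in sorted(HEDGES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def count_hedges(text):
    """Number of hedge expressions found in the text."""
    return len(HEDGE_RE.findall(text))

def hedge_rate(text):
    """Hedges per 100 word tokens, so texts of different length are comparable."""
    n_words = len(re.findall(r"[\w']+", text))
    return 100.0 * count_hedges(text) / n_words if n_words else 0.0

print(count_hedges("I think it may rain, but it might not."))
print(hedge_rate("The results are clear."))
```

Plain string matching like this produces false positives (the month "May", for instance), so it's worth spot-checking the matches on a sample of your texts.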

Something like "perspective of the narrator" might need to be annotated manually, depending on what you mean by it. If you just mean 1st person vs. 3rd person, that could be easy to identify (compare the 'I's vs. the 'he/she's), but anything more subtle than that, I'm not sure how you'd do it.
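For the simple 1st vs. 3rd person case, a pronoun count could be sketched like this. The pronoun sets and the tie-breaking rule are my own assumptions, not an established method, and dialogue inside a 3rd-person narrative will skew the counts:

```python
import re
from collections import Counter

# Assumed pronoun sets; adjust them to your data (e.g. whether "we" counts).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
THIRD_PERSON = {"he", "him", "his", "himself",
                "she", "her", "hers", "herself",
                "they", "them", "their", "theirs", "themselves"}

def person_counts(text):
    """Count first- and third-person pronouns in a text."""
    # Letters only, so a contraction like "I'm" yields the token "i".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in tokens:
        if tok in FIRST_PERSON:
            counts["first"] += 1
        elif tok in THIRD_PERSON:
            counts["third"] += 1
    return counts

def dominant_person(text):
    """Return 'first', 'third', or 'unclear' when the counts are tied."""
    c = person_counts(text)
    if c["first"] == c["third"]:
        return "unclear"
    return "first" if c["first"] > c["third"] else "third"

print(dominant_person("I went home and made my dinner."))
print(dominant_person("She said he would call her back."))
```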

Good luck with your project!

Upvotes: 1
