sanjana jha

Reputation: 249

How to find the frequency of repeated sentences in a file

I have a dataframe and need to find the top 20 most repeated sentences using Python. Please let me know how to go about it.

Column A
Hello How are you?
This ticket is not valid
How are things at you end?
Hello How are you?
How can I help you?
Please help me with tickets
This ticket is not valid
Hello How are you?

Expected Output

Column A                         Frequency of Repeated sentence
Hello How are you?               3
This ticket is not valid         2
How can I help you?              1
.
.
.

Code so far

import pandas as pd

df = pd.read_csv("C:\\Users\\aaa\\abc\\Analysis\\chat.csv", encoding="ISO-8859-1")
df['word_count'] = df['Column A'].apply(lambda x: len(str(x).split(" ")))
df[['Column A', 'word_count']].head()

for i, g in df.groupby('Column A'):
    print('Frequency of repeating sentence : {}'.format(g['Column A'].duplicated(keep=False).sum()))

I need the result as a dataframe with "Column A" and "Frequency" columns that can be written to CSV.

Upvotes: 3

Views: 194

Answers (4)

Renaud

Reputation: 2819

Try this:

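# Count the occurrences of each distinct sentence
counts = df.groupby('Column A')['Column A'].count().rename('count')
# sort_values returns a new object, so assign the result
counts = counts.sort_values(ascending=False)
print(counts.head(20))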

Upvotes: 2

N.Moudgil

Reputation: 879

Try this

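freq_series = df.groupby('Column A').size()
new_df = pd.DataFrame({'Column A': freq_series.index, 'Frequency': freq_series.values})
# Most frequent first, top 20 only
new_df = new_df.sort_values(by='Frequency', ascending=False).head(20)
new_df.to_csv('<your csv name>.csv', index=False)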

Upvotes: 0

Swati Srivastava

Reputation: 1157

df['count'] = df.groupby('Column A')['Column A'].transform('count')
df = df.sort_values(by='count', ascending=False)
df.head(20)

This adds a 'count' column to the original dataframe, holding the frequency of each row's sentence. transform() returns a Series aligned with the original dataframe's index, so a repeated sentence still appears once per occurrence.
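Since the transform() result keeps one row per occurrence, head(20) can be filled up by copies of the same sentence. A minimal sketch of one way around that, assuming the sample data from the question and a drop_duplicates step that is not part of the answer above:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Column A": [
    "Hello How are you?",
    "This ticket is not valid",
    "How are things at you end?",
    "Hello How are you?",
    "How can I help you?",
    "Please help me with tickets",
    "This ticket is not valid",
    "Hello How are you?",
]})

# Per-row frequency, aligned with the original index
df["count"] = df.groupby("Column A")["Column A"].transform("count")

# Keep one row per distinct sentence, most frequent first, top 20 only
top = (
    df.drop_duplicates(subset="Column A")
      .sort_values(by="count", ascending=False)
      .head(20)
      .reset_index(drop=True)
)
print(top)
```

Here top has one row per distinct sentence, so the top 20 are 20 different sentences rather than 20 rows.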

Upvotes: 1

YOLO

Reputation: 21749

Here's a way using .value_counts:

df['Column A'].value_counts()

To add it as a column, you can do:

df['Frequency'] = df['Column A'].map(df['Column A'].value_counts())
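To get the two-column result the question asks for, the value_counts output can be turned into its own dataframe. A rough sketch, assuming the sample data from the question; the output filename sentence_frequency.csv is a made-up placeholder:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Column A": [
    "Hello How are you?",
    "This ticket is not valid",
    "How are things at you end?",
    "Hello How are you?",
    "How can I help you?",
    "Please help me with tickets",
    "This ticket is not valid",
    "Hello How are you?",
]})

# value_counts sorts by frequency in descending order by default
counts = df["Column A"].value_counts()

# Top 20 as a dataframe with "Column A" and "Frequency" columns
result = (
    counts.head(20)
          .rename_axis("Column A")
          .reset_index(name="Frequency")
)

# Hypothetical output path; index=False keeps the row index out of the file
result.to_csv("sentence_frequency.csv", index=False)
print(result)
```

Naming the index with rename_axis before reset_index keeps the column names the same regardless of how the installed pandas version names the value_counts result.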

Upvotes: 4
