How to remove rows that have 3 words or fewer in a dataframe?

Question

I want to remove ambiguity from my data before training on it, so I want to clean it well. How can I remove all rows that contain 3 words or fewer in Python?

José Rodrigues · Accepted Answer

Hello World! This will be my first contribution ever to SO :-)

Let's create some data:

import pandas as pd

data = {'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']}
df = pd.DataFrame(data, columns=['Source'])

My approach is very straightforward and simple, if a little "brute" and inefficient; however, I ran it on a large dataframe (1,013,952 rows) and the time was fairly acceptable. Let's find the indices of the dataframe rows that have fewer than n tokens:

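from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data to be downloaded


def get_indices(df, col, n):
    """
    Get the index labels of the rows that have fewer than n tokens in a specific column

    Parameters:

       df (pandas dataframe)
       col (string): column name
       n (int): minimum number of tokens a row needs in order to be kept

    """
    tmp = []
    # Iterate over (index label, value) pairs so this also works with a non-default index
    for idx, text in df[col].items():
        # Note: word_tokenize counts punctuation marks as separate tokens
        if len(word_tokenize(text)) < n:
            tmp.append(idx)
    return tmp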

Next we just need to call the function and drop the rows at said indices. To remove rows with 3 words or fewer, pass n = 4:

tmp = get_indices(df, 'Source', 4)
df_clean = df.drop(tmp)
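
If splitting on whitespace is good enough for your data, the same filter can probably be written without a Python loop by using pandas string methods. This is only a sketch of an alternative, not the tokenizer-based approach above: it counts whitespace-separated words, so punctuation stays attached to its word and the counts can differ from what word_tokenize gives. The names word_counts and df_clean are just illustrative.

```python
import pandas as pd

data = {'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']}
df = pd.DataFrame(data, columns=['Source'])

# Count whitespace-separated words in each row
# (assumption: a "word" is anything between spaces).
word_counts = df['Source'].str.split().str.len()

# Keep only the rows that have more than 3 words.
df_clean = df[word_counts > 3]
print(df_clean)
```

On the sample data this keeps the first two rows and drops 'Oops', 'foo' and 'bar'.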

Best!
