OverflowingTheGlass

Reputation: 2434

Use word count in Pandas dataframe to drop rows with only one word

I have a dataframe (data) with 2 records:

id    text
0001  The farmer plants grain
0002  tuna

I want to count the number of words in the text column of this dataframe and drop rows with only one word.

I know how to count the number of words:

count = data['text'].str.split().str.len()

How do I use the result to run an IF statement that drops rows from the dataframe? Any IF statement such as...

if count == 1:
    print('drop')

...results in this error:

Traceback (most recent call last):

  File "<ipython-input-118-b3fcb0218e8e>", line 32, in <module>
    if count == 1:

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py", line 917, in __nonzero__
    .format(self.__class__.__name__))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have read the Pandas documentation and other SO questions around this error, but I can't seem to get the solutions to apply correctly to my issue with the IF statement.

Any advice is greatly appreciated! As I am relatively new to SO, please let me know if there's anything I can do to improve my question.

Upvotes: 4

Views: 4729

Answers (4)

Sam

Reputation: 33

Probably late to answer, but this could help new viewers.
You can easily find the index labels of the rows that match your condition and drop them from the dataframe.

# Index labels of the rows whose text has exactly one word (the rows to drop)
wantedRows = data[data['text'].str.split().str.len() == 1].index
data = data.drop(wantedRows, axis=0)
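For anyone who wants to run this end to end, here is a minimal sketch using the two records from the question. The frame construction and the name `one_word_rows` are my own assumptions for illustration; the ids are stored as strings only to keep the leading zeros.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed layout; ids kept as strings).
data = pd.DataFrame({
    "id": ["0001", "0002"],
    "text": ["The farmer plants grain", "tuna"],
})

# Index labels of the rows whose text has exactly one word.
one_word_rows = data[data["text"].str.split().str.len() == 1].index

# Drop those rows by label; only the multi-word row remains.
data = data.drop(one_word_rows, axis=0)
print(data)
```

Note that `drop` works on index labels, so this relies on the index being unique.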

Upvotes: 1

Antonio López Ruiz

Reputation: 1466

Just filter the dataframe with a boolean condition. It would look like this:

df = df[df['column'].str.contains(' ')]

Assuming there is a space between the words.
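That assumption matters. As a rough sketch (the sample strings below are made up for illustration), a space check and a word count can disagree when a value has trailing whitespace or uses a tab as the separator:

```python
import pandas as pd

# Hypothetical sample values chosen to show where the two checks differ.
df = pd.DataFrame({"column": ["two words", "tuna ", "tab\tseparated", "single"]})

# True wherever the string contains a literal space character.
has_space = df["column"].str.contains(" ")

# Number of whitespace-separated words in each string.
word_count = df["column"].str.split().str.len()

print(has_space.tolist())   # [True, True, False, False]
print(word_count.tolist())  # [2, 1, 2, 1]

# Filtering on the word count keeps only the rows with more than one word.
print(df[word_count > 1])
```

If the text is already clean and space-delimited, the two approaches give the same rows; stripping the strings first with `.str.strip()` would cover the trailing-space case but not the tab case.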

Upvotes: 0

piRSquared

Reputation: 294258

I'd just check whether it has a space:

data = data[data.text.str.contains(' ')]
data

     id                     text
0  0001  The farmer plants grain

Or, more generally, using count:

data = data[data.text.str.count(' ') > 0]
data

     id                     text
0  0001  The farmer plants grain

What was wrong?

count = data['text'].str.split().str.len()

Running this results in count being a pandas.Series of word counts, one per row.

count == 1

is a pandas.Series of truth values, one per row. if count == 1 makes no sense because it attempts to reduce the entire series to a single True or False, and a series of several values is neither. You have to use it differently to accomplish your goal, for example as a mask that selects rows. I've offered a way to do that. So has @StevenG.
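To make that concrete, here is a small sketch that reproduces the error and then uses the same boolean series in ways that do work. The frame is rebuilt from the question, and the names `is_one_word`, `message` and `kept` are my own.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed layout).
data = pd.DataFrame({
    "id": ["0001", "0002"],
    "text": ["The farmer plants grain", "tuna"],
})
count = data["text"].str.split().str.len()

# Element-wise comparison: one boolean per row, not a single boolean.
is_one_word = count == 1

# Using the series directly in an `if` raises the ValueError from the question.
try:
    if is_one_word:
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Explicit reductions give a single truth value when that is what you want.
print(is_one_word.any())  # True: at least one row has exactly one word
print(is_one_word.all())  # False: not every row has exactly one word

# To drop rows, use the series as a mask instead of an `if` statement.
kept = data[~is_one_word]
print(kept)
```

The point is that `any()` and `all()` answer a question about the whole column, while the mask answers it row by row, which is what dropping rows needs.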

Upvotes: 2

Steven G

Reputation: 17122

Use a mask:

dropped = data[~(count==1)].copy()

Explanation:

Assuming a df such as:

data = pd.DataFrame({'text': ['hello my name is','hey']})

Using your count method, you can check whether each count equals 1 or not, creating a boolean mask:

count = data['text'].str.split().str.len()
~(count==1)
Out[18]: 
0     True
1    False
Name: text, dtype: bool

Now you can apply that mask:

data[~(count==1)]
Out[22]: 
               text
0  hello my name is

Upvotes: 3
