Reputation: 2434
I have a dataframe (data) with 2 records:
id    text
0001  The farmer plants grain
0002  tuna
I want to count the number of words in the text
column of this dataframe and drop rows with only one word.
I know how to count the number of words:
count = data['text'].str.split().str.len()
How do I use the results to run an IF statement that will drop rows in the dataframe? Any IF statements such as...
if count == 1:
    print('drop')
...results in this error:
Traceback (most recent call last):
  File "<ipython-input-118-b3fcb0218e8e>", line 32, in <module>
    if count == 1:
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py", line 917, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have read the Pandas documentation and other SO questions around this error, but I can't seem to get the solutions to apply correctly to my issue with the IF statement.
Any advice is greatly appreciated! As I am relatively new to SO, please let me know if there's anything I can do to improve my question.
Upvotes: 4
Views: 4729
Reputation: 33
Probably late to answer but it could help new viewers.
You can easily find the index labels of the rows that match your condition and drop them from the dataframe.
wantedRows = data[data['text'].str.split().str.len()==1].index
data = data.drop(wantedRows, axis = 0)
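For anyone who wants to run this end to end, here is a minimal sketch using the two rows from the question. I'm assuming the id column holds strings (to keep the leading zeros), and the names one_word_idx and result are just mine for illustration.

```python
import pandas as pd

# Sample data from the question; ids are strings so the leading zeros survive.
data = pd.DataFrame({
    "id": ["0001", "0002"],
    "text": ["The farmer plants grain", "tuna"],
})

# Index labels of the rows whose text contains exactly one word.
one_word_idx = data[data["text"].str.split().str.len() == 1].index

# Drop those rows by label and keep the rest.
result = data.drop(one_word_idx, axis=0)
print(result)
```

Only the row with id 0001 should remain.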
Upvotes: 1
Reputation: 1466
Just filter the dataframe with a boolean condition. It would look like this:
df = df[df['column'].str.contains(' ')]
Assuming there is a space between the words.
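One thing to watch with the space test: a single word with leading or trailing whitespace still contains a space. Here is a rough sketch of that case and two possible workarounds. The data is made up for illustration, and the names by_space, by_space_stripped and by_word_count are hypothetical.

```python
import pandas as pd

# Made-up data: the second row is one word followed by a trailing space.
df = pd.DataFrame({"column": ["The farmer plants grain", "tuna ", "cod"]})

# The plain space test keeps "tuna " because it does contain a space.
by_space = df[df["column"].str.contains(" ")]

# Stripping surrounding whitespace first avoids that false match.
by_space_stripped = df[df["column"].str.strip().str.contains(" ")]

# Counting words with split() ignores surrounding whitespace as well.
by_word_count = df[df["column"].str.split().str.len() > 1]

print(by_space)
print(by_space_stripped)
print(by_word_count)
```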
Upvotes: 0
Reputation: 294258
I'd just see if it has a space
data = data[data.text.str.contains(' ')]
data
     id                     text
0  0001  The farmer plants grain
Or, more generally, using count:
data = data[data.text.str.count(' ') > 0]
data
     id                     text
0  0001  The farmer plants grain
What was wrong?
count = data['text'].str.split().str.len()
Running this results in count being a pandas.Series of lengths. count == 1 is a pandas.Series of truth values. if count == 1 makes no sense because it attempts to determine if the entire series is True. And it isn't True or False. You have to use it differently to accomplish your goals. I've offered a way to do that. So has @StevenG.
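To make that concrete, here is a small sketch using the question's data (I'm assuming the ids are strings). It shows the if statement raising the error, the any() and all() reductions that the error message suggests, and the row-wise filter that actually answers the question. The names is_single, message, any_single, all_single and kept are just for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "id": ["0001", "0002"],
    "text": ["The farmer plants grain", "tuna"],
})

count = data["text"].str.split().str.len()  # one word count per row
is_single = count == 1                      # one boolean per row

# Using the whole Series as an if condition raises ValueError.
message = None
try:
    if is_single:
        pass
except ValueError as err:
    message = str(err)
print(message)

# any() and all() collapse the Series to a single truth value,
# which is valid in an if statement but does not filter rows.
any_single = bool(is_single.any())  # at least one one-word row?
all_single = bool(is_single.all())  # every row one word?

# To drop rows, index the dataframe with the inverted boolean Series.
kept = data[~is_single]
print(kept)
```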
Upvotes: 2
Reputation: 17122
Use a mask:
dropped = data[~(count==1)].copy()
Explanation:
Assuming a dataframe such as:
import pandas as pd
data = pd.DataFrame({'text': ['hello my name is','hey']})
Using your count method, you can check whether each value equals 1, which creates a boolean mask:
count = data['text'].str.split().str.len()
~(count==1)
Out[18]:
0     True
1    False
Name: text, dtype: bool
Now you can apply that mask:
data[~(count==1)]
Out[22]:
               text
0  hello my name is
Upvotes: 3