Reputation: 1811
I have a dataframe with text data and I'm trying to clean out rows with empty content values. I have one row whose content column looks like this:
articles.loc[197040, 'content']
' '
I've tried cleaning it up with .isnull(), but that doesn't recognize empty strings. So I resorted to regex and tried:
nothing = re.compile(r'\W{1,}')
articles = articles[articles['content'] != nothing]
But this leaves the empty articles in. If I try:
' ' == nothing
I get False. But the regex tester seems to indicate that it should work. Using r'\W*' also returns False.
The problem persists with other meaningless strings (e.g., a mix of commas and whitespace) when I try other regex combinations.
Thanks for any help.
It's also not recognizing equivalence here:
'what.' == re.compile(r'\w*\.')
False
Or here:
'6:45' == r'[^A-Z]{1,}'
False
And so on and so forth.
Upvotes: 1
Views: 66
Reputation: 1693
To check whether a regex matches a string, call the pattern's match method instead of testing for equality. You're comparing a string with a compiled pattern object, and those are never equal. Try this:
nothing.match(' ') # out: <_sre.SRE_Match object; span=(0, 1), match=' '>
nothing.match(' , , ,') # out: <_sre.SRE_Match object; span=(0, 6), match=' , , ,'>
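To apply this to the DataFrame itself, one option is to build a boolean mask from the pattern. The following is a minimal sketch, not the only way to do it: the sample data is made up, the column is assumed to contain only strings (no NaN), and it uses fullmatch rather than match, because match only anchors at the start and would also flag a string like ' , foo'. The pattern is r'\W*' so that empty strings are caught as well.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real articles frame.
articles = pd.DataFrame({'content': ['foo', ' ', ' , , ,', 'what.', '']})

# Zero or more non-word characters: whitespace, punctuation, or nothing at all.
nothing = re.compile(r'\W*')

# fullmatch succeeds only if the entire string matches the pattern.
is_junk = articles['content'].map(lambda s: nothing.fullmatch(s) is not None)

# Keep the rows that are not junk.
cleaned = articles[~is_junk]
print(cleaned)
```

Here 'what.' is kept because it contains word characters, while the whitespace-only, punctuation-only, and empty rows are dropped.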
Upvotes: 0
Reputation: 6508
You can work around the problem with the built-in str.isspace, which returns True if the string contains only whitespace characters and at least one character.
Demo, also filtering empty strings:
import pandas as pd
articles = pd.DataFrame({'content' : ['foo','bar',' ','foo',' ','']})
articles = articles[(~articles['content'].str.isspace()) & (articles['content'] != '')]
>>> articles
content
0 foo
1 bar
3 foo
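A possible variation, sketched here on the same sample data, is to strip whitespace first and compare against the empty string, which covers both the whitespace-only and the empty case with a single condition. This assumes the column holds only strings; NaN values would pass through this filter, so they would need a separate dropna if they occur.

```python
import pandas as pd

articles = pd.DataFrame({'content': ['foo', 'bar', ' ', 'foo', ' ', '']})

# After stripping, whitespace-only strings become '' and are filtered
# together with strings that were empty to begin with.
articles = articles[articles['content'].str.strip() != '']
print(articles)
```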
Upvotes: 1