snapcrack

Reputation: 1811

regex not recognizing matches as True

I have a dataframe with text data and I'm trying to clean out rows with empty content values. I have one row whose content column looks like this:

articles.loc[197040, 'content']
'     '

I've tried cleaning it up with .isnull(), but that doesn't recognize empty strings. So I resorted to regex and tried:

nothing = re.compile(r'\W{1,}')
articles = articles[articles['content'] != nothing]

But this leaves the empty articles in. If I try:

'     ' == nothing

I get False. But the regex tester seems to indicate that that should work. Using r'\W*' also returns False.

The problem persists with other meaningless strings (e.g., a mix of commas and whitespace) when other regex combinations are tried.

Thanks for any help.

Edit:

It's also not recognizing equivalence here:

'what.' == re.compile(r'\w*\.')
False

Or here:

'6:45' == r'[^A-Z]{1,}'
False

And so on and so forth.

Upvotes: 1

Views: 66

Answers (2)

bogdanciobanu

Reputation: 1693

To check whether a regex matches a string, call the pattern's match method instead of testing for equality. You're comparing a string with a compiled pattern object, and those are never equal. Try this:

nothing.match('    ') # out: <_sre.SRE_Match object; span=(0, 4), match='    '>
nothing.match(' , , ,') # out: <_sre.SRE_Match object; span=(0, 6), match=' , , ,'>
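
One caveat: match only anchors at the start of the string, so a string with leading spaces followed by real text still produces a match. If the goal is to flag strings that consist entirely of non-word characters, fullmatch (Python 3.4+) is probably closer to what you want. A minimal sketch, where is_meaningless is a hypothetical helper name and the test strings are taken from the question:

```python
import re

# Same pattern as in the question; \W+ is equivalent to \W{1,}.
nothing = re.compile(r'\W+')

def is_meaningless(text):
    """Hypothetical helper: True if text contains only non-word characters."""
    return nothing.fullmatch(text) is not None

# match() only anchors at the start, so leading spaces are enough for a hit.
leading_only = nothing.match('   foo') is not None

# fullmatch() requires the whole string to fit the pattern.
samples = ['     ', ' , , ,', 'what.', '6:45', '   foo', '']
results = {s: is_meaningless(s) for s in samples}
for s, flag in results.items():
    print(repr(s), flag)
```

Note that the empty string is not flagged here, because \W+ needs at least one character; use \W* if empty strings should count as well.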

Upvotes: 0

Vinícius Figueiredo

Reputation: 6508

You can work around the problem with the built-in isspace string method, which returns True if the string contains only whitespace characters and has at least one character.


Demo, also filtering empty strings:

import pandas as pd
articles = pd.DataFrame({'content': ['foo', 'bar', '   ', 'foo', '    ', '']})
# isspace() is False for '', so empty strings need their own check
articles = articles[(~articles['content'].str.isspace()) & (articles['content'] != '')]

>>> articles
  content
0     foo
1     bar
3     foo
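
If rows made up only of punctuation and whitespace should be dropped too (the question mentions a mix of commas and whitespace), one option is a vectorized regex test. This is only a sketch: it assumes pandas 1.1 or later for Series.str.fullmatch, and the sample data and the names is_empty and cleaned are invented for illustration.

```python
import pandas as pd

# Invented sample data: real text, whitespace, punctuation, empty, missing.
articles = pd.DataFrame({'content': ['foo', '   ', ' , , ,', '', None, 'what.']})

# Treat missing values as empty, then flag rows with no word characters at all.
# \W* also matches the empty string, so no separate != '' check is needed.
is_empty = articles['content'].fillna('').str.fullmatch(r'\W*')

# Keep only the rows that contain at least one word character.
cleaned = articles[~is_empty]
print(cleaned)
```

Here only the 'foo' and 'what.' rows should survive, since every other value is empty, missing, or free of word characters.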

Upvotes: 1
