Reputation: 4618

Remove words that are only punctuation in a pandas Series

Imagine I have the following pandas series:

import pandas as pd
tmp = pd.Series(['k.; mlm', '(+).', 'a;b/c', '!".: abc', 'abc dfg', 'qwert@'])

For every element, I want to remove the words that consist only of punctuation, using a regex. I was trying to use something like:

tmp.str.replace(regex, '')

My final series would be:

tmp = pd.Series(['k.; mlm', '', 'a;b/c', 'abc', 'abc dfg', 'qwert@'])

Edit: By punctuation I mean punctuation as defined by the Unicode tables.

Upvotes: 3

Answers (4)

Wiktor Stribiżew

Reputation: 627292

It looks as if you planned to clear a field value (replace it all with an empty string) if the whole string consists of punctuation.

You may do that with

tmp.str.replace(r'^(?:[^\w\s]|_)+$', '', regex=True)

See the regex demo. NOTE: If you only plan to clear the value of rows that only consist of ASCII punctuation, you may use string.punctuation:

import re, string
tmp.str.replace(f"^[{''.join(map(re.escape, string.punctuation))}]+$", '', regex=True)

print(f"[{''.join(map(re.escape,string.punctuation))}]") shows [!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~], see its online demo. As expected, it does not match punctuation like ’, ‘, “, ”, «, », etc.
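A minimal sketch, using only the standard re and string modules, of how the ASCII-only class differs from the Unicode-aware pattern. The sample strings are made up for illustration:

```python
import re
import string

# Character class built from ASCII punctuation only.
ascii_punct = f"^[{''.join(map(re.escape, string.punctuation))}]+$"
# Unicode-aware pattern: any char that is not a word char or whitespace, or "_".
unicode_punct = r'^(?:[^\w\s]|_)+$'

# Illustrative samples: ASCII punctuation, curly quotes, guillemets, a word.
samples = ['(+).', '“”', '«»', 'abc']

ascii_hits = [bool(re.search(ascii_punct, s)) for s in samples]
unicode_hits = [bool(re.search(unicode_punct, s)) for s in samples]

print(ascii_hits)    # [True, False, False, False]
print(unicode_hits)  # [True, True, True, False]
```

With the Series from the question, either pattern would be passed as tmp.str.replace(pattern, '', regex=True).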

Details

- ^ - start of string
- (?: - start of a non-capturing group
  - [^ - start of a negated character class (it will match all chars BUT the ones specified inside it):
    - \w - word chars (any Unicode letters, digits, and _)
    - \s - any Unicode whitespace
  - ] - end of the class
  - | - or
  - _ - an underscore
- )+ - end of the group, + repeats it 1 or more times
- $ - end of string.

Pandas test:

>>> tmp.str.replace(r'^(?:[^\w\s]|_)+$', '', regex=True)
0     k.; mlm
1            
2       a;b/c
3    !".: abc
4     abc dfg
5      qwert@
dtype: object
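Note that this clears only row 1, while row 3 stays as !".: abc, whereas the expected output in the question shows abc there. If the goal is to drop individual whitespace-separated words that contain no letter or digit, one possible sketch follows. The helper name drop_punct_words is hypothetical, and splitting on whitespace is an assumption about what counts as a "word":

```python
import re

# Hypothetical helper: keep only the whitespace-separated words that
# contain at least one letter or digit ([^\W_] is \w without "_").
HAS_ALNUM = re.compile(r'[^\W_]')

def drop_punct_words(text):
    # split() with no argument splits on any run of whitespace.
    return ' '.join(w for w in text.split() if HAS_ALNUM.search(w))

data = ['k.; mlm', '(+).', 'a;b/c', '!".: abc', 'abc dfg', 'qwert@']
cleaned = [drop_punct_words(s) for s in data]
print(cleaned)
# ['k.; mlm', '', 'a;b/c', 'abc', 'abc dfg', 'qwert@']
```

On the Series itself this would be tmp.apply(drop_punct_words). A side effect of split/join is that runs of whitespace collapse to a single space.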

Upvotes: 1

yatu

Reputation: 88285

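You could use str.contains with the pattern [^\W] to find the strings that contain at least one word character (a letter, digit, or underscore), i.e. at least one character that is neither punctuation nor whitespace, and use where to replace all other strings with an empty string.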

Note that [] matches any character contained in the set, and by adding ^ at the beginning, all the characters that are not in the set will be matched.

tmp.where(tmp.str.contains(r'[^\W]'), '')

0     k.; mlm
1            
2       a;b/c
3    !".: abc
4     abc dfg
5      qwert@
dtype: object
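A small check with the standard re module of which strings this pattern keeps. The sample list is made up for illustration. Since [^\W] is a double negative equal to \w, underscore-only strings are kept and whitespace-only strings are blanked:

```python
import re

# Illustrative samples: punctuation, underscores, whitespace, a word.
samples = ['(+).', '___', '   ', 'abc']

# Strings that would survive tmp.where(tmp.str.contains(r'[^\W]'), '').
kept = [s for s in samples if re.search(r'[^\W]', s)]
# The same filter written with the shorter, equivalent \w.
kept_w = [s for s in samples if re.search(r'\w', s)]

print(kept)    # ['___', 'abc']
print(kept_w)  # ['___', 'abc']
```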

Upvotes: 2

wwnde

Reputation: 26676

IIUC

tmp.replace('[()+!".:]', '', regex=True).to_list()

OUTCOME

['k; mlm', '', 'a;b/c', ' abc', 'abc dfg', 'qwert@']

Explanation: [] in this case contains the characters to match. Series.replace replaces the values given in to_replace with value. I set regex=True because to_replace is a regular expression. Finally, I convert the result to a list with to_list().

Upvotes: 1

Vaishali

Reputation: 38415

You can use replace with regex=True and a negative lookahead regex. It matches strings that do not contain any word character (denoted by \w) and replaces them with an empty string.

tmp.replace(r'^((?!\w).)*$', '', regex=True)

0     k.; mlm
1            
2       a;b/c
3    !".: abc
4     abc dfg
5      qwert@
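To see what the lookahead pattern matches, here is a minimal sketch with plain re. The sample strings and the simpler pattern (an assumed equivalent written as a negated character class) are illustrative additions:

```python
import re

tempered = re.compile(r'^((?!\w).)*$')  # pattern used above
simpler = re.compile(r'^[^\w\n]*$')     # assumed equivalent: no word chars, no newlines

samples = ['k.; mlm', '(+).', '!".: abc', '', '_', ' . ']

matched = [s for s in samples if tempered.search(s)]
matched_simple = [s for s in samples if simpler.search(s)]

print(matched)         # ['(+).', '', ' . ']
print(matched_simple)  # ['(+).', '', ' . ']
```

Because _ counts as a word character, an underscore-only string is left untouched, while strings made of punctuation and spaces are cleared.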

Upvotes: 1
