Sergiy Polovynko

Reputation: 55

Split a string with RegEx

Good time of the day,

I am currently a little stuck on a challenge. I need to count the words in a phrase, splitting it on whitespace or on any special characters present.

import re

def word_count(string):
    counts = dict()
    regex = re.split(r" +|[\s+,._:+!&@$%^🖖]",string)
    for word in regex:
        word = str(word) if word.isdigit() else word
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

However, I am stuck on the regex part: when splitting, empty strings are also produced and end up being counted.

I started with using

for word in string.split():

But it does not pass the tests with phrases such as:

"car : carpet as java : javascript!!&@$%^&"

"hey,my_spacebar_is_broken."

'до🖖свидания!'

Hence, if I understand, RegEx is needed.

Thank you very much in advance!

Upvotes: 0

Views: 93

Answers (3)

Miguel

Reputation: 2219

Thanks to Olvin Roght for his suggestions. Your function can be elegantly reduced to this.

import re
from collections import Counter

def word_count(text):
    count = Counter(re.split(r"[\W_]+", text))
    del count['']
    return count

See Ryszard Czech's answer for an equivalent one liner.
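As a quick check of why the del count[''] line is there, here is a minimal sketch using one of the phrases from the question. When the text starts or ends with a separator, re.split returns an empty string at that edge, and Counter would otherwise count it as a word.

```python
import re
from collections import Counter

text = "hey,my_spacebar_is_broken."

# The trailing "." is a separator, so re.split leaves an empty string at the end.
parts = re.split(r"[\W_]+", text)
print(parts)  # ['hey', 'my', 'spacebar', 'is', 'broken', '']

count = Counter(parts)
del count['']  # Counter does not raise if '' was never counted
print(count)
```

Because Counter overrides item deletion to ignore missing keys, the del line is safe even for text that has no leading or trailing separators.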

Upvotes: 2

Ryszard Czech

Reputation: 18611

Use

import re
from collections import Counter

def word_count(text):
    return Counter(re.findall(r"[^\W_]+",text))

[^\W_]+ matches one or more characters that are neither non-word characters nor underscores. In effect, it matches runs of letters and digits, so no empty strings are ever produced.

See regex proof.
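To see the pattern at work, here is a small sketch that runs the function on the three phrases from the question. The phrases list and the loop are only for illustration.

```python
import re
from collections import Counter

def word_count(text):
    # Collect runs of letters/digits directly instead of splitting on separators.
    return Counter(re.findall(r"[^\W_]+", text))

# Test phrases copied from the question.
phrases = [
    "car : carpet as java : javascript!!&@$%^&",
    "hey,my_spacebar_is_broken.",
    "до🖖свидания!",
]

for phrase in phrases:
    print(list(word_count(phrase)))
# ['car', 'carpet', 'as', 'java', 'javascript']
# ['hey', 'my', 'spacebar', 'is', 'broken']
# ['до', 'свидания']
```

In Python 3, \w is Unicode-aware for str patterns, so the Cyrillic letters count as word characters while the emoji and punctuation do not.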

Upvotes: 1

Rabindra

Reputation: 401

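Change the regex pattern as shown below. There is no need for ' +|' in the pattern, since '\s' already covers spaces. Also note the '+' after the character class: it makes a run of consecutive separators count as a single split point.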

regex = re.split(r"[\s+,._:+!&@$%^🖖]+", string)
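One caveat, shown as a minimal sketch: with this pattern, a phrase that ends (or starts) with a separator still yields one empty string at that edge, so the result likely needs filtering before counting. The names SEPARATORS and split_words are hypothetical, introduced here only for illustration.

```python
import re

# Same character class as above, with the trailing '+' to merge separator runs.
SEPARATORS = r"[\s+,._:+!&@$%^🖖]+"

def split_words(text):
    # Drop the empty strings that re.split leaves at the edges of the text.
    return [word for word in re.split(SEPARATORS, text) if word]

raw = re.split(SEPARATORS, "до🖖свидания!")
print(raw)  # ['до', 'свидания', '']

print(split_words("до🖖свидания!"))  # ['до', 'свидания']
print(split_words("car : carpet as java : javascript!!&@$%^&"))
# ['car', 'carpet', 'as', 'java', 'javascript']
```

Note that this pattern only splits on the characters listed in the class, so any separator not listed there would stay attached to a word.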

Upvotes: 0
