GeoffWillis

Reputation: 123

RegEx processing with python

I am trying to learn python and do text analysis using NLTK at the same time.

I am using python to scrub text before text analysis.

Given the sentence: The target IP was: 127.1.1.100.

I want to tokenize it into:

["The", "target", "IP", "was", ":", "127.1.1.100", "."]

It is important I retain all the punctuation so as to reconstruct the source doc, but I need leading/trailing punctuation separated so I can do text analysis on the individual words. I wrote the following python code which works fine, but seems kinda kludgy.

punct = ['.', ',', ':', ';', '!', '[', ']', '(', ')', '{', '}']
def split_punctuation(sentence)-> list:
    sentwords = sentence.split(" ")
    for i, word in enumerate(sentwords):
        if word_ends_with_punct(word) and len(word) > 1:
            sentwords.pop(i)
            sentwords.insert(i, word[:-1])
            sentwords.insert(i+1, word[-1])
            word = word[:-1]
        if word_starts_with_punct(word) and len(word) > 1:
            sentwords.pop(i)
            sentwords.insert(i, word[0:1])
            sentwords.insert(i+1, word[1:])
            word = word[1:]
    return sentwords

def word_starts_with_punct(w)-> bool:
    for p in punct:
        if w.startswith(p):
            return True
    return False

def word_ends_with_punct(w)->bool:
    for p in punct:
        if w.endswith(p):
            return True
    return False
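As an aside, the two helper functions can be collapsed: `str.startswith` and `str.endswith` both accept a tuple of prefixes/suffixes, so the explicit loops aren't needed. A minimal sketch using the same punctuation set:

```python
# Same punctuation set as above, stored as a tuple so it can be passed
# directly to str.startswith / str.endswith.
punct = ('.', ',', ':', ';', '!', '[', ']', '(', ')', '{', '}')

def word_starts_with_punct(w) -> bool:
    # startswith accepts a tuple and returns True if any prefix matches
    return w.startswith(punct)

def word_ends_with_punct(w) -> bool:
    # endswith likewise accepts a tuple of candidate suffixes
    return w.endswith(punct)
```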

So looking on SO, I found a regex by Wiktor Stribiżew that does what I want, kinda:

re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip()

I was able to figure out what's going on, but in this form it separates ALL punctuation, even in the middle of words. For example, it converted today's date from 6/28/2019 to "6 / 28 / 2019".
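To see the over-splitting concretely, running that pattern over a bare date breaks it at every slash:

```python
import re

# The character class includes '/' (and '.'), so the pattern matches
# punctuation anywhere in the token, not just at the edges.
result = re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', "6/28/2019").strip()
# → '6 / 28 / 2019'
```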

So I modified it to use anchors at the beginning/end, but it seems I have to run it twice: once for leading punctuation and again for trailing. That seems rather inefficient, and I was hoping somebody could show me the correct way to accomplish this. The below code is the regex version:

def sep_punct_by_regex(sent)->list :
    words = sent.split(" ")
    new_words = []
    for w in words:
        tmp1 = re.sub(r'^[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+', r' \g<0> ', w).strip()
        tmp2 = re.sub(r'[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+$', r' \g<0> ', tmp1).strip()
        t = tmp2.split(" ")
        for x in t:
            new_words.append(x)
    return new_words

Note the ^ in tmp1 and the $ in tmp2. This works as is, but the goal is to learn while building, so how would I modify the regex for a single pass? I tried the obvious, ^ up front and $ at the end, but it doesn't work.
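For what it's worth, one way to collapse the two passes into one (a sketch, not necessarily the canonical approach) is to join the two anchored patterns with an alternation, so a single re.sub call handles a run of punctuation at either end:

```python
import re

# Same character class as in the question, factored out; the alternation
# ^CLASS+|CLASS+$ matches a punctuation run at either end of the word.
PUNCT = r'[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+'
EDGE_PUNCT = re.compile(r'^' + PUNCT + r'|' + PUNCT + r'$')

def sep_punct_single_pass(sent) -> list:
    new_words = []
    for w in sent.split(" "):
        spaced = EDGE_PUNCT.sub(r' \g<0> ', w).strip()
        # filter out empty strings left by double spaces when both ends match
        new_words.extend(tok for tok in spaced.split(" ") if tok)
    return new_words
```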

Upvotes: 1

Views: 82

Answers (1)

Wiktor Stribiżew

Reputation: 626747

You may use

re.findall(r'\b(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b|[^\W_]+|(?:[^\w\s]|_)+', s)


To remove the punctuation on both ends of a string and strip whitespace, use

re.sub(r'^[\W_]+|[\W_]+$', '', s).strip()
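For example, on a hypothetical token wrapped in quotes with a trailing period:

```python
import re

# [\W_] = any non-word char or underscore; the alternation trims a run
# of such chars at the start (^) and/or the end ($) of the string.
cleaned = re.sub(r'^[\W_]+|[\W_]+$', '', '"quoted."').strip()
# → 'quoted'
```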

So, inside a function it will look like (renaming the octet variable so it doesn't shadow the built-in oct)

def tokenize(s):
    s = re.sub(r'^[\W_]+|[\W_]+$', '', s).strip()
    octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
    return re.findall(r'\b{0}(?:\.{0}){{3}}\b|[^\W_]+|(?:[^\w\s]|_)+'.format(octet), s)

Details

  • \b(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b - an IPv4 regex pattern
  • | - or
  • [^\W_]+ - one or more letters or digits
  • | - or
  • (?:[^\w\s]|_)+ - one or more chars other than word and whitespace chars or _.
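Putting the findall pattern to work on the question's example sentence (without the outer-punctuation strip, since here the trailing period should be kept as its own token):

```python
import re

# Build the IPv4 alternative from one octet pattern; {{3}} escapes the
# literal {3} quantifier inside str.format.
octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
pattern = r'\b{0}(?:\.{0}){{3}}\b|[^\W_]+|(?:[^\w\s]|_)+'.format(octet)

tokens = re.findall(pattern, "The target IP was: 127.1.1.100.")
# → ['The', 'target', 'IP', 'was', ':', '127.1.1.100', '.']
```

Because the IPv4 alternative comes first, the whole address is consumed as one token before the punctuation branch can split it at the dots.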

Upvotes: 1
