Lobsta

Reputation: 27

Regex replacement for strip()

Long time/first time.

I am a pharmacist by trade but am teaching myself to code in a variety of languages that are useful to me for things like task automation at work, mainly Python 3.x. I am working through the automatetheboringstuff eBook and finding it great.

I am trying to complete one of the practice questions from Chapter 7: "Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then whitespace characters will be removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function will be removed from the string."

I am stuck on the case where the characters I want to strip also appear inside the string I want to strip them from, e.g. 'ssstestsss'.strip('s')

#!python3
import re

respecchar = ['?', '*', '+', '{', '}', '.', '\\', '^', '$', '[', ']']


def regexstrip(string, _strip):
    if _strip == '' or _strip == ' ':
        _strip = r'\s'
    elif _strip in respecchar:
        _strip = '\\' + _strip  # prefix the regex metacharacter with a backslash
    print(_strip) #just for troubleshooting 
    re_strip = re.compile('^'+_strip+'*(.+)'+_strip+'*$')
    print(re_strip) #just for troubleshooting 
    mstring = re_strip.search(string)
    print(mstring) #just for troubleshooting 
    stripped = mstring.group(1)
    print(stripped)

As shown, running it on ('ssstestsss', 's') yields 'testsss', because the .+ consumes everything and the * lets the pattern ignore the final 'sss'. If I change the final * to a +, it only improves a bit and yields 'testss'. If I try to make the capture group non-greedy (i.e. (.+)? ), I still get 'testsss'. If I exclude the character to be stripped from the character class for the capture group and remove the end-of-string anchor (i.e. re.compile('^'+_strip+'*([^'+_strip+'.]+)'+_strip+'*')), I get 'te', and if I don't remove the end-of-string anchor then it obviously errors.

Apologies for the verbose and ramble-y question.

I deliberately included all the code (work in progress) as I am only learning so I realise that my code is probably rather inefficient, so if you can see any other areas where I can improve my code, please let me know. I know that there is no practical application for this code, but I'm going through this as a learning exercise.

I hope I have asked this question appropriately and haven't missed anything in my searches.

Regards

Lobsta

Upvotes: 1

Views: 1289

Answers (2)

Wiktor Stribiżew

Reputation: 626950

As I mentioned in my comment, you did not put the special chars inside the character class.

Also, without the re.S / re.DOTALL modifier, the . in your .+ does not match newlines. You can avoid it altogether by removing the matches of ^PATTERN|PATTERN$ or, better, \APATTERN|PATTERN\Z. Note that \A matches only at the start of the string and \Z only at the very end of the string, whereas $ can also match just before the final newline in a string, so $ is not reliable here.
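To make the difference between $ and \Z concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only re.sub and str.strip are real calls.

```python
import re

# Hypothetical sample: the characters to strip sit just before a final newline.
text = "xxabcxx\n"

# '$' also matches right before a final newline, so the trailing 'xx' is removed.
with_dollar = re.sub(r"^x+|x+$", "", text)

# '\Z' matches only at the very end of the string, so the trailing 'xx'
# (which is followed by the newline) is left alone, just like str.strip('x').
with_anchors = re.sub(r"\Ax+|x+\Z", "", text)

print(repr(with_dollar))      # 'abc\n'
print(repr(with_anchors))     # 'abcxx\n'
print(repr(text.strip("x")))  # 'abcxx\n'
```

Only the \A / \Z version agrees with the built-in method on this input.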

I'd suggest shrinking your code to

import re

def regexstrip(string, _strip=None):
    _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z"
    print(_strip) #just for troubleshooting 
    return re.sub(_strip, '', string)

print(regexstrip(" ([no more stripping']  )  ", " ()[]'"))
# \A[\s\ \(\)\[\]\']+|[\s\ \(\)\[\]\']+\Z
# no more stripping
print(regexstrip(" ([no more stripping']  )  "))
# \A\s+|\s+\Z
# ([no more stripping']  )

See the Python demo

Note that:

  • The _strip argument is optional; it defaults to None.
  • The _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z" line builds the regex pattern: if _strip is passed, its characters are escaped with re.escape and placed inside a [...] character class together with \s (since we cannot control where each character ends up, escaping them all is the quickest and easiest way to have them treated as literal symbols); otherwise only whitespace is matched.
  • With re.sub, we remove the matched substrings.
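The version above always strips whitespace as well, because \s is part of the character class. If you want something closer to the built-in str.strip, which removes only the characters you pass, a variant could look like the sketch below. The name strip_like and its parameter names are hypothetical, and it only approximates str.strip for whitespace, since \s and str.isspace are defined separately.

```python
import re

def strip_like(text, chars=None):
    """Rough regex imitation of str.strip (hypothetical helper, not from the book)."""
    if chars is None:
        # No characters given: strip whitespace, as str.strip() does.
        char_class = r"\s"
    elif chars == "":
        # str.strip("") removes nothing; an empty [] class would be a regex error.
        return text
    else:
        # Escape everything so ^, ], \ and - are literal inside the class.
        char_class = "[" + re.escape(chars) + "]"
    pattern = r"\A{0}+|{0}+\Z".format(char_class)
    return re.sub(pattern, "", text)

print(strip_like("ssstestsss", "s"))  # test
print(strip_like("  hi \n"))          # hi
print(strip_like("..a.b..", "."))     # a.b
```

Unlike the shorter version, strip_like(" xax ", "x") leaves the string unchanged, because the outer spaces are not in the set of characters to strip.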

Upvotes: 2

HolyDanna

Reputation: 629

Your (.+) is greedy (by default). Just change it to non-greedy by using (.+?)
You can test python regex at this site

Edit: as someone commented, (.+?) and (.+)? do not do the same thing. (.+?) is the non-greedy version of (.+), while (.+)? makes the greedy (.+) group optional.
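A quick way to see the three variants side by side, using the 'ssstestsss' example from the question (the variable names are just for illustration):

```python
import re

text = "ssstestsss"

# Greedy group: .+ grabs as much as it can, leaving nothing for the final s*.
greedy = re.search(r"^s*(.+)s*$", text).group(1)

# Non-greedy group: .+? takes as little as possible, so the final s* gets 'sss'.
lazy = re.search(r"^s*(.+?)s*$", text).group(1)

# Optional greedy group: still greedy, the ? only allows it to be absent.
optional = re.search(r"^s*(.+)?s*$", text).group(1)

# Edge case: a string made only of the stripped character.
only_s = re.search(r"^s*(.+?)s*$", "sss").group(1)

print(greedy)    # testsss
print(lazy)      # test
print(optional)  # testsss
print(only_s)    # s
```

The last line shows a limit of this fix: because .+? must match at least one character, 'sss' comes back as 's', whereas 'sss'.strip('s') returns an empty string. Using (.*?) in place of (.+?) handles that case.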

Upvotes: 2
