Reputation: 169

Python str.strip() with regex filtering unexpected characters

I'm running into an issue that I hope is simple, but I've hit a wall trying to figure it out. I'm attempting to strip the DateTime timestamp from the beginning of each line in a file, but the result is missing some of the characters that I'd like to keep. I'm fairly sure my regex is OK, and the regex.group() output looks good. Lines containing the letters "c" and "e" seem to get characters trimmed off, while other lines work as expected.

Python 2.7.6 (default, Jun 22 2015, 17:58:13)

[GCC 4.8.2] on linux2

>>> import re
>>>
>>> line2 = '[Wed Dec 01 10:24:24 2010] ceeeeest'
>>> a = re.match(r'(\[[A-Za-z]{3}\s)?([A-Za-z]{3})(\s+)([0-9]{1,4})(\s+)([0-9]{2})(:)([0-9]{2})(:)([0-9]{2})(\s[0-9]{1,4})?(\])?', line2, re.I)
>>> a.group()
'[Wed Dec 01 10:24:24 2010]'
>>> a.groups()
('[Wed ', 'Dec', ' ', '01', ' ', '10', ':', '24', ':', '24', ' 2010', ']')
>>> b = a.group()
>>> b
'[Wed Dec 01 10:24:24 2010]'
>>> c = line2.strip(b)
>>> c
'st'
>>>

I expect c to be "ceeeeest"

>>> line = '[Wed Dec 01 10:24:24 2010] testc'
>>> a = re.match(r'(\[[A-Za-z]{3}\s)?([A-Za-z]{3})(\s+)([0-9]{1,4})(\s+)([0-9]{2})(:)([0-9]{2})(:)([0-9]{2})(\s[0-9]{1,4})?(\])?', line, re.I)
>>> a.group()
'[Wed Dec 01 10:24:24 2010]'
>>> a.groups()
('[Wed ', 'Dec', ' ', '01', ' ', '10', ':', '24', ':', '24', ' 2010', ']')
>>> b = a.group()
>>> c = line.strip(b)
>>> c
'test'
>>>

I expect c to be "testc"

Is there something very basic I am missing here? Please enlighten me. Thank you.

Upvotes: 3

Answers (6)

Padraic Cunningham

Reputation: 180481

b is '[Wed Dec 01 10:24:24 2010]', so line2.strip(b) strips any of the characters that appear in b from both ends of line2, and everything except st gets removed:

'[Wed Dec 01 10:24:24 2010] ceeeeest'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   
 # all in [Wed Dec 01 10:24:24 2010]

So only st remains, as those are the only two characters not in b. strip keeps stripping from both ends until it hits a character that is not in the set:

In [3]: s = "fooboaroof"

In [4]: s.strip("foo")
Out[4]: 'boar'
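
To make the character-set behaviour concrete, here is a minimal sketch that mimics what strip does with its argument. strip_chars is a hypothetical helper written only for illustration; it is not how CPython implements str.strip, and it assumes Python 3 syntax.

```python
def strip_chars(text, chars):
    """Mimic str.strip(chars): drop leading/trailing characters found in chars."""
    charset = set(chars)  # the argument acts as a set of characters, not a substring
    start, end = 0, len(text)
    # Advance from the left while the current character is in the set.
    while start < end and text[start] in charset:
        start += 1
    # Retreat from the right while the last character is in the set.
    while end > start and text[end - 1] in charset:
        end -= 1
    return text[start:end]


line2 = '[Wed Dec 01 10:24:24 2010] ceeeeest'
stamp = '[Wed Dec 01 10:24:24 2010]'

result = strip_chars(line2, stamp)
same_as_builtin = result == line2.strip(stamp)
print(result)           # 'st': the leading 'c' and every 'e' also occur in "Wed Dec"
print(same_as_builtin)  # True
```

Because "c" and "e" both appear in the timestamp, they are stripped along with it, which is exactly the symptom in the question.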

If the date is always at the start, which it must be if you are using match, then once you have a match the simplest approach is to split:

line2 = '[Wed Dec 01 10:24:24 2010] ceeeeest'

print(line2.split("] ", 1)[1])

Or:

print(line2[len(a.group()):].lstrip())

Upvotes: 0

aghast

Reputation: 15310

If you really want to strip (that is, discard) the date and time information, and if the information is in the format you show, try this:

#! python3

lines = [
    '[Wed Dec 01 10:24:24 2010] ceeeeest',
    '[Wed Dec 01 10:24:24 2010] testc',
    'just a plain old line',
    '       indented',
    '      with [brackets]',
    '[BOGUS! This should be disallowed!',
    '[][][] Three pairs',
]

for line in lines:
    if line.startswith('['):
        try:
            line = line[line.index(']')+2:]
        except ValueError:
            print('Invalid formatting: open [ with no close!')
        else:
            print(line)
    else:
        print('Ho hum, nothing interesting about:', line)

Upvotes: -1

Work of Artiz

Reputation: 1090

If I get what you're attempting to do right, you can just use a regex to extract the word/sentence afterwards:

import re
regex = re.compile(r'(?:\s*\[.*?\])(.*)')
sentence = regex.findall(line)[0].strip()

Note that I have omitted the validation that you had in your regex; you can still use it.

Upvotes: -1

galaxyan

Reputation: 6141

If the same pattern repeats in your string, you can use a regex to find all the matches and then replace each one with an empty string:

import re
pattern = r'\[\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}\] '
for p in re.findall(pattern, line):
    line = line.replace(p, '')
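
The find-then-replace loop can probably be collapsed into a single re.sub call. The sketch below reuses the pattern above; anchoring it with ^ and passing count=1 are assumptions about the data (one timestamp, at the start of each line), and the sample lines are made up for illustration.

```python
import re

# Same pattern as above, anchored to the start of the line (an assumption).
pattern = re.compile(r'^\[\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}\] ')

lines = [
    '[Wed Dec 01 10:24:24 2010] ceeeeest',
    '[Wed Dec 01 10:24:24 2010] testc',
    'no timestamp here',
]

# Lines without a leading timestamp pass through unchanged.
cleaned = [pattern.sub('', line, count=1) for line in lines]
print(cleaned)
```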

Upvotes: 0

Marc J

Reputation: 1433

As others have pointed out, you are using strip incorrectly. Instead, since you already have matching working, slice off the number of characters from the start of the string.

result = line[len(a.group()):]
print(result)
# prints ' testc'
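
A slightly more defensive sketch of the slicing approach, using the match's end offset and leaving unmatched lines alone. drop_timestamp and TIMESTAMP are hypothetical names; the pattern is the one from the question, and Python 3 syntax is assumed.

```python
import re

# The timestamp regex from the question, compiled once.
TIMESTAMP = re.compile(
    r'(\[[A-Za-z]{3}\s)?([A-Za-z]{3})(\s+)([0-9]{1,4})(\s+)'
    r'([0-9]{2})(:)([0-9]{2})(:)([0-9]{2})(\s[0-9]{1,4})?(\])?',
    re.I,
)


def drop_timestamp(line):
    """Return line without its leading timestamp; unchanged if none matches."""
    m = TIMESTAMP.match(line)  # match() only succeeds at the start of the string
    if m is None:
        return line
    # Slice from the end of the match, then drop the separating whitespace.
    return line[m.end():].lstrip()


print(drop_timestamp('[Wed Dec 01 10:24:24 2010] testc'))
print(drop_timestamp('[Wed Dec 01 10:24:24 2010] ceeeeest'))
print(drop_timestamp('no timestamp here'))
```

Slicing by position never inspects the characters after the timestamp, so trailing "c" or "e" characters are safe.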

Upvotes: 0

Jared Goguen

Reputation: 9010

The method str.strip will remove all characters from the beginning and end of the string that are in the argument. You probably want to use str.replace instead.

>>> line = '[Wed Dec 01 10:24:24 2010] testc'
>>> line.replace('[Wed Dec 01 10:24:24 2010]', '')
' testc'

You can get rid of the leading white space by using str.lstrip, or use str.strip if you want to get rid of trailing white space too (the default arguments are white space).
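
One caveat worth sketching: str.replace removes every occurrence by default, so if the same timestamp text could appear again later in the line, passing a count of 1 limits the replacement to the first occurrence. The sample line below is invented for illustration.

```python
stamp = '[Wed Dec 01 10:24:24 2010]'
line = stamp + ' retry after ' + stamp

# Without a count, both copies of the timestamp are removed.
all_removed = line.replace(stamp, '')

# With count=1, only the first (leading) copy is removed; lstrip() with no
# argument then drops the leftover leading whitespace.
first_removed = line.replace(stamp, '', 1).lstrip()

print(repr(all_removed))
print(repr(first_removed))
```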

Upvotes: 4
