Reputation: 169
I'm running into an issue that I hope is simple, but I've hit a wall trying to figure it out. I'm attempting to strip the DateTime timestamp from the beginning of each line in a file, but the result cuts off some of the characters that I'd like to keep. I'm fairly sure my regex is OK, and the regex.group() output looks good. Lines containing the letters "c" and "e" seem to get those characters trimmed off, while other lines work as expected.
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
>>> import re
>>>
>>> line2 = '[Wed Dec 01 10:24:24 2010] ceeeeest'
>>> a = re.match(r'(\[[A-Za-z]{3}\s)?([A-Za-z]{3})(\s+)([0-9]{1,4})(\s+)([0-9]{2})(:)([0-9]{2})(:)([0-9]{2})(\s[0-9]{1,4})?(\])?', line2, re.I)
>>> a.group()
'[Wed Dec 01 10:24:24 2010]'
>>> a.groups()
('[Wed ', 'Dec', ' ', '01', ' ', '10', ':', '24', ':', '24', ' 2010', ']')
>>> b = a.group()
>>> b
'[Wed Dec 01 10:24:24 2010]'
>>> c = line2.strip(b)
>>> c
'st'
>>>
I expect c to be "ceeeeest"
OR
>>> line = '[Wed Dec 01 10:24:24 2010] testc'
>>> a = re.match(r'(\[[A-Za-z]{3}\s)?([A-Za-z]{3})(\s+)([0-9]{1,4})(\s+)([0-9]{2})(:)([0-9]{2})(:)([0-9]{2})(\s[0-9]{1,4})?(\])?', line, re.I)
>>> a.group()
'[Wed Dec 01 10:24:24 2010]'
>>> a.groups()
('[Wed ', 'Dec', ' ', '01', ' ', '10', ':', '24', ':', '24', ' 2010', ']')
>>> b = a.group()
>>> c = line.strip(b)
>>> c
'test'
>>>
I expect c to be "testc"
Is there something very basic I am missing here? Please enlighten me. Thank you.
Upvotes: 3
Views: 14733
Reputation: 180481
b is '[Wed Dec 01 10:24:24 2010]', so strip removes any of the characters that are in b from both ends of line2. Everything bar st gets removed:
'[Wed Dec 01 10:24:24 2010] ceeeeest'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# all in [Wed Dec 01 10:24:24 2010]
So only st remains, as those are the only two characters not in b. strip keeps stripping from both ends until it hits a character that is not in the set:
In [3]: s = "fooboaroof"
In [4]: s.strip("foo")
Out[4]: 'boar'
If the date is always at the start, which it must be if you are using match, the simplest approach once you have a match is to split:
line2 = '[Wed Dec 01 10:24:24 2010] ceeeeest'
print(line2.split("] ", 1)[1])
Or:
print(line2[len(a.group()):].lstrip())
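The slicing approach can be wrapped in a small helper that also copes with lines that have no timestamp. This is only a sketch: the name drop_timestamp and the simplified pattern (any leading [...] block) are assumptions for illustration, not the stricter regex from the question.

```python
import re

# Hypothetical, simplified pattern: a leading "[...]" block plus any
# whitespace after it. Swap in the stricter regex if validation matters.
TIMESTAMP = re.compile(r'\[[^\]]*\]\s*')

def drop_timestamp(line):
    """Return line without a leading [timestamp]; unchanged if none is found."""
    m = TIMESTAMP.match(line)
    # m.end() is the index just past the matched prefix.
    return line[m.end():] if m else line

print(drop_timestamp('[Wed Dec 01 10:24:24 2010] ceeeeest'))
print(drop_timestamp('just a plain old line'))
```

Slicing at m.end() removes exactly the matched prefix, so characters such as "c" and "e" later in the line are never touched.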
Upvotes: 0
Reputation: 15310
If you really want to strip (that is, discard) the date and time information, and if the information is in the format you show, try this:
#! python3
lines = [
    '[Wed Dec 01 10:24:24 2010] ceeeeest',
    '[Wed Dec 01 10:24:24 2010] testc',
    'just a plain old line',
    ' indented',
    ' with [brackets]',
    '[BOGUS! This should be disallowed!',
    '[][][] Three pairs',
]
for line in lines:
    if line.startswith('['):
        try:
            # Skip past the first ']' and the space that follows it.
            line = line[line.index(']')+2:]
        except ValueError:
            print('Invalid formatting: open [ with no close!')
        else:
            print(line)
    else:
        print('Ho hum, nothing interesting about:', line)
Upvotes: -1
Reputation: 1090
If I get what you're attempting to do right, you can just use a regex to extract the word/sentence afterwards:
import re
regex = re.compile(r'(?:\s*\[.*?\])(.*)')
sentence = regex.findall(line)[0].strip()
Note that I have omitted the verification that you had in your regex; you can still use it.
Upvotes: -1
Reputation: 6141
If the same pattern repeats in your string, you can use a regex to find all the matches and then replace each one with an empty string:
import re
pattern = r'\[\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}\] '
for p in re.findall(pattern, line):
    line = line.replace(p, '')
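The same find-and-remove idea can be written as a single re.sub call, which avoids the explicit loop. A minimal sketch, assuming the same pattern as above; the two-timestamp sample line is made up for illustration.

```python
import re

pattern = re.compile(r'\[\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}\] ')

# Made-up sample line containing two timestamps.
line = '[Wed Dec 01 10:24:24 2010] first [Thu Dec 02 11:00:00 2010] second'

# sub() replaces every non-overlapping match with the empty string.
cleaned = pattern.sub('', line)
print(cleaned)  # prints 'first second'
```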
Upvotes: 0
Reputation: 1433
As others have pointed out, you are using strip incorrectly. Instead, since you already have the match working, slice that many characters off the start of the string.
result = line[len(a.group()):]
print(result)
# prints ' testc'
Upvotes: 0
Reputation: 9010
The method str.strip removes all characters from the beginning and end of the string that are in the argument. You probably want to use str.replace instead.
>>> line = '[Wed Dec 01 10:24:24 2010] testc'
>>> line.replace('[Wed Dec 01 10:24:24 2010]', '')
' testc'
You can get rid of the leading white space by using str.lstrip, or use str.strip if you want to get rid of trailing white space too (called with no argument, both remove white space).
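Putting the two steps together might look like the sketch below. The variable names are illustrative, and passing a count of 1 is an added precaution (not part of the answer above) so that only the first occurrence is removed if the same timestamp text happens to appear again later in the line.

```python
line = '[Wed Dec 01 10:24:24 2010] testc'
stamp = '[Wed Dec 01 10:24:24 2010]'

# The third argument limits replace() to the first occurrence;
# lstrip() with no argument then drops the leading white space.
result = line.replace(stamp, '', 1).lstrip()
print(result)  # prints 'testc'
```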
Upvotes: 4