Reputation: 698
I am parsing a text log where each line contains an id enclosed in parentheses and one or more (possibly hundreds of) chunks of data (alphanumeric, always 20 characters), such as this:
id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
At this point in the program I am not interested in the data, just in quickly fetching all the ids. The problem is that, due to the noisy communication channel, an error may appear, denoted by the string Error, which can be anywhere on the line. The goal is to ignore these lines.
What worked for me so far was a simple negative lookahead:
^id=\((\d+)\),(?!.*Error)
But I forgot that there is a tiny probability that the string Error may actually appear as a valid sequence of characters somewhere in the data, which has backfired on me just now.
The only way to distinguish between a valid and an invalid appearance of the Error string in a data chunk is to check the chunk's length. If it is 20 characters, then it was this rare valid occurrence and I want to keep the line (unless Error also appears elsewhere on the line); if it is longer, I want to discard the line.
Is it still possible to treat this situation with a regular expression or is it already too much for the regex monster?
Thanks a lot.
Edit: Adding examples of error lines - all these should be ignored.
iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)
id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)
However, this one should not be ignored. Here Error is just incidentally contained in the data part (the chunk is still 20 characters long), but the line is currently ignored:
id=(702834), data1=(bx5EsH6dMzpQDErrorKA)
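To make the problem concrete, here is a minimal Python sketch (the choice of Python and the helper name extract_id are assumptions for illustration, not part of the original program). It runs the current pattern over a clean line, a corrupted line, and the valid line whose data happens to contain Error:

```python
import re

# The pattern currently in use: capture the id, then reject the line
# if the string "Error" appears anywhere after it.
CURRENT = re.compile(r"^id=\((\d+)\),(?!.*Error)")

lines = [
    "id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",  # 25 chars: corrupted
    "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)",       # 20 chars: valid data
]

def extract_id(line):
    """Return the captured id, or None when the line is rejected."""
    m = CURRENT.match(line)
    return m.group(1) if m else None

results = [extract_id(line) for line in lines]
```

The third line is rejected just like the second one, although its chunk has the correct length.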
Upvotes: 2
Views: 592
Reputation: 9644
Since your chunks of data are always 20 characters long, a chunk of 25 characters (20 plus the 5 characters of Error) means there is an error in it. Therefore you could check whether there is a chunk of that length, and then check whether Error appears outside of the parentheses. If either is the case, the line should not match. If not, it's valid.
Something like
(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))
might do the trick.
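As a quick check, here is a small Python sketch that runs this pattern against the sample lines from the question. Python and the helper name fetch_id are assumptions made for illustration; the pattern itself is copied unchanged. Note that it only detects a single inserted Error (a 25-character chunk), as described above.

```python
import re

# Pattern from this answer, unchanged.
PATTERN = re.compile(
    r"(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))"
)

def fetch_id(line):
    """Return the id as a string, or None if the line should be ignored."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

good_lines = [
    "id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)",  # Error inside a valid 20-char chunk
]
bad_lines = [
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",
]

good_ids = [fetch_id(line) for line in good_lines]
bad_ids = [fetch_id(line) for line in bad_lines]
```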
Upvotes: 1
Reputation: 1363
Alright, it's not exactly what you had in mind, but here's a suggestion:
Can't you simply match only the lines that follow the expected pattern exactly, so that any line disturbed by an Error somewhere fails to match?
Here's a regexp that does it:
^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$
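If Error is inserted anywhere on the line, the line no longer fits the pattern and the regexp will not match it, so it is ignored as wanted. The only exception is Error appearing inside a data chunk that is still exactly 20 characters long, which is the valid case you want to keep.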
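A minimal Python sketch of this whitelist approach, assuming the log lines use exactly the separators shown in the question (", " between fields). The names STRICT and fetch_id are made up for the example:

```python
import re

# Accept a line only if it matches the expected layout exactly:
# an id, then one or more 20-character alphanumeric data chunks.
STRICT = re.compile(r"^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$")

def fetch_id(line):
    """Return the id as a string, or None if the line is malformed."""
    m = STRICT.match(line)
    return m.group(1) if m else None

samples = {
    "id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)": "702831",
    "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)": "702834",  # valid 20-char chunk
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1)": None,
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)": None,
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)": None,
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)": None,  # 25-char chunk
}

# Collect every sample whose actual result differs from the expected one.
mismatches = [line for line, expected in samples.items()
              if fetch_id(line) != expected]
```

Because the whole line is validated, this also rejects other kinds of corruption, not only an inserted Error, at the cost of scanning every chunk on long lines.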
If this doesn't suit you, you'll have to add more lookahead and lookbehind groups. I'll try to do that and edit this answer if I come up with a good regexp.
Upvotes: 1