el_loza

Reputation: 47

Inverse Match Help in Python

Hello, I am looking to trim a McAfee log file and remove all of the "is OK" lines and other reported instances that I am not interested in seeing. Previously we used a shell script that took advantage of the -v option for grep, but now we want to write a Python script that will work on both Linux and Windows. After a couple of attempts I was able to get a regex to work in an online regex builder, but I am having a difficult time implementing it in my script.

Edit: I want to remove the "is OK", "is a broken", "is a block", and "file could not be opened" lines so that I am left with a file containing only the problems I am interested in. Something like this in shell:

grep -v "is OK" ${OUTDIR}/${OUTFILE} | grep -v "is a broken" | grep -v "file could not be opened" | grep -v "is a block" > ${OUTDIR}/${OUTFILE}.trimmed 2>&1

I read in and search through the file here:

import re

f2 = open(outFilePath)
contents = f2.read()
print contents
p = re.compile("^((?!(is OK)|(file could not be opened)| (is a broken)|(is a block)))*$", re.MULTILINE | re.DOTALL)
m = p.findall(contents)
print len(m)
for iter in m:
    print iter
f2.close()

A sample of the file I am trying to search:

eth0
10.0.11.196
00:0C:29:AF:6A:A7
parameters passed to uvscan: --DRIVER /opt/McAfee/uvscan/datfiles/current --    ANALYZE --AFC=32 ATIME-PRESERVE --PLAD --RPTALL RPTOBJECTS SUMMARY --UNZIP -- RECURSIVE --SHOWCOMP --MIME --THREADS=4 /tmp
temp XML output is: /tmp/HIQZRq7t2R
McAfee VirusScan Command Line for Linux64 Version: 6.0.5.614
Copyright (C) 2014 McAfee, Inc.
(408) 988-3832 LICENSED COPY - April 03 2016

AV Engine version: 5700.7163 for Linux64.
Dat set version: 8124 created Apr 3 2016
Scanning for 670707 viruses, trojans and variants.


No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/ATIME-PRESERVE

No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/RPTOBJECTS

No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/SUMMARY
/tmp/tmp.BQshVRSiBo ... is OK.
/tmp/keyring-F6vVGf/socket ... file could not be opened.
/tmp/keyring-F6vVGf/socket.ssh ... file could not be opened.
/tmp/keyring-F6vVGf/socket.pkcs11 ... file could not be opened.
/tmp/yum.log ... is OK.
/tmp/tmp.oW75zGUh4S ... is OK.
/tmp/.X11-unix/X0 ... file could not be opened.
/tmp/tmp.LCZ9Ji6OLs ... is OK.
/tmp/tmp.QdAt1TNQSH ... is OK.
/tmp/ks-script-MqIN9F ... is OK.
/tmp/tmp.mHXPvYeKjb/mcupgrade.conf ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/uninstall-uvscan ... is OK.
/tmp/tmp.mHXPvYeKjb/mcscan ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/install-uvscan ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/readme.txt ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/uvscan_secure ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/signlic.txt ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/uvscan ... is OK.
/tmp/tmp.mHXPvYeKjb/uvscan/liblnxfv.so.4 ... is OK.

But I am not getting the correct output. I have also tried removing the MULTILINE and DOTALL options, and I still do not get the correct result. Below is the output when running with DOTALL and MULTILINE.

9
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')
('', '', '', '', '')

Any help would be much appreciated!! Thanks!!

Upvotes: 0

Views: 614

Answers (4)

Salvador Real

Reputation: 136

I know it is too late to answer, but I see that none of the answers is a correct solution.

Your regex is wrong for this case. It has unnecessary extra groups, and the "." that should consume each character is missing. As written, the lookahead also only detects "is OK|file could not be opened|is a broken" at the beginning of the line:

"hello world is OK": not detected  
"is OK hello world": detected

In an inverse match, use a non-capturing group '(?:)' instead of a capturing group '()'. Otherwise re.findall returns the contents of the group, which is an empty string, instead of the line.

To remove every line that contains one of the phrases, you can use the following expression:

r"^(?!.*(?:is OK|is a broken|file could not be opened)).*"
"is OK. hello world": removed  
"hello world is OK.": removed  
"is OK.": removed

To remove only the lines that end in "is OK.", "file could not be opened." or "is a broken.", you can use the following expression:

r"^(?!.*(?:is OK|is a broken|file could not be opened)\.$).*"
"is OK. hello world": kept  
"hello world is OK.": removed  
"is OK.": removed

Remember to use a non-capturing group '(?:)' instead of a capturing group '()'. Otherwise re.findall returns the contents of the group, which here is always an empty string:

# Capturing group
regex = r"^(?!.*(is OK|file could not be opened|is a broken|is a block)).*"
print(re.findall(regex,text,flags=re.MULTILINE))

output:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Use the join() function to get the full text:

# Non-capturing group
regex = r"^(?!.*(?:is OK|file could not be opened|is a broken|is a block)).*"
print("\n".join(re.findall(regex,text,flags=re.MULTILINE)))

output:

eth0
10.0.11.196
00:0C:29:AF:6A:A7
parameters passed to uvscan: --DRIVER /opt/McAfee/uvscan/datfiles/current --    ANALYZE --AFC=32 ATIME-PRESERVE --PLAD --RPTALL RPTOBJECTS SUMMARY --UNZIP -- RECURSIVE --SHOWCOMP --MIME --THREADS=4 /tmp
temp XML output is: /tmp/HIQZRq7t2R
McAfee VirusScan Command Line for Linux64 Version: 6.0.5.614
Copyright (C) 2014 McAfee, Inc.
(408) 988-3832 LICENSED COPY - April 03 2016

AV Engine version: 5700.7163 for Linux64.
Dat set version: 8124 created Apr 3 2016
Scanning for 670707 viruses, trojans and variants.


No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/ATIME-PRESERVE

No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/RPTOBJECTS

No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/SUMMARY
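
As a quick check of the two expressions in this answer (phrase anywhere in the line versus phrase only at the end of the line), here is a minimal sketch. The names PHRASES, anywhere, at_end and verdict are made up for the illustration, and the last sample string is an extra one added for contrast. A line counts as "kept" when the expression matches it and as "dropped" when it does not.

```python
import re

# The phrase list from this answer, shared by both expressions.
PHRASES = r"(?:is OK|is a broken|file could not be opened)"

# Drops a line if a phrase occurs anywhere in it.
anywhere = re.compile(r"^(?!.*" + PHRASES + r").*")
# Drops a line only if a phrase, followed by a period, ends the line.
at_end = re.compile(r"^(?!.*" + PHRASES + r"\.$).*")

samples = ["is OK. hello world", "hello world is OK.", "is OK.", "hello world"]


def verdict(pattern, line):
    """Return 'kept' if the pattern matches the line, else 'dropped'."""
    return "kept" if pattern.match(line) else "dropped"


results = {s: (verdict(anywhere, s), verdict(at_end, s)) for s in samples}
for s, (a, b) in results.items():
    print(f"{s!r}: anywhere={a}, at_end={b}")
```

Only the first sample differs between the two expressions: it contains "is OK" but does not end with it, so the second expression keeps it.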


Upvotes: 0

Dominique Fortin

Reputation: 2238

Try this (and it's done in one line)

p = re.compile("^(?:[if](?!s OK|s a broken|s a block|ile could not be opened)|[^if])*$")

It means that each character in the line must either be something other than an "i" or an "f", or be an "i" or an "f" that is not followed by one of the listed suffixes. That check is repeated for every character in the line.

Edit: After testing at regex101.com, I found why it was not working. Here is the one line regex that will work.

p = re.compile(r"^(?:[^if\n]|[if](?!s OK|s a broken|s a block|ile could not be opened))*$", re.MULTILINE)
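
A minimal sketch of how that one-line regex could be applied with re.findall, assuming the whole log has been read into a single string. The four-line text below is a shortened stand-in built from lines in the question, and the names text and kept are made up for the illustration. Empty matches (blank lines) are filtered out afterwards.

```python
import re

# Each character is either not "i"/"f"/newline, or an "i"/"f" that does not
# start one of the unwanted phrases. MULTILINE makes ^ and $ work per line.
p = re.compile(
    r"^(?:[^if\n]|[if](?!s OK|s a broken|s a block|ile could not be opened))*$",
    re.MULTILINE,
)

# Shortened stand-in for the log file contents.
text = (
    "eth0\n"
    "/tmp/yum.log ... is OK.\n"
    "/tmp/.X11-unix/X0 ... file could not be opened.\n"
    "Scanning for 670707 viruses, trojans and variants."
)

# No capturing groups, so findall returns the full text of each matching line.
kept = [line for line in p.findall(text) if line]
print("\n".join(kept))
```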

Upvotes: 0

Wayne Werner

Reputation: 51807

Sometimes regexes are more complicated, but if you're really only looking for these patterns then I'd probably just try the simple approach:

terms = (
    'is OK',
    'file could not be opened',
    'is a broken',
    'is a block',
)

with open('/tmp/sample.log') as f:
    for line in f:
        if line.strip() and not any(term in line for term in terms):
            print(line, end='')

It might not be faster than the regex, but it's about as simple as it gets. Alternatively you could also use a slightly more strict approach:

terms = (
    'is a broken',
    'is a block',
)

with open('/tmp/sample.log') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        elif line.endswith('is OK.'):
            continue
        elif line.endswith('file could not be opened.'):
            continue
        elif any(term in line for term in terms):
            continue
        print(line)

The approach I would take largely depends on who I expect to be using the script :)
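
If the filter needs to be reused, for example to write a trimmed copy of the log instead of printing it, the first approach could be wrapped in a small generator. This is only a sketch: filter_lines and TERMS are hypothetical names, and io.StringIO stands in for an open log file so the example runs on its own.

```python
import io

# Phrases taken from the question; a line containing any of them is dropped.
TERMS = ("is OK", "file could not be opened", "is a broken", "is a block")


def filter_lines(lines, terms=TERMS):
    """Yield the non-blank lines that contain none of the given phrases."""
    for line in lines:
        if line.strip() and not any(term in line for term in terms):
            yield line


# Stand-in for an open log file; a real script would pass the file object.
sample = io.StringIO(
    "eth0\n"
    "/tmp/yum.log ... is OK.\n"
    "\n"
    "/tmp/.X11-unix/X0 ... file could not be opened.\n"
    "Scanning for 670707 viruses, trojans and variants.\n"
)

kept = list(filter_lines(sample))
print("".join(kept), end="")
```

Because the generator accepts any iterable of lines, the same function works on a file opened with open() on Linux or Windows, and its output can be passed to writelines() on a second file.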

Upvotes: 0

cdlane

Reputation: 41872

Perhaps think simpler, line by line:

import re
import sys

pattern = re.compile(r"(is OK)|(file could not be opened)|(is a broken)|(is a block)")

with open(sys.argv[1]) as handle:
    for line in handle:
        if not pattern.search(line):
            sys.stdout.write(line)

Outputs:

eth0
10.0.11.196
00:0C:29:AF:6A:A7
parameters passed to uvscan: --DRIVER /opt/McAfee/uvscan/datfiles/current --    ANALYZE --AFC=32 ATIME-PRESERVE --PLAD --RPTALL RPTOBJECTS SUMMARY --UNZIP -- RECURSIVE --SHOWCOMP --MIME --THREADS=4 /tmp
temp XML output is: /tmp/HIQZRq7t2R
McAfee VirusScan Command Line for Linux64 Version: 6.0.5.614
Copyright (C) 2014 McAfee, Inc.
(408) 988-3832 LICENSED COPY - April 03 2016

AV Engine version: 5700.7163 for Linux64.
Dat set version: 8124 created Apr 3 2016
Scanning for 670707 viruses, trojans and variants.


No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/ATIME-PRESERVE

No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/RPTOBJECTS

No file or directory found matching /root/SVN/swd-lhn-build/trunk/utils/SUMMARY

Upvotes: 2
