Emka
Emka

Reputation: 23

Advanced parsing with python (Multiple Lines)

I am stuck here with a parsing problem on python 2.7, let me explain:

I am parsing events from the incapsula API. The goal is to make them readable in an excel table, for making stats and graph.

On the signature field, you can read the type of event/attack and a number. The number includes the number of attacks, so I decided to multiply each line by its corresponding sum of attack's number after the 'signature=' field.

Like this capture :

 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3}

So far everything goes as expected, I got the right count of attacks.

BUT

On some rare events, they are multiple values on the signature field, like this capture :

 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
 visit_id=324001290181618591, src_country=Ukraine, event_timestamp=1484493309742, src_ip=91.223.133.30, dest_name=www.xxx.com, dest_id=1551642, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
 visit_id=86001060468746692, src_country=Netherlands, event_timestamp=1483867285054, src_ip=178.22.232.53, dest_name=www.yyy.com, dest_id=1551642, signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
 visit_id=86001060468746692, src_country=Netherlands, event_timestamp=1483867285054, src_ip=178.22.232.53, dest_name=www.yyy.com, dest_id=1551642, signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
 visit_id=86001060468746692, src_country=Netherlands, event_timestamp=1483867285054, src_ip=178.22.232.53, dest_name=www.yyy.com, dest_id=1551642, signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
 visit_id=86001060468746692, src_country=Netherlands, event_timestamp=1483867285054, src_ip=178.22.232.53, dest_name=www.yyy.com, dest_id=1551642, signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}

I still got the right count of attacks on those rare lines too, but I want to arrange the signature field from this :

signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}

To this:

signature={api.threats.sql_injection}
signature={api.threats.sql_injection}
signature={api.threats.sql_injection}
signature={api.threats.bot_access_control}
signature={api.threats.illegal_resource_access}
signature={api.threats.cross_site_scripting}
signature={api.threats.bot_access_control}
signature={api.threats.illegal_resource_access}
signature={api.threats.illegal_resource_access}
signature={api.threats.illegal_resource_access}

(the first six lines are the first event duplicated 6 times (3+1+1+1 =6), the last 4 are the second event duplicated 4 times (1+3=4)

My current source code:

#count the number of attack per line
f = open('monthlyLogShort.txt','r')
g = open("count.txt", 'w')
kensu = f.readlines()
f.close()
for line in kensu:
        st = line.find('signature=')
        end = line.find('}')
        unprecise = line[st:end+1]
        #count = int(re.search(r'\d+', unprecise).group())
        count = sum(map(int,re.findall(r'[0-9]+', unprecise)))
        print >> g, count

g.close()

#replicate lines according to the number of attack            
h = open('flog.txt','w')

with open("monthlyLogShort.txt") as textfile1, open("count.txt") as textfile2:
    for x, y in izip(textfile1, textfile2):
        x = x.strip()
        y = y.strip()
        print >> h, x * int(y)
h.close()

Upvotes: 2

Views: 179

Answers (1)

Stephen Rauch
Stephen Rauch

Reputation: 49774

If I read your requirements correctly, you are trying to emit one line for each threat occurrence while preserving the rest of the record. This solution does not output the counts directly, instead it transforms the data so that it is uniformly one threat per line.

Code:

sig_str = 'signature={'
for line in kensu:
    record, signature = line.split(sig_str)
    threats = signature.split('}')[0]
    for counts in threats.split(','):
        if '=' in counts:
            threat, count = tuple(counts.split('='))
            for i in range(int(count)):
                print '%s%s%s}' % (record, sig_str, threat.strip())

Sample data:

kensu = [x.strip() for x in """
    record=0, signature={api.threats.sql_injection=1}
    record=1, signature={api.threats.sql_injection=3, api.threats.bot_access_control=1, api.threats.illegal_resource_access=1, api.threats.cross_site_scripting=1,}
    record=2, signature={api.threats.bot_access_control=1, api.threats.illegal_resource_access=3,}
""".split('\n')[1:-1]]

Output:

record=0, signature={api.threats.sql_injection}
record=1, signature={api.threats.sql_injection}
record=1, signature={api.threats.sql_injection}
record=1, signature={api.threats.sql_injection}
record=1, signature={api.threats.bot_access_control}
record=1, signature={api.threats.illegal_resource_access}
record=1, signature={api.threats.cross_site_scripting}
record=2, signature={api.threats.bot_access_control}
record=2, signature={api.threats.illegal_resource_access}
record=2, signature={api.threats.illegal_resource_access}
record=2, signature={api.threats.illegal_resource_access}

Upvotes: 1

Related Questions