k5man001

Reputation: 107

Adding a column of data if an element from another column is in the dictionary

I have two issues I am trying to resolve:

  1. After each IP address is stored in the dictionary frequency4, I want to check every line of data in the text file: if that IP address appears in column[4], the bytes value on that line should be added to a running total for that exact IP.

  2. If column[8] (Bytes) is followed by an "M", meaning million, the value should be multiplied by 1000000, so 3.3 M becomes 3300000 (see the data from the text file below). Keep in mind that this is only a sample; the real text file contains thousands of lines of data.

The output I am looking for is:

Total bytes for ip 172.217.9.133 is 3300000
Total bytes for ip 205.251.24.253 is 9516
Total bytes for ip 52.197.234.56 is 14546

CODE

from collections import OrderedDict
from collections import Counter

frequency4 = Counter({})
ttlbytes = 0


with open('/Users/rm/Desktop/nettestWsum.txt', 'r') as infile:    
    next(infile) 
    for line in infile:       
        if "Summary:" in line:
            break
        try:               
            srcip = line.split()[4].rsplit(':', 1)[0]
            frequency4[srcip] = frequency4.get(srcip,0) + 1 
            f4 = OrderedDict(frequency4.most_common())
            for srcip in f4:
                ttlbytes += int(line.split()[8])
        except(ValueError):
            pass 
print("\nTotal bytes for ip",srcip, "is:", ttlbytes)      
for srcip, count in f4.items():    
    print("\nIP address from destination:", srcip, "was found:", count, "times.")

DATA FILE

Date first seen          Duration Proto      Src IP Addr:Port          Dst IP Addr:Port   Packets    Bytes Flows
2017-04-11 07:23:17.880   929.748 UDP      172.217.9.133:443   ->  205.166.231.250:41138     3019    3.3 M     1
2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1
2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1
2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1
2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1
Summary: total flows: 22709, total bytes: 300760728, total packets: 477467, avg bps: 1336661, avg pps: 265, avg bpp: 629
Time window: 2017-04-11 07:13:47 - 2017-04-11 07:43:47
Total flows processed: 22709, Blocks skipped: 0, Bytes read: 1544328
Sys: 0.372s flows/second: 61045.7    Wall: 0.374s flows/second: 60574.9

Upvotes: 0

Views: 53

Answers (2)

Impuls3H

Reputation: 303

OK, I'm not sure whether you need to edit the same file. If you're just looking to process the data and view it, you can explore pandas, which has many functions that speed up data processing.

import pandas as pd

# A regex separator and skipfooter both need the Python parsing engine
df = pd.read_csv(filepath_or_buffer='/Users/rm/Desktop/nettestWsum.txt', index_col=False, header=None, skiprows=1, sep=r'\s\s+', skipfooter=4, engine='python')
# Drop the '->' column
df.drop(labels=3, axis=1, inplace=True)
columnnames = 'Date first seen,Duration Proto,Src IP Addr:Port,Dst IP Addr:Port,Packets,Bytes,Flows'
df.columns = columnnames.split(',')

This loads the data into a DataFrame (table). I suggest reading the documentation for pandas.read_csv. To process the data, you can try the following.

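# Convert values such as '3.3 M' to a number of bytes; leave the rest unchanged
df['Bytes'] = df['Bytes'].apply(lambda x: float(x[:-2]) * 1000000 if str(x)[-1] == 'M' else x)
df['Bytes'] = pd.to_numeric(df['Bytes'])
# Strip the port so that flows are grouped per source IP, as the question asks
df['Src IP'] = df['Src IP Addr:Port'].str.rsplit(':', n=1).str[0]
result = df.groupby(by='Src IP')['Bytes'].sum()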
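To try the steps above without the original file, here is a minimal sketch that runs the same load, convert, and group pipeline on the sample rows from the question, read from an in-memory string. The names SAMPLE and to_bytes are made up for this illustration, and it assumes "M" is the only suffix that appears in the Bytes column.

```python
import io

import pandas as pd

# Sample copied from the question: header, five flows, four footer lines.
SAMPLE = "\n".join([
    "Date first seen          Duration Proto      Src IP Addr:Port          Dst IP Addr:Port   Packets    Bytes Flows",
    "2017-04-11 07:23:17.880   929.748 UDP      172.217.9.133:443   ->  205.166.231.250:41138     3019    3.3 M     1",
    "2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1",
    "2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1",
    "2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1",
    "2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1",
    "Summary: total flows: 22709, total bytes: 300760728, total packets: 477467, avg bps: 1336661, avg pps: 265, avg bpp: 629",
    "Time window: 2017-04-11 07:13:47 - 2017-04-11 07:43:47",
    "Total flows processed: 22709, Blocks skipped: 0, Bytes read: 1544328",
    "Sys: 0.372s flows/second: 61045.7    Wall: 0.374s flows/second: 60574.9",
])

# Split on runs of two or more spaces; skip the header row and the 4 footer lines.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s\s+",
    engine="python",
    header=None,
    skiprows=1,
    skipfooter=4,
    index_col=False,
)
df = df.drop(columns=3)  # column 3 holds the '->' arrow
df.columns = [
    "Date first seen", "Duration Proto", "Src IP Addr:Port",
    "Dst IP Addr:Port", "Packets", "Bytes", "Flows",
]


def to_bytes(value):
    """Turn '3.3 M' into 3300000.0 and '4758' into 4758.0 (assumes only an 'M' suffix)."""
    text = str(value)
    if text.endswith("M"):
        return float(text[:-1]) * 1_000_000
    return float(text)


df["Bytes"] = df["Bytes"].map(to_bytes).round().astype("int64")
# Drop the port so that all flows from one address fall into the same group.
df["Src IP"] = df["Src IP Addr:Port"].str.rsplit(":", n=1).str[0]
result = df.groupby("Src IP")["Bytes"].sum()
print(result)
```

Replacing io.StringIO(SAMPLE) with the path to the real file should be the only change needed, provided the file keeps the same layout.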

Your data will come out in a DataFrame (table) that you can work with. I think this is faster than looping through the lines, but you can benchmark that separately. Below is how the data looks after being loaded.

[Screenshot: the loaded DataFrame]

Below is the output of the groupby, which you can tweak. I'm using the Spyder IDE, and the screengrabs are from its variable explorer. You can inspect the result by printing the DataFrame or saving it as another CSV.

[Screenshot: output of the groupby]
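Printing the result or saving it as CSV could look like the sketch below. The Series is built by hand from the per-IP totals of the sample rows, so the snippet runs on its own; in practice it would be the output of the groupby. The index name "Src IP" and the in-memory buffer are choices made for this illustration only.

```python
import io

import pandas as pd

# Per-IP byte totals worked out from the sample rows in the question.
result = pd.Series(
    {"172.217.9.133": 3300000, "205.251.24.253": 9516, "52.197.234.56": 14546},
    name="Bytes",
)
result.index.name = "Src IP"

# Print one line per IP in the format the question asks for.
lines = [f"Total bytes for ip {ip} is {total}" for ip, total in result.items()]
print("\n".join(lines))

# Write CSV to an in-memory buffer; pass a file name instead to save to disk.
buffer = io.StringIO()
result.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```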

Upvotes: 0

I don't know what you need the frequency for, but given your input, here's how to get the desired output:

from collections import Counter

# Running total of bytes per source IP
count = Counter()

with open('/Users/rm/Desktop/nettestWsum.txt', 'r') as infile:
    next(infile)  # skip the header line
    for line in infile:
        if "Summary:" in line:
            break

        parts = line.split()
        srcip = parts[4].rsplit(':', 1)[0]

        # '3.3 M' splits into two fields, so parts[9] is 'M' only on those lines
        multiplier = 10**6 if parts[9] == 'M' else 1
        nbytes = int(float(parts[8]) * multiplier)
        count[srcip] += nbytes

# nbytes avoids shadowing the built-in bytes type
for srcip, nbytes in count.most_common():
    print('Total bytes for ip', srcip, 'is', nbytes)
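The same loop can be wrapped in a function so it can be checked against the sample rows without touching the file. This is only a sketch: total_bytes_by_src and SAMPLE_LINES are names invented here, the guard on parts[5] is an added assumption about the line layout that lets the header be skipped without next(), and round() is used instead of int() so a float product that lands just below a whole number is not truncated.

```python
from collections import Counter


def total_bytes_by_src(lines):
    """Sum the Bytes column per source IP, stopping at the 'Summary:' footer."""
    totals = Counter()
    for line in lines:
        if "Summary:" in line:
            break
        parts = line.split()
        # Flow lines have '->' as the sixth field; skip the header and blank lines.
        if len(parts) < 10 or parts[5] != "->":
            continue
        srcip = parts[4].rsplit(":", 1)[0]
        # '3.3 M' splits into '3.3' and 'M', so the suffix lands in parts[9].
        multiplier = 10**6 if parts[9] == "M" else 1
        totals[srcip] += round(float(parts[8]) * multiplier)
    return totals


# Sample rows copied from the question's data file.
SAMPLE_LINES = [
    "Date first seen          Duration Proto      Src IP Addr:Port          Dst IP Addr:Port   Packets    Bytes Flows",
    "2017-04-11 07:23:17.880   929.748 UDP      172.217.9.133:443   ->  205.166.231.250:41138     3019    3.3 M     1",
    "2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1",
    "2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1",
    "2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1",
    "2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1",
    "Summary: total flows: 22709, total bytes: 300760728, total packets: 477467, avg bps: 1336661, avg pps: 265, avg bpp: 629",
]

totals = total_bytes_by_src(SAMPLE_LINES)
for srcip, nbytes in totals.most_common():
    print("Total bytes for ip", srcip, "is", nbytes)
```

Because an open file is an iterable of lines, the same function can be called as total_bytes_by_src(infile) inside the with block.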

Upvotes: 1
