Reputation: 242
For my project, I need to read a file, match its contents against my constants, and, once they match, store the pieces in a dictionary. I am going to show a sample of my data and what I have so far below.
My data:
TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2
This data has 5 important parts I want to gather:

Date: right after TIMESTAMP: 1579051725
Num (first part of the numbers, up to the 128, 129, 130, etc.): .1.3.4.1.2.1.1.1.1.1.1
Num2 (second part): 128, 129, 130, or others in my larger data set
Syntax: in this case it is named STRING
Counter: in this case they are strings; AA1, Eth1, or Eth2
I also have (need to have) a constant Num as a dictionary within the program that holds the value above, along with a constant Syntax.

I want to read through the data file, and:

if Num matches the constant I have within the program, grab Num2;
check if Syntax matches the constant syntax within the program, then grab Counter.

When I say grab, I mean put that data under the corresponding dictionary. In short, I want to read through the data file, split out 5 variables from it, match 2 of them against the constant dictionary values, and store the other 3 (including the time) under a dictionary.
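For illustration, the constants and the result structure I have in mind would look something like this (names are placeholders):

# Placeholder constants to match against (illustrative only).
CONST_NUM = '.1.3.4.1.2.1.1.1.1.1.1'
CONST_SYNTAX = 'STRING'

# Desired result: the grabbed fields collected under a dictionary keyed by column.
data_cols = {'TimeStamp': [], 'InterfaceNum': [], 'IndexNum': [],
             'SyntaxName': [], 'Counter': []}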
I have trouble with splitting the data right now. I can split everything except Num and Num2. I am also not sure how to create the constant dictionaries or how to store values under them. I would love to use a regular expression instead of if statements, but I could not figure out which symbols to use, since the data includes many dots within the words.
I have the following so far:
constant_dic1 = {[".1.3.4.1.2.1.1.1.1.1.1"]["STRING" ]}
data_cols = {'InterfaceNum':[],"IndexNum":[],"SyntaxName":[],"Counter":[],"TimeStamp":[]}

fileN = args.File_Name
with open(fileN, 'r') as f:
    for lines in f:
        if lines.startswith('.'):
            if ': ' in lines:
                lines = lines.split("=")
                first_part = lines[0].split()
                second_part = lines[1].split()
                for i in first_part:
                    f_f = i.split("{}.{}.{}.{}.{}.{}.{}.{}.{}.{}.{}.")
                    print(f_f[0])
Once I run the program, I receive the error "TypeError: list indices must be integers or slices, not str". When I comment out the dictionary part, the output shows Num together with Num2: the line does not get split, so just the Num part is never printed on its own.

Any help is appreciated! If there's any other resource, please share it below. Please let me know if the question needs any updates, without downvoting. Thanks!
UPDATED CODE
import pandas as pd
import io
import matplotlib
matplotlib.use('TkAgg')  # backend option for matplotlib: TkAgg, Qt4Agg, Qt5Agg
import matplotlib.pyplot as plt
import re        # regular expression
import argparse  # for optional arguments

parser = argparse.ArgumentParser()
parser.add_argument('File_Name', help="Enter the file name | At least one file is required to graph")
args = parser.parse_args()

data_cols = {'InterfaceNum':[],"IndexNum":[],"SyntaxName":[],"Counter":[],"TimeStamp":[]}

fileN = args.File_Name
input_data = fileN

expr = r"""
    TIMESTAMP:\s(\d+)   # date - TimeStamp
    |                   # ** OR **
    ((?:\.\d+)+)        # num - InterfaceNum
    \.(\d+)\s=\s        # num2 - IndexNum
    (\w+):\s            # syntax - SyntaxName
    (\w+)               # counter - Counter
"""
expr = re.compile(expr, re.VERBOSE)

data = {}
keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']

with io.StringIO(input_data) as data_file:
    for line in data_file:
        try:
            find_data = expr.findall(line)[0]
            vals = [date, num, num2, syntax, counter] = list(find_data)
            if date:
                cur_date = date
                data[cur_date] = {k: [] for k in keys}
            elif num:
                vals[0] = cur_date
                for k, v in zip(keys, vals):
                    data[cur_date][k].append(v)
        except IndexError:
            # expr.findall(...)[0] indexes an empty list when there's no
            # match.
            pass

data_frames = [pd.DataFrame.from_dict(v) for v in data.values()]

print(data_frames[0])
ERROR I GET
Traceback (most recent call last):
File "v1.py", line 47, in <module>
print(data_frames[0])
IndexError: list index out of range
NEW DATA
TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2
.1.2.3.4.5.6.7.8.9.10.11.131 = INT32: A
UPDATED CODE (v2)
import pandas as pd
import io
import matplotlib
import re  # regular expression

file = r"/home/rusif.eyvazli/Python_Projects/network-switch-packet-loss/s_data.txt"

def get_dev_data(file_path, timestamp=None, iface_num=None, idx_num=None,
                 syntax=None, counter=None):

    timestamp = timestamp or r'\d+'
    iface_num = iface_num or r'(?:\.\d+)+'
    idx_num = idx_num or r'\d+'
    syntax = syntax or r'\w+'
    counter = counter or r'\w+'

    # expr = r"""
    #     TIMESTAMP:\s({timestamp})   # date - TimeStamp
    #     |                           # ** OR **
    #     ({iface_num})               # num - InterfaceNum
    #     \.({idx_num})\s=\s          # num2 - IndexNum
    #     ({syntax}):\s               # syntax - SyntaxName
    #     ({counter})                 # counter - Counter
    # """
    expr = r"TIMESTAMP:\s(\d+)|((?:\.\d+)+)\.(\d+)\s=\s(\w+):\s(\w+)"

    # expr = re.compile(expr, re.VERBOSE)
    expr = re.compile(expr)

    rows = []
    keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']
    cols = {k: [] for k in keys}

    with open(file_path, 'r') as data_file:
        for line in data_file:
            try:
                find_data = expr.findall(line)[0]
                vals = [tstamp, num, num2, sntx, ctr] = list(find_data)
                if tstamp:
                    cur_tstamp = tstamp
                elif num:
                    vals[0] = cur_tstamp
                    rows.append(vals)
                    for k, v in zip(keys, vals):
                        cols[k].append(v)
            except IndexError:
                # expr.findall(line)[0] indexes an empty list when no match.
                pass
    return rows, cols

const_num = '.1.3.4.1.2.1.1.1.1.1.1'
const_syntax = 'STRING'

result_5 = get_dev_data(file)

# Use the results of the first dict retrieved to initialize the master
# dictionary.
master_dict = result_5[1]

df = pd.DataFrame.from_dict(master_dict)
df = df.loc[(df['InterfaceNum'] == '.1.2.3.4.5.6.7.8.9.10.11') & (df['SyntaxName'] == 'INT32')]
print(f"\n{df}")
OUTPUT
TimeStamp InterfaceNum IndexNum SyntaxName Counter
3 1579051725 .1.2.3.4.5.6.7.8.9.10.11 131 INT32 A
Upvotes: 0
Views: 752
Reputation: 5395
Parsing raw file input using Regular Expressions
The function below is an example of how to parse raw file input with regular expressions.
The regular expression capture groups are looped over to build records. This is a reusable pattern that can be applied in many cases. There's more info on how it works in the 'Groupings in compound regular expressions' section.
The function will filter records that match the parameter values. If the parameters are left at their defaults, the function returns all the rows of data.
import re

def get_dev_data(file_path, timestamp=None, iface_num=None, idx_num=None,
                 syntax=None, counter=None):

    timestamp = timestamp or r'\d+'
    iface_num = iface_num or r'(?:\.\d+)+'
    idx_num = idx_num or r'\d+'
    syntax = syntax or r'\w+'
    counter = counter or r'\w+'

    expr = rf"""
        TIMESTAMP:\s({timestamp})   # date - TimeStamp
        |                           # ** OR **
        ({iface_num})               # num - InterfaceNum
        \.({idx_num})\s=\s          # num2 - IndexNum
        ({syntax}):\s               # syntax - SyntaxName
        ({counter})                 # counter - Counter
    """
    expr = re.compile(expr, re.VERBOSE)

    rows = []
    keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']
    cols = {k: [] for k in keys}

    with open(file_path, 'r') as data_file:
        for line in data_file:
            try:
                find_data = expr.findall(line)[0]
                vals = [tstamp, num, num2, sntx, ctr] = list(find_data)
                if tstamp:
                    cur_tstamp = tstamp
                elif num:
                    vals[0] = cur_tstamp
                    rows.append(vals)
                    for k, v in zip(keys, vals):
                        cols[k].append(v)
            except IndexError:
                # expr.findall(line)[0] indexes an empty list when no match.
                pass
    return rows, cols
A tuple is returned. The first item, rows, is a list of rows of data in simple format; the second item, cols, is a dictionary keyed by column name with a list of row data per key. Both contain the same data, and each is digestible by Pandas, via pd.DataFrame.from_records() or pd.DataFrame.from_dict() respectively.
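For example, both return values can be fed to Pandas directly (a small usage sketch, assuming the input file shown further below):

import pandas as pd

rows, cols = get_dev_data('/test/inputdata.txt')

# Build the same dataframe from either return value.
df_from_rows = pd.DataFrame.from_records(rows, columns=list(cols))
df_from_cols = pd.DataFrame.from_dict(cols)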
Filtering example

This shows how records can be filtered using the function parameters. I think the last one, result_4, fits the description in the question: assume that iface_num is set to your const_num, and syntax to your const_syntax values. Only records that match will be returned.
if __name__ == '__main__':

    file = r"/test/inputdata.txt"

    result_1 = get_dev_data(file)[0]
    result_2 = get_dev_data(file, counter='Eth2')[0]
    result_3 = get_dev_data(file, counter='Eth2|AA1')[0]
    result_4 = get_dev_data(file,
                            iface_num='.1.3.4.1.2.1.1.1.1.1.1', syntax='STRING')[0]

    for var_name, var_val in zip(['result_1', 'result_2', 'result_3', 'result_4'],
                                 [result_1, result_2, result_3, result_4]):
        print(f"{var_name} = {var_val}")
Output
result_1 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '129', 'STRING', 'Eth1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_2 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_3 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_4 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '129', 'STRING', 'Eth1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
Using the first returned tuple item, column data can be accessed from the returned records by offset. For instance, TimeStamp would be accessed like first_item[0][0], i.e. first row, first column. Or the rows can be converted into a dataframe and accessed that way.
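For example (a small sketch, using the result_1 rows shown above):

rows = get_dev_data(file)[0]      # first tuple item: the rows

first_timestamp = rows[0][0]      # first row, first column -> '1579051725'
first_counter = rows[0][4]        # first row, last column  -> 'AA1'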
Input file /test/inputdata.txt
TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2
Convert row data into a Pandas dataframe
The first tuple item in the function's output is rows of data corresponding to the columns we've defined. This format can be converted into a Pandas dataframe using pd.DataFrame.from_records():
>>> row_data = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1']]
>>>
>>> column_names = ['TimeStamp', 'InterfaceNum', 'IndexNum',
... 'SyntaxName', 'Counter']
>>>
>>> pd.DataFrame.from_records(row_data, columns=column_names)
TimeStamp InterfaceNum IndexNum SyntaxName Counter
0 1579051725 .1.3.4.1.2.1.1.1.1.1.1 128 STRING AA1
>>>
Convert column data into a Pandas dataframe
The function also produces, as the second item of the returned tuple, a dictionary containing the same data, which can produce the same dataframe using pd.DataFrame.from_dict().
>>> col_data = {'TimeStamp': ['1579051725'],
... 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1'],
... 'IndexNum': ['128'], 'SyntaxName': ['STRING'],
... 'Counter': ['AA1']}
>>>
>>> pd.DataFrame.from_dict(col_data)
TimeStamp InterfaceNum IndexNum SyntaxName Counter
0 1579051725 .1.3.4.1.2.1.1.1.1.1.1 128 STRING AA1
>>>
Dictionary example
Here are a few examples of filtering file data and initializing a persistent dictionary, then filtering for more data and adding it to that persistent dictionary. I think this is also close to what's described in the question.
const_num = '.1.3.4.1.2.1.1.1.1.1.1'
const_syntax = 'STRING'

result_5 = get_dev_data(file, iface_num=const_num, syntax=const_syntax)

# Use the results of the first dict retrieved to initialize the master
# dictionary.
master_dict = result_5[1]

print(f"master_dict = {master_dict}")

result_6 = get_dev_data(file, counter='Eth2|AA1')

# Add more records to the master dictionary.
for k, v in result_6[1].items():
    master_dict[k].extend(v)

print(f"master_dict = {master_dict}")

df = pd.DataFrame.from_dict(master_dict)
print(f"\n{df}")
Output
master_dict = {'TimeStamp': ['1579051725', '1579051725', '1579051725'], 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1'], 'IndexNum': ['128', '129', '130'], 'SyntaxName': ['STRING', 'STRING', 'STRING'], 'Counter': ['AA1', 'Eth1', 'Eth2']}
master_dict = {'TimeStamp': ['1579051725', '1579051725', '1579051725', '1579051725', '1579051725'], 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1'], 'IndexNum': ['128', '129', '130', '128', '130'], 'SyntaxName': ['STRING', 'STRING', 'STRING', 'STRING', 'STRING'], 'Counter': ['AA1', 'Eth1', 'Eth2', 'AA1', 'Eth2']}
TimeStamp InterfaceNum IndexNum SyntaxName Counter
0 1579051725 .1.3.4.1.2.1.1.1.1.1.1 128 STRING AA1
1 1579051725 .1.3.4.1.2.1.1.1.1.1.1 129 STRING Eth1
2 1579051725 .1.3.4.1.2.1.1.1.1.1.1 130 STRING Eth2
3 1579051725 .1.3.4.1.2.1.1.1.1.1.1 128 STRING AA1
4 1579051725 .1.3.4.1.2.1.1.1.1.1.1 130 STRING Eth2
If you don't need all of the columns in the dictionary data, keys can be removed from it with <dict>.pop(<key>). Or you can drop columns from any dataframe created from the data, as in the sketch below.
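For example (a sketch using the master_dict built above):

import pandas as pd

master_dict.pop('SyntaxName')           # remove a column from the dictionary
df = pd.DataFrame.from_dict(master_dict)
df = df.drop(columns=['InterfaceNum'])  # or drop a column from the dataframe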
Groupings in compound regular expressions
This is the expression the function evaluates when all of its parameters are left at their default values.
expr = r"""
TIMESTAMP:\s(\d+) # date - TimeStamp
| # ** OR **
((?:\.\d+)+) # num - InterfaceNum
\.(\d+)\s=\s # num2 - IndexNum
(\w+):\s # syntax - SyntaxName
(\w+) # counter - Counter
"""
In the regular expression above, there are two alternative subexpressions separated by the OR operator, |. These alternatives match either a line of timestamp data or a line of device data. Within each subexpression are groupings that capture specific pieces of the string data. A match group is created by putting parentheses, (...), around a subexpression; the syntax for non-grouping parentheses is (?:...).
No matter which alternative subexpression matches, the same number of match groups is returned per successful call to re.findall(). Maybe a bit counterintuitive, but that is just how it works.
However, this feature does make it easy to write code to extract which fields of the match you've captured since you know the positions the groups should be at regardless of the subexpression matched:
[<tstamp>, <num>, <num2>, <syntax>, <counter>]
# ^expr1^ ^.............expr2..............^
And since we have a predictable number of match groups regardless of which subexpression matches, it enables a pattern of looping that can be applied in many scenarios. By testing whether single match groups are empty or not, we know which branch within the loop to take to process the data for whichever subexpression got the hit.
if tstamp:
    # First expression hit.
elif num:
    # Second alt expression hit.
When the expression matches against the line of text that has the timestamp, the first subexpression hits, and its groups will be populated.
>>> re.findall(expr, "TIMESTAMP: 1579051725 20100114-202845", re.VERBOSE)
[('1579051725', '', '', '', '')]
Here, the first grouping from the expression is filled in and the other groups are blank. The other groupings belong to the other subexpression.
Now when the expression matches against the first line of device data, the second subexpression gets a hit, and its groups are populated. The timestamp groups are blank.
>>> re.findall(expr, ".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1", re.VERBOSE)
[('', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1')]
And finally, when neither subexpression matches, then the entire expression doesn't get a hit. In this case we get an empty list.
>>> re.findall(expr, "ifTb: name-nam-na", re.VERBOSE)
[]
>>>
For contrast, here's the expression without verbose syntax and documentation:
expr = r"TIMESTAMP:\s(\d+)|((?:\.\d+)+)\.(\d+)\s=\s(\w+):\s(\w+)"
Upvotes: 1
Reputation:
Please use the Python "re" package for regular expressions in Python. This package makes regular expressions easy to work with, and it has various functions you can use to achieve what you need. Use this link to read the documentation: https://docs.python.org/3/library/re.html#module-contents
There is a method, re.Pattern.match(), which can be used to match patterns the way you need; try it out, as in the sketch below.
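For example, a minimal sketch of compiling a pattern and using re.Pattern.match() on one line of the question's data (the pattern itself is illustrative):

import re

# match() anchors at the beginning of the string.
pattern = re.compile(r"((?:\.\d+)+)\.(\d+) = (\w+): (\w+)")
m = pattern.match(".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1")
if m:
    num, num2, syntax, counter = m.groups()
    print(num, num2, syntax, counter)  # .1.3.4.1.2.1.1.1.1.1.1 128 STRING AA1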
Upvotes: 1