r_e

Reputation: 242

Reading data using regular expressions

For my project, I need to read a file, match its contents against my constants, and, once they match, store the data in a dictionary. I will show a sample of my data and what I have so far below.

My data:

TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2

This data has 5 important parts I want to gather:

  1. Date right after timestamp: 1579051725

  2. Num (first part of the numbers, up until 128, 129, 130, etc.): .1.3.4.1.2.1.1.1.1.1.1

  3. Num2 (second part): 128 or 129 or 130 or others in my larger data set

  4. Syntax: In this case it is named: STRING

  5. Counter: In this case they are strings: AA1, Eth1, or Eth2

I also have (need to have) a constant Num as a dictionary within the program that holds the value above, along with a constant Syntax.

I want to read through the data file,

  1. If Num matches the constant I have within the program,

  2. grab Num2,

  3. check if Syntax matches the constant syntax within the program,

  4. grab Counter.

When I say grab, I mean put that data under the corresponding dictionary key.

In short, I want to read through the data file, split out 5 variables from it, match 2 of them against constant dictionary values, and store 3 variables (including the time) in the dictionary.

Right now I have trouble splitting the data. I can split everything except Num and Num2. I am also not sure how to create the constant dictionaries or how to store values under them.

I would love to use a regular expression instead of if statements, but I could not figure out what symbols to use, since the data includes many dots within the words.

I have the following so far:

constant_dic1 = {[".1.3.4.1.2.1.1.1.1.1.1"]["STRING" ]}
data_cols = {'InterfaceNum':[],"IndexNum":[],"SyntaxName":[],"Counter":[],"TimeStamp":[]}
fileN = args.File_Name
with open (fileN, 'r') as f:

    for lines in f:
        if lines.startswith('.'):
            if ': ' in lines:
                lines=lines.split("=")
                first_part = lines[0].split()
                second_part = lines[1].split()
                for i in first_part:
                    f_f = i.split("{}.{}.{}.{}.{}.{}.{}.{}.{}.{}.{}.")
                print (f_f[0])

Once I run the program, I receive the error "TypeError: list indices must be integers or slices, not str".

When I comment out the dictionary part, the output is Num together with Num2; the string does not get split, so it does not print just the Num part.

Any help is appreciated! If there's any other source, please let me know below. And please let me know if the question needs any updates rather than downvoting. Thanks!

UPDATED CODE

import pandas as pd
import io
import matplotlib
matplotlib.use('TkAgg') # backend option for matplotlib #TkAgg #Qt4Agg #Qt5Agg
import matplotlib.pyplot as plt
import re # regular expression
import argparse # for optional arguments
parser = argparse.ArgumentParser()
parser.add_argument('File_Name', help="Enter the file name | At least one file is required to graph")
args=parser.parse_args()

data_cols = {'InterfaceNum':[],"IndexNum":[],"SyntaxName":[],"Counter":[],"TimeStamp":[]}
fileN = args.File_Name
input_data = fileN
expr = r"""
    TIMESTAMP:\s(\d+)           # date    - TimeStamp
    |                           # ** OR **
    ((?:\.\d+)+)                # num     - InterfaceNum
        \.(\d+)\s=\s            # num2    - IndexNum
            (\w+):\s            # syntax  - SyntaxName
                (\w+)           # counter - Counter
    """
expr = re.compile(expr, re.VERBOSE)
data = {}
keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']


with io.StringIO(input_data) as data_file:
    for line in data_file:
        try:
            find_data = expr.findall(line)[0]
            vals = [date, num, num2, syntax, counter] = list(find_data)
            if date:
                cur_date = date
                data[cur_date] = {k: [] for k in keys}
            elif num:
                vals[0] = cur_date
                for k, v in zip(keys, vals):
                    data[cur_date][k].append(v)
        except IndexError:
            # expr.findall(...)[0] indexes an empty list when there's no
            # match.
            pass

data_frames = [pd.DataFrame.from_dict(v) for v in data.values()]

print(data_frames[0])

ERROR I GET

Traceback (most recent call last):
  File "v1.py", line 47, in <module>
    print(data_frames[0])
IndexError: list index out of range


NEW DATA

TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2
.1.2.3.4.5.6.7.8.9.10.11.131 = INT32: A

UPDATED CODE (v2)

import pandas as pd
import io
import matplotlib
import re # regular expression

file = r"/home/rusif.eyvazli/Python_Projects/network-switch-packet-loss/s_data.txt"



def get_dev_data(file_path, timestamp=None, iface_num=None, idx_num=None, 
                 syntax=None, counter=None):

    timestamp = timestamp or r'\d+'
    iface_num = iface_num or r'(?:\.\d+)+'
    idx_num   = idx_num   or r'\d+'
    syntax    = syntax    or r'\w+'
    counter   = counter   or r'\w+'

#     expr = r"""
#         TIMESTAMP:\s({timestamp})   # date    - TimeStamp
#         |                           # ** OR **
#         ({iface_num})               # num     - InterfaceNum
#             \.({idx_num})\s=\s      # num2    - IndexNum
#                 ({syntax}):\s       # syntax  - SyntaxName
#                     ({counter})     # counter - Counter
#         """

    expr = r"TIMESTAMP:\s(\d+)|((?:\.\d+)+)\.(\d+)\s=\s(\w+):\s(\w+)"

#   expr = re.compile(expr, re.VERBOSE)

    expr = re.compile(expr)

    rows = []
    keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']
    cols = {k: [] for k in keys}

    with open(file_path, 'r') as data_file:
        for line in data_file:
            try:

                find_data = expr.findall(line)[0]
                vals = [tstamp, num, num2, sntx, ctr] = list(find_data)
                if tstamp:
                    cur_tstamp = tstamp
                elif num:
                    vals[0] = cur_tstamp
                    rows.append(vals)
                    for k, v in zip(keys, vals):
                         cols[k].append(v)
            except IndexError:
                # expr.findall(line)[0] indexes an empty list when no match.
                pass

    return rows, cols

const_num    = '.1.3.4.1.2.1.1.1.1.1.1'
const_syntax = 'STRING'

result_5 = get_dev_data(file)

# Use the results of the first dict retrieved to initialize the master
# dictionary.
master_dict = result_5[1]

df = pd.DataFrame.from_dict(master_dict)

df = df.loc[(df['InterfaceNum'] == '.1.2.3.4.5.6.7.8.9.10.11') & (df['SyntaxName'] == 'INT32' )] 

print(f"\n{df}")

OUTPUT

    TimeStamp              InterfaceNum IndexNum SyntaxName Counter
3  1579051725  .1.2.3.4.5.6.7.8.9.10.11      131      INT32       A

Upvotes: 0

Views: 752

Answers (2)

Todd

Reputation: 5395

Parsing raw file input using Regular Expressions

The function below is an example of how to parse raw file input with regular expressions.

The regular expression capture groups are looped over to build records. This is a reusable pattern that can be applied in many cases. There's more info on how it works in the 'Groupings in compound regular expressions' section.

The function will filter records that match the parameter values. With the parameters left at their defaults, the function returns all the rows of data.

def get_dev_data(file_path, timestamp=None, iface_num=None, idx_num=None, 
                 syntax=None, counter=None):
    timestamp = timestamp or r'\d+'
    iface_num = iface_num or r'(?:\.\d+)+'
    idx_num   = idx_num   or r'\d+'
    syntax    = syntax    or r'\w+'
    counter   = counter   or r'\w+'
    expr = rf"""
        TIMESTAMP:\s({timestamp})   # date    - TimeStamp
        |                           # ** OR **
        ({iface_num})               # num     - InterfaceNum
            \.({idx_num})\s=\s      # num2    - IndexNum
                ({syntax}):\s       # syntax  - SyntaxName
                    ({counter})     # counter - Counter
        """
    expr = re.compile(expr, re.VERBOSE)
    rows = []
    keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']
    cols = {k: [] for k in keys}

    with open(file_path, 'r') as data_file:
        for line in data_file:
            try:
                find_data = expr.findall(line)[0]
                vals = [tstamp, num, num2, sntx, ctr] = list(find_data)
                if tstamp:
                    cur_tstamp = tstamp
                elif num:
                    vals[0] = cur_tstamp
                    rows.append(vals)
                    for k, v in zip(keys, vals):
                        cols[k].append(v)
            except IndexError:
                # expr.findall(line)[0] indexes an empty list when no match.
                pass
    return rows, cols

A tuple is returned. The first item, rows, is a list of rows of data in simple format; the second item, cols, is a dictionary keyed by column name with a list of row data per key. Both contain the same data and are each digestible by Pandas with pd.DataFrame.from_records() or pd.DataFrame.from_dict() respectively.
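
For instance, here's a minimal sketch of consuming both return values (assuming pandas is imported as pd and the input file shown further below):

rows, cols = get_dev_data('/test/inputdata.txt')

# 'rows' is a list of row lists; 'cols' is a dict of column lists.
df_from_rows = pd.DataFrame.from_records(rows, columns=list(cols))
df_from_cols = pd.DataFrame.from_dict(cols)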

filtering example

This shows how records can be filtered using the function parameters. I think the last one, result_4, fits the description in the question. Assume that iface_num is set to your const_num, and syntax to your const_syntax values. Only records that match will be returned.

if __name__ == '__main__':

    file = r"/test/inputdata.txt"

    result_1 = get_dev_data(file)[0]
    result_2 = get_dev_data(file, counter='Eth2')[0]
    result_3 = get_dev_data(file, counter='Eth2|AA1')[0]
    result_4 = get_dev_data(file,
                           iface_num='.1.3.4.1.2.1.1.1.1.1.1', syntax='STRING')[0]

    for var_name, var_val in zip(['result_1', 'result_2', 'result_3', 'result_4'],
                                 [ result_1,   result_2,   result_3,   result_4]):

        print(f"{var_name} = {var_val}")

Output

result_1 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '129', 'STRING', 'Eth1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_2 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_3 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_4 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '129', 'STRING', 'Eth1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]

Using the first item of the returned tuple, column data can be accessed from the records by offset. For instance, TimeStamp would be accessed like first_item[0][0] - first row, first column. Or the rows can be converted into a dataframe and accessed that way.
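
As a quick sketch of offset access (assuming the input file below):

rows = get_dev_data('/test/inputdata.txt')[0]

first_timestamp = rows[0][0]   # first row, first column  -> '1579051725'
first_counter   = rows[0][4]   # first row, fifth column  -> 'AA1'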

Input file /test/inputdata.txt

TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2

Convert row data into a Pandas dataframe

The first tuple item in the output of the function will be rows of data corresponding to columns we've defined. This format can be converted into a Pandas dataframe using pd.DataFrame.from_records():

>>> row_data = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1']]
>>>
>>> column_names = ['TimeStamp', 'InterfaceNum', 'IndexNum', 
...                 'SyntaxName', 'Counter']
>>>
>>> pd.DataFrame.from_records(row_data, columns=column_names)
    TimeStamp            InterfaceNum IndexNum SyntaxName Counter
0  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
>>> 

Convert column data into a Pandas dataframe

The function also returns, as the second item of the tuple, a dictionary containing the same data, which can produce the same dataframe using pd.DataFrame.from_dict().

>>> col_data = {'TimeStamp': ['1579051725'], 
...             'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1'], 
...             'IndexNum': ['128'], 'SyntaxName': ['STRING'], 
...             'Counter': ['AA1']}
>>> 
>>> pd.DataFrame.from_dict(col_data)
    TimeStamp            InterfaceNum IndexNum SyntaxName Counter
0  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
>>> 

Dictionary example

Here are a few examples of filtering file data and initializing a persistent dictionary, then filtering for more data and adding it to that dictionary. I think this is also close to what's described in the question.

const_num    = '.1.3.4.1.2.1.1.1.1.1.1'
const_syntax = 'STRING'

result_5 = get_dev_data(file, iface_num=const_num, syntax=const_syntax)

# Use the results of the first dict retrieved to initialize the master
# dictionary.
master_dict = result_5[1]

print(f"master_dict = {master_dict}")

result_6 = get_dev_data(file, counter='Eth2|AA1')

# Add more records to the master dictionary.
for k, v in result_6[1].items():
    master_dict[k].extend(v)

print(f"master_dict = {master_dict}")

df = pd.DataFrame.from_dict(master_dict)

print(f"\n{df}")

Output

master_dict = {'TimeStamp': ['1579051725', '1579051725', '1579051725'], 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1'], 'IndexNum': ['128', '129', '130'], 'SyntaxName': ['STRING', 'STRING', 'STRING'], 'Counter': ['AA1', 'Eth1', 'Eth2']}
master_dict = {'TimeStamp': ['1579051725', '1579051725', '1579051725', '1579051725', '1579051725'], 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1'], 'IndexNum': ['128', '129', '130', '128', '130'], 'SyntaxName': ['STRING', 'STRING', 'STRING', 'STRING', 'STRING'], 'Counter': ['AA1', 'Eth1', 'Eth2', 'AA1', 'Eth2']}

    TimeStamp            InterfaceNum IndexNum SyntaxName Counter
0  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
1  1579051725  .1.3.4.1.2.1.1.1.1.1.1      129     STRING    Eth1
2  1579051725  .1.3.4.1.2.1.1.1.1.1.1      130     STRING    Eth2
3  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
4  1579051725  .1.3.4.1.2.1.1.1.1.1.1      130     STRING    Eth2

If all columns of the dictionary data aren't needed, keys in it can be dispensed with using <dict>.pop(<key>). Or you could drop columns from any dataframe created off the data.
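
For example, a small sketch (dropping the SyntaxName column here is just an illustration):

# Remove a column from the dictionary before building a dataframe...
master_dict.pop('SyntaxName', None)

# ...or build the dataframe first and drop the column from it.
df = df.drop(columns=['SyntaxName'])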


Groupings in compound regular expressions

This is the expression that's evaluated in the function when all its parameters are left at their default values.

expr = r"""
    TIMESTAMP:\s(\d+)           # date    - TimeStamp
    |                           # ** OR **
    ((?:\.\d+)+)                # num     - InterfaceNum
        \.(\d+)\s=\s            # num2    - IndexNum
            (\w+):\s            # syntax  - SyntaxName
                (\w+)           # counter - Counter
    """

In the regular expression above, there are two alternative subexpressions separated by the OR operator, |. These alternatives match either a line of timestamp data or a line of device data. Within these subexpressions are groupings to capture specific pieces of the string data. Match groups are created by putting parentheses, (...), around a subexpression; the syntax for non-capturing parentheses is (?:...).
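
A quick interpreter illustration of the difference, using the InterfaceNum subexpression:

>>> re.findall(r"((?:\.\d+)+)", ".1.2.3")  # outer group captures, inner (?:...) doesn't
['.1.2.3']
>>> re.findall(r"(\.\d+)+", ".1.2.3")      # a repeated capturing group keeps only its last repetition
['.3']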

No matter which alternative subexpression matches, there will still be the same number of match groups returned per successful call to re.findall(). Maybe a bit counterintuitive, but this is just how it works.

However, this feature makes it easy to write code that extracts the fields you've captured, since you know the positions the groups will be at regardless of which subexpression matched:

     [<tstamp>, <num>, <num2>, <syntax>, <counter>]
     # ^expr1^  ^.............expr2..............^

And since we have a predictable number of match groups regardless of which subexpression matches, it enables a pattern of looping that can be applied in many scenarios. By testing whether single match groups are empty or not, we know which branch within the loop to take to process the data for whichever subexpression got the hit.

        if tstamp:
            # First expression hit.
        elif num:
            # Second alt expression hit.

When the expression matches against the line of text that has the timestamp, the first subexpression hits, and its groups will be populated.

>>> re.findall(expr, "TIMESTAMP: 1579051725 20100114-202845", re.VERBOSE)
[('1579051725', '', '', '', '')]

Here, the first grouping from the expression is filled in and the other groups are blank. The other groupings belong to the other subexpression.

Now when the expression matches against the first line of device data, the second subexpression gets a hit, and its groups are populated. The timestamp groups are blank.

>>> re.findall(expr, ".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1", re.VERBOSE)
[('', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1')]

And finally, when neither subexpression matches, then the entire expression doesn't get a hit. In this case we get an empty list.

>>> re.findall(expr, "ifTb: name-nam-na", re.VERBOSE)
[]
>>> 

For contrast, here's the expression without verbose syntax and documentation:

expr = r"TIMESTAMP:\s(\d+)|((?:\.\d+)+)\.(\d+)\s=\s(\w+):\s(\w+)"

Upvotes: 1

user13053610

Reputation:

Please use the Python "re" package for regular expressions in Python. This package makes it easy to use regular expressions; it provides various functions you can use to achieve what you need. Use this link to read the documentation: https://docs.python.org/3/library/re.html#module-contents

There is a method, re.Pattern.match(), which can be used to match patterns the way you need; try it out.
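
For example, a minimal sketch against the question's timestamp line:

>>> import re
>>> pattern = re.compile(r"TIMESTAMP:\s(\d+)")
>>> m = pattern.match("TIMESTAMP: 1579051725 20100114-202845")
>>> m.group(1)
'1579051725'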

Upvotes: 1
