MK1986
MK1986

Reputation: 86

pandas.read_csv seperator with Regex

I'm struggling how to parse a text file into a pandas dataframe. I think to use pandas.read_csv(sep='') but I can not figure out the right configuration since the file uses blanks as seperator, but also contains text with separator

A sample data rows looks like this

<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas

which is a line for this table

type time part ids message
<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas

Upvotes: 1

Views: 66

Answers (2)

Andrej Kesely
Andrej Kesely

Reputation: 195653

You can try to use re module to parse the data:

import re

import pandas as pd

s = '<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas'

pat = re.compile(r"(\S+) (\S+) (\S+) (\[.*?\]) (.*)")

df = pd.DataFrame(pat.findall(s), columns=["type", "time", "part", "ids", "message"])
print(df)

Prints:

    type                 time part                  ids                   message
0  <123>  2022-12-08T14:00:00  tag  [id="451" tid="145]  text message with commas

Upvotes: 1

Xanth
Xanth

Reputation: 11

I would propose not to use the read_csv function to parse this text file (as I believe it is a rather specific use case where a blank space is to be considered only sometimes as a separator).

I wrote a small sample that shows how to parse the file by programmatically reading line by line and parsing based on the general logical structure of your data. Basically taking advantage of the square brackets that the "ids" field has.

Here is the code sample:

import pandas as pd

data_list = []
for line in open("example.csv","r"):
    # Separate the line by spaces into a list
    data = {}
    line = line.split(" ")

    # The first 3 elements correspond to "type", "time" and "part"
    data["type"] = line[0]
    data["time"] = line[1]
    data["part"] = line[2]

    # Then from the third position onward, concatenate each element until we find a closing square bracket
    # We will call this the id's field
    data['ids'] = ""
    for i in range(3,len(line)):
        data["ids"] = " ".join((data["ids"],line[i]))
        if line[i][-1] == "]":
            break

    # Finally, we will concatenate the rest of the elements into the "message" field
    data["message"] = " ".join(line[i+1:])

    # And we will append the data to a list
    data_list.append(data)

# Now we will create a pandas dataframe from the list
df = pd.DataFrame(data_list)

print(df)

Upvotes: 1

Related Questions