Reputation: 86
I'm struggling how to parse a text file into a pandas dataframe.
I think to use pandas.read_csv(sep='')
but I can not figure out the right configuration since the file uses blanks as seperator, but also contains text with separator
A sample data rows looks like this
<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas
which is a line for this table
type | time | part | ids | message |
---|---|---|---|---|
<123> | 2022-12-08T14:00:00 | tag | [id="451" tid="145] | text message with commas |
Upvotes: 1
Views: 66
Reputation: 195653
You can try to use re
module to parse the data:
import re
import pandas as pd
s = '<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas'
pat = re.compile(r"(\S+) (\S+) (\S+) (\[.*?\]) (.*)")
df = pd.DataFrame(pat.findall(s), columns=["type", "time", "part", "ids", "message"])
print(df)
Prints:
type time part ids message
0 <123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas
Upvotes: 1
Reputation: 11
I would propose not to use the read_csv
function to parse this text file (as I believe it is a rather specific use case where a blank space is to be considered only sometimes as a separator).
I wrote a small sample that shows how to parse the file by programmatically reading line by line and parsing based on the general logical structure of your data. Basically taking advantage of the square brackets that the "ids" field has.
Here is the code sample:
import pandas as pd
data_list = []
for line in open("example.csv","r"):
# Separate the line by spaces into a list
data = {}
line = line.split(" ")
# The first 3 elements correspond to "type", "time" and "part"
data["type"] = line[0]
data["time"] = line[1]
data["part"] = line[2]
# Then from the third position onward, concatenate each element until we find a closing square bracket
# We will call this the id's field
data['ids'] = ""
for i in range(3,len(line)):
data["ids"] = " ".join((data["ids"],line[i]))
if line[i][-1] == "]":
break
# Finally, we will concatenate the rest of the elements into the "message" field
data["message"] = " ".join(line[i+1:])
# And we will append the data to a list
data_list.append(data)
# Now we will create a pandas dataframe from the list
df = pd.DataFrame(data_list)
print(df)
Upvotes: 1