Reputation: 323
I have a text file and want to turn it into a pandas dataframe. Unfortunately, the structure of the text file makes this quite hard for me.
The text file looks like this:
======> EVENT:130
a = 1.4
b = 2.5
c = 1.7
======> EVENT:698
a = 1.2
b = 4.3
c = 2.4
======> EVENT:1055
a = 3.4
b = 4.5
c = 2.3
The number after "EVENT:" is random. I would like to turn this text file into a pandas dataframe with the following structure:
a b c
1.4 2.5 1.7
1.2 4.3 2.4
3.4 4.5 2.3
So I want every event in one row of the dataframe. How can I delete the "======> EVENT:xxxx" lines? Or how can I use these lines to tell pd.read_csv() to start a new row in the df? And is there a way to make pd.read_csv() recognize that the column name is on the left side of "=" and the entry for that column is on the right?
Upvotes: 1
Views: 521
Reputation: 2918
You can create a custom file-like object (something similar to an adapter) for the pd.read_csv function.
Benefits: fast and memory efficient.
Here is my implementation:
import pandas as pd

# custom file-like object
# it must define __iter__() and read() methods
class EventFile:
    def __init__(self, path):
        self.path = path
        self.fd = None
        self.lineno = 0

    # use as context manager to close file at the end of processing
    def __enter__(self):
        self.fd = open(self.path)
        return self

    def __exit__(self, *args, **kwargs):
        self.fd.close()

    # just return lines like standard open() function
    def __iter__(self):
        while True:
            line = self.read()
            if not line:
                return
            yield line

    # actually returns csv lines
    def read(self, size=None):
        columns = {}  # our buffer for future columns {<name>: <value>}
        while True:
            # read line from the underlying file
            line = self.fd.readline()
            # handle EOF
            if not line:
                return self._build_csv_line(columns.values()) if columns else ""
            # skip the very first row
            if "EVENT" in line and not columns:
                continue
            # flush the buffer
            if "EVENT" in line:
                self.lineno += 1
                csv_line = ""
                if self.lineno == 1:
                    # emit header
                    csv_line += self._build_csv_line(columns.keys())
                csv_line += self._build_csv_line(columns.values())
                return csv_line
            name, value = line.split("=")
            columns[name.strip()] = value.strip()

    def _build_csv_line(self, values):
        return ",".join(values) + "\n"

# Usage
with EventFile("test.csv") as f:
    data = pd.read_csv(f)
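As a quick sanity check (a sketch, assuming the question's sample data is saved as test.csv), you can iterate the adapter directly to see the CSV lines it feeds to pd.read_csv: a header taken from the first event, then one row per event.
# iterate the adapter to inspect the CSV lines it generates
with EventFile("test.csv") as f:
    for line in f:
        print(line, end="")
# with the sample data from the question, this should print:
# a,b,c
# 1.4,2.5,1.7
# 1.2,4.3,2.4
# 3.4,4.5,2.3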
Upvotes: 1
Reputation: 153460
Try this:
from io import StringIO
import pandas as pd
txtfile = StringIO("""======> EVENT:130
a = 1.4
b = 2.5
c = 1.7
======> EVENT:698
a = 1.2
b = 4.3
c = 2.4
======> EVENT:1055
a = 3.4
b = 4.5
c = 2.3""")
df = pd.read_csv(txtfile, sep=r'\s\s+', header=None, engine='python')
dfi = df[0].str.split(' = ', expand=True)
dfi = dfi[dfi[0].isin([*'abc'])]
df_out = dfi.set_index([dfi.groupby(0).cumcount(), 0])[1].unstack()
print(df_out)
Output:
0 a b c
0 1.4 2.5 1.7
1 1.2 4.3 2.4
2 3.4 4.5 2.3
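Since sep=r'\s\s+' never matches in this data, each line ends up in a single column; the split and filter then leave a long frame of name/value pairs that unstack pivots into the final shape. If it helps to follow the reshape, the intermediate dfi should look roughly like this (a sketch based on the sample data):
print(dfi)
#     0    1
# 1   a  1.4
# 2   b  2.5
# 3   c  1.7
# 5   a  1.2
# 6   b  4.3
# 7   c  2.4
# 9   a  3.4
# 10  b  4.5
# 11  c  2.3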
Upvotes: 1
Reputation: 195408
You can make a custom parser using the re module. For example:
import re
import pandas as pd
with open("your_file.txt", "r") as f_in:
    data = f_in.read()

all_data = []
for event in re.split(r".*EVENT.*", data):
    if not event.strip():
        continue
    d = {}
    for k, v in re.findall(r"\s*(.*)\s*=\s*(.*)\s*", event):
        d[k] = v
    all_data.append(d)
df = pd.DataFrame(all_data)
print(df)
Prints:
a b c
0 1.4 2.5 1.7
1 1.2 4.3 2.4
2 3.4 4.5 2.3
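Note that the values captured by re.findall are strings, so the resulting columns have object dtype. If you need numeric columns, a simple cast should do (assuming every value parses as a number):
df = df.astype(float)  # convert the parsed string values to floats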
Upvotes: 3