thyhmoo

Reputation: 323

Turn txt-file into pandas dataframe

I have a text file and want to turn it into a pandas dataframe. Unfortunately, the structure of the text file makes this quite hard.

The text-file looks like this:

======> EVENT:130
a = 1.4
b = 2.5
c = 1.7
======> EVENT:698
a = 1.2
b = 4.3
c = 2.4
======> EVENT:1055
a = 3.4
b = 4.5
c = 2.3

The number after "EVENT:" is random. I would like to turn this text file into a pandas dataframe with the following structure:

  a    b    c 
1.4  2.5  1.7
1.2  4.3  2.4
3.4  4.5  2.3

So I want every event in one row of the dataframe. How can I delete the "======> EVENT:xxxx" lines? Or how can I use this line to tell pd.read_csv() to start a new row in the dataframe? And is there a way to make pd.read_csv() realize that the left side of "=" is the column name and the right side is the entry for that column?

Upvotes: 1

Views: 521

Answers (3)

Alexander Volkovsky

Reputation: 2918

You can create a custom file-like object (something similar to an adapter) and pass it to the pd.read_csv function.

Benefits: fast and memory efficient.

Here is my implementation:

import pandas as pd


# custom file-like object
# it must define __iter__() and read() methods
class EventFile:
    def __init__(self, path):
        self.path = path
        self.fd = None
        self.lineno = 0

    # use as context manager to close file at the end of processing
    def __enter__(self):
        self.fd = open(self.path)
        return self

    def __exit__(self, *args, **kwargs):
        self.fd.close()

    # just return lines like standard open() function
    def __iter__(self):
        while True:
            line = self.read()
            if not line:
                return
            yield line

    # actually returns csv lines
    def read(self, size=None):
        columns = {}  # our buffer for future columns {<name>:<value>}
        while True:
            # read line from the underlying file
            line = self.fd.readline()

            # handle EOF
            if not line:
                return self._build_csv_line(columns.values()) if columns else ""

            # skip the first EVENT header (nothing buffered yet)
            if "EVENT" in line and not columns:
                continue

            # flush the buffer
            if "EVENT" in line:
                self.lineno += 1
                csv_line = ""
                
                if self.lineno == 1:
                    # emit header
                    csv_line += self._build_csv_line(columns.keys())

                csv_line += self._build_csv_line(columns.values())
                return csv_line

            name, value = line.split("=")
            columns[name.strip()] = value.strip()

    def _build_csv_line(self, values):
        return ",".join(values) + "\n"


# Usage
with EventFile("test.csv") as f:
    data = pd.read_csv(f)

Upvotes: 1

Scott Boston

Reputation: 153460

Try this:

from io import StringIO
import pandas as pd

txtfile = StringIO("""======> EVENT:130
a = 1.4
b = 2.5
c = 1.7
======> EVENT:698
a = 1.2
b = 4.3
c = 2.4
======> EVENT:1055
a = 3.4
b = 4.5
c = 2.3""")

df = pd.read_csv(txtfile, sep=r'\s\s+', header=None, engine='python')

dfi = df[0].str.split(' = ', expand=True)
dfi = dfi[dfi[0].isin([*'abc'])]
df_out = dfi.set_index([dfi.groupby(0).cumcount(), 0])[1].unstack()
print(df_out)

Output:

0    a    b    c
0  1.4  2.5  1.7
1  1.2  4.3  2.4
2  3.4  4.5  2.3
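The pivot in the last step may look opaque: groupby(0).cumcount() numbers each repeated key so that every (event, key) pair gets a unique index, and unstack() then pivots the keys into columns. A minimal sketch of just that step, on made-up data (the tiny frame below is illustrative, not taken from the file):

```python
import pandas as pd

# two "events" flattened into key/value rows; keys repeat: a, b, a, b
dfi = pd.DataFrame({0: ["a", "b", "a", "b"],
                    1: ["1.4", "2.5", "1.2", "4.3"]})

# cumcount() numbers each occurrence of a key: a -> 0, b -> 0, a -> 1, b -> 1
row = dfi.groupby(0).cumcount()

# index by (occurrence, key), then pivot the keys into columns
out = dfi.set_index([row, 0])[1].unstack()
print(out)
```

Each occurrence number becomes a row and each key becomes a column, which is exactly how the repeated a/b/c blocks collapse into one row per event.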

Upvotes: 1

Andrej Kesely

Reputation: 195408

You can make a custom parser using the re module. For example:

import re
import pandas as pd

with open("your_file.txt", "r") as f_in:
    data = f_in.read()

all_data = []
for event in re.split(r".*EVENT.*", data):
    if not event.strip():
        continue
    d = {}
    for k, v in re.findall(r"\s*(.*)\s*=\s*(.*)\s*", event):
        d[k] = v
    all_data.append(d)

df = pd.DataFrame(all_data)
print(df)

Prints:

     a    b    c
0  1.4  2.5  1.7
1  1.2  4.3  2.4
2  3.4  4.5  2.3
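One caveat that applies to the string-based approaches here: the regex captures strings, so the columns come out as object dtype rather than floats. A small sketch of the conversion, using stand-in values in place of the parsed file:

```python
import pandas as pd

# stand-in for the parsed result: the values are still strings
df = pd.DataFrame([{"a": "1.4", "b": "2.5", "c": "1.7"},
                   {"a": "1.2", "b": "4.3", "c": "2.4"}])

df = df.astype(float)  # cast every column from str to float64
print(df.dtypes)
```

After the cast, numeric operations like df.mean() behave as expected.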

Upvotes: 3
