Matthew
Matthew

Reputation: 189

Read Tab-Delimited Text File with Multiple Tables, Python/Pandas

I have a tab-delimited text file that contains multiple tables. There are 13 known tables and each table has a set number of know columns (different number of columns for each table). However, each table may have anywhere from 1 to 10,000 records, and the number of records for each table is not known. I am trying to use Python/Pandas to read that text file and create multiple data frames. The file looks like the snippet I have shown below. Tables will always start with %T, Column headers will always start with %F and Records will always start with %R. Note that there is also always non-table information on the first few lines of the file and the number of rows of unrelated text varies from file to file.

Random Text[tab]More Random Text[tab]Even More Random Text
And Still More Random Text[tab]Text[tab]Text
%T[tab]TABLE_1_NAME
%F[tab]COL_1.1[tab]COL_1.2[tab]COL_1.3
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%T[tab]TABLE_2_NAME
%F[tab]COL_2.1[tab]COL_2.2[tab]COL_2.3[tab]COL_2.4
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
...and so on...
%E

Any ideas how I could get multiple data frames from this file?

Upvotes: 1

Views: 105

Answers (3)

Mark Tolonen
Mark Tolonen

Reputation: 178045

Parse the table name, column names, and data into a structure that can be traversed to make the data frames:

import csv
import pandas as pd

tables = {}
with open('input.tsv') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row_type, *cols in reader:
        if row_type == '%T':  # Table name
            name = cols[0]    # Remember name  for %F and %R
            tables[name] = {} # Add table name and empty dict
        elif row_type == '%F':  # Column names
            tables[name]['header'] = cols  # Update dict with header
            tables[name]['rows'] = []      # and empty rows
        elif row_type == '%R':  # Row data
            tables[name]['rows'].append(cols)  # Add row data

# Create the dataframes
dataframes = {}
for key, value in tables.items():
    dataframes[key] = pd.DataFrame(dict(zip(value['header'], value['rows'])))

# Display
for name, df in dataframes.items():
    print(name)
    print(df)
    print()

Output:

TABLE_1_NAME
  COL_1.1 COL_1.2 COL_1.3
0   Value   Value   Value
1   Value   Value   Value
2   Value   Value   Value

TABLE_2_NAME
  COL_2.1 COL_2.2 COL_2.3 COL_2.4
0   Value   Value   Value   Value
1   Value   Value   Value   Value
2   Value   Value   Value   Value
3   Value   Value   Value   Value

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195573

I'm assuming the [tab] is tab character \t:

from io import StringIO

import pandas as pd

data = r"""Random Text More Random Text    Even More Random Text
And Still More Random Text  Text Text
%T  TABLE_1_NAME
%F  COL_1.1  COL_1.2  COL_1.3
%R  Value    Value  Value
%R  Value    Value  Value
%R  Value    Value  Value
%T  TABLE_2_NAME
%F  COL_2.1  COL_2.2  COL_2.3  COL_2.4
%R  Value    Value  Value    Value
%R  Value    Value  Value    Value
%R  Value    Value  Value    Value
%R  Value    Value  Value    Value
%R  Value    Value  Value    Value
%E"""

tables = {}
for line in map(str.strip, data.splitlines()):
    if line.startswith("%T"):
        current_table = list()
        tables[line.split(maxsplit=1)[-1]] = current_table
    elif line.startswith(("%F", "%R")):
        current_table.append(line.split(maxsplit=1)[-1])

for k, v in tables.items():
    df = pd.read_csv(StringIO("\n".join(v)), sep="\t")
    print("TABLE =", k)
    print(df)
    print("-" * 80)

Prints:

TABLE = TABLE_1_NAME
  COL_1.1  COL_1.2  COL_1.3
0     Value    Value  Value
1     Value    Value  Value
2     Value    Value  Value
--------------------------------------------------------------------------------
TABLE = TABLE_2_NAME
  COL_2.1  COL_2.2  COL_2.3  COL_2.4
0     Value    Value  Value    Value
1     Value    Value  Value    Value
2     Value    Value  Value    Value
3     Value    Value  Value    Value
4     Value    Value  Value    Value
--------------------------------------------------------------------------------

Upvotes: 0

inspectorG4dget
inspectorG4dget

Reputation: 114035

First, split the file into multiple pandas-readable files. Then load each file:

import itertools

def splitFile(infilepath):
    outfileNum = itertools.count(1)
    with open(infilepath) as infile:
        for line in infile:
            if not line.startswith("%T"): continue
        outfilepath = f"table_{next(outfileNum)}.tsv"
        outfile = open(outfilepath, 'w')
        outfile.write(line)
        for line in infile:
            if line.startswith("%T"):
                outfile.close()
                outfile = open(f"table_{next(outfileNum)}.tsv", 'w')

            outfile.write(line)

Now that you have a bunch of tsv files, you can load them all in a list of dataframes:

import glob
import pandas as pd

dfs = [pd.read_csv(fname, delimiter='\t') for fname in glob.glob("*.tsv")]

Upvotes: 1

Related Questions