Reputation: 189
I have a tab-delimited text file that contains multiple tables. There are 13 known tables and each table has a set number of know columns (different number of columns for each table). However, each table may have anywhere from 1 to 10,000 records, and the number of records for each table is not known. I am trying to use Python/Pandas to read that text file and create multiple data frames. The file looks like the snippet I have shown below. Tables will always start with %T, Column headers will always start with %F and Records will always start with %R. Note that there is also always non-table information on the first few lines of the file and the number of rows of unrelated text varies from file to file.
Random Text[tab]More Random Text[tab]Even More Random Text
And Still More Random Text[tab]Text[tab]Text
%T[tab]TABLE_1_NAME
%F[tab]COL_1.1[tab]COL_1.2[tab]COL_1.3
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%T[tab]TABLE_2_NAME
%F[tab]COL_2.1[tab]COL_2.2[tab]COL_2.3[tab]COL_2.4
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
...and so on...
%E
Any ideas how I could get multiple data frames from this file?
Upvotes: 1
Views: 105
Reputation: 178045
Parse the table name, column names, and data into a structure that can be traversed to make the data frames:
import csv
import pandas as pd
tables = {}
with open('input.tsv') as infile:
reader = csv.reader(infile, delimiter='\t')
for row_type, *cols in reader:
if row_type == '%T': # Table name
name = cols[0] # Remember name for %F and %R
tables[name] = {} # Add table name and empty dict
elif row_type == '%F': # Column names
tables[name]['header'] = cols # Update dict with header
tables[name]['rows'] = [] # and empty rows
elif row_type == '%R': # Row data
tables[name]['rows'].append(cols) # Add row data
# Create the dataframes
dataframes = {}
for key, value in tables.items():
dataframes[key] = pd.DataFrame(dict(zip(value['header'], value['rows'])))
# Display
for name, df in dataframes.items():
print(name)
print(df)
print()
Output:
TABLE_1_NAME
COL_1.1 COL_1.2 COL_1.3
0 Value Value Value
1 Value Value Value
2 Value Value Value
TABLE_2_NAME
COL_2.1 COL_2.2 COL_2.3 COL_2.4
0 Value Value Value Value
1 Value Value Value Value
2 Value Value Value Value
3 Value Value Value Value
Upvotes: 0
Reputation: 195573
I'm assuming the [tab]
is tab character \t
:
from io import StringIO
import pandas as pd
data = r"""Random Text More Random Text Even More Random Text
And Still More Random Text Text Text
%T TABLE_1_NAME
%F COL_1.1 COL_1.2 COL_1.3
%R Value Value Value
%R Value Value Value
%R Value Value Value
%T TABLE_2_NAME
%F COL_2.1 COL_2.2 COL_2.3 COL_2.4
%R Value Value Value Value
%R Value Value Value Value
%R Value Value Value Value
%R Value Value Value Value
%R Value Value Value Value
%E"""
tables = {}
for line in map(str.strip, data.splitlines()):
if line.startswith("%T"):
current_table = list()
tables[line.split(maxsplit=1)[-1]] = current_table
elif line.startswith(("%F", "%R")):
current_table.append(line.split(maxsplit=1)[-1])
for k, v in tables.items():
df = pd.read_csv(StringIO("\n".join(v)), sep="\t")
print("TABLE =", k)
print(df)
print("-" * 80)
Prints:
TABLE = TABLE_1_NAME
COL_1.1 COL_1.2 COL_1.3
0 Value Value Value
1 Value Value Value
2 Value Value Value
--------------------------------------------------------------------------------
TABLE = TABLE_2_NAME
COL_2.1 COL_2.2 COL_2.3 COL_2.4
0 Value Value Value Value
1 Value Value Value Value
2 Value Value Value Value
3 Value Value Value Value
4 Value Value Value Value
--------------------------------------------------------------------------------
Upvotes: 0
Reputation: 114035
First, split the file into multiple pandas-readable files. Then load each file:
import itertools
def splitFile(infilepath):
outfileNum = itertools.count(1)
with open(infilepath) as infile:
for line in infile:
if not line.startswith("%T"): continue
outfilepath = f"table_{next(outfileNum)}.tsv"
outfile = open(outfilepath, 'w')
outfile.write(line)
for line in infile:
if line.startswith("%T"):
outfile.close()
outfile = open(f"table_{next(outfileNum)}.tsv", 'w')
outfile.write(line)
Now that you have a bunch of tsv files, you can load them all in a list of dataframes:
import glob
import pandas as pd
dfs = [pd.read_csv(fname, delimiter='\t') for fname in glob.glob("*.tsv")]
Upvotes: 1