Reputation: 453
I want to parse a txt file into a pandas DataFrame. The file has info in the following format.
Note that the number of items N is different for each line.
# some useless header lines
< ID1 , rest of the line I am not interested in>
< item1.val1 item1.val2 item1.val3 item2.val1 item2.val2 item2.val3 .... itemN.val1 itemN.val2 itemN.val3>
< ID2 , rest of the line I am not interested in>
< item1.val1 item1.val2 item1.val3 item2.val1 item2.val2 item2.val3 .... itemN.val1 itemN.val2 itemN.val3>
...
At present I am using this approach:

import pandas as pd

file = open(file_path)
line = file.readline()
list_dfs = []
while line:
    if line[0] == '#':          # skip header/comment lines
        line = file.readline()
        continue
    line_id = line.split()[0]   # ID is the first token of the ID line
    line = file.readline()      # the following line holds the triples
    list_dfs.append(process_line_info(line, line_id))
    line = file.readline()
file.close()
all_df = pd.concat(list_dfs)
def process_line_info(txt, line_id) -> pd.DataFrame:
    entries = txt.split()
    if len(entries) % 3 != 0:
        print('Error with line, not a triple, quitting')
        exit(-1)
    num_items = len(entries) // 3   # exact division: length is a multiple of 3
    df = pd.DataFrame(columns=['val1', 'val2', 'val3', 'id'])
    for i in range(num_items):
        val1 = entries[3 * i]
        val2 = entries[3 * i + 1]
        val3 = entries[3 * i + 2]
        if val3 != '-1':            # entries are strings, so compare against '-1'
            row = pd.Series(
                {
                    'val1': val1,
                    'val2': val2,
                    'val3': val3,
                    'id': line_id,
                }
            )
            df = df.append(row, ignore_index=True)  # slow: copies the frame on every call
    return df
The dataframe will look something like this:
ID val1 val2 val3
0 234 345 112
0 111 333 232
1 ..
1 ..
1 ..
1 ..
2 ..
2 ..
3 ..
3 ..
3 ..
3 ..
3 ..
.....
.....
N ..
Is there a way to do this more efficiently? This is painfully slow.
Here's an example:
# some random txt
# some more random txt
1 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354
2362.39 248.498 58396 1784.7 268.254 59027 1784.7 268.254 -1
2 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354
1190.83 663.957 23056 1258.77 640.354 59070
Upvotes: 0
Views: 84
Reputation: 16747
I've used numpy in my solution, as I'm more familiar with it. I expect my code to be faster; try it on your big data. You can also run this code online here.
import io, numpy as np, pandas as pd

with open('input.txt', 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()
lines = [line.strip() for line in lines]
lines = [line for line in lines if line and not line.startswith('#')]
assert len(lines) % 2 == 0, len(lines)
lines = [
    (
        int(id_line.partition(' ')[0]),
        np.loadtxt(io.StringIO(data_line), dtype=np.float64),
    )
    for id_line, data_line in zip(lines[0::2], lines[1::2])
]
ids = np.concatenate([np.full((v.size // 3,), k, dtype=np.int64) for k, v in lines])
nums = np.concatenate([v.reshape((v.size // 3, 3)) for k, v in lines])
df = pd.DataFrame(index=ids, data=nums, columns=['val1', 'val2', 'val3'])
print(df)
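One note: the question's loop also skipped triples whose val3 was -1, which the vectorized version can do with a single boolean mask. A sketch with made-up sample arrays standing in for the `ids`/`nums` built above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the ids/nums arrays from the solution above.
ids = np.array([1, 1, 1, 2, 2], dtype=np.int64)
nums = np.array([
    [2362.39, 248.498, 58396.0],
    [1784.7, 268.254, 59027.0],
    [1784.7, 268.254, -1.0],      # sentinel row the question filters out
    [1190.83, 663.957, 23056.0],
    [1258.77, 640.354, 59070.0],
])

mask = nums[:, 2] != -1           # keep only rows where val3 != -1
df = pd.DataFrame(index=ids[mask], data=nums[mask],
                  columns=['val1', 'val2', 'val3'])
```

This filters all rows in one shot instead of testing each triple inside a Python loop.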
Upvotes: 1
Reputation: 1
Well, if you just need the transition from a txt file to a DataFrame, you can use pandas.read_csv("path") and you're all set to go.
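For the file in the question, read_csv needs a little help, since pandas infers the column count from the first row and the rows here are ragged. A sketch (the `names=range(9)` width is an assumption sized to the widest line of the sample):

```python
import io
import pandas as pd

# Sample matching the question's file: '#' comments plus ragged rows.
text = """# some random txt
# some more random txt
1 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354
2362.39 248.498 58396 1784.7 268.254 59027 1784.7 268.254 -1
2 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354
1190.83 663.957 23056 1258.77 640.354 59070
"""

# comment='#' drops the header lines; names=range(9) pads shorter
# rows with NaN instead of raising a parser error.
df = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None,
                 comment='#', names=range(9))
```

You would still need to reshape the result into (ID, val1, val2, val3) rows afterwards, so this alone doesn't reproduce the desired output.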
Upvotes: 0