Reputation: 199
My text file format is like below
ID col_A col_B col_C
1 0.26 0.11 0.18
2 0.27 0.12 0.17
3 0.21 0.10 0.15
----------------------------
AVG 0.25 0.11 0.17
----------------------------
ID col_D col_E col_F
1 0.23 0.18 0.20
2 0.24 0.14 0.17
3 0.23 0.10 0.13
----------------------------
AVG 0.23 0.14 0.17
----------------------------
I'm attempting to use python and regex to export two separate csv files with the format like below
Table 1
ID | col_A | col_B | col_C | col_D | col_E | col_F |
---|---|---|---|---|---|---|
1 | 0.26 | 0.11 | 0.18 | 0.23 | 0.18 | 0.20 |
2 | 0.27 | 0.12 | 0.17 | 0.24 | 0.14 | 0.17 |
3 | 0.21 | 0.10 | 0.15 | 0.23 | 0.10 | 0.13 |
Table 2
col_A | col_B | col_C | col_D | col_E | col_F | |
---|---|---|---|---|---|---|
AVG | 0.25 | 0.11 | 0.17 | 0.23 | 0.14 | 0.17 |
Here's my code:
import re
import pandas as pd
with open('test.txt') as file:
lines = file.readlines()
regex = r'\A(?P<ID>\S+)\s*(?P<COL_A>\S+)\s*(?P<COL_B>\S+)\s*(?P<COL_C>\S+)'
data = []
for line in lines:
m = re.search(regex, line)
if m != None:
data.append([m.group(1),m.group(2),m.group(3),m.group(4)])
df = pd.DataFrame(data)
df.to_csv('test.csv', index = False)
My code would result in a strange format like
0 | 1 | 2 | 3 |
---|---|---|---|
ID | col_A | col_B | col_C |
1 | 0.26 | 0.11 | 0.18 |
2 | 0.27 | 0.12 | 0.17 |
3 | 0.21 | 0.10 | 0.15 |
------ | --------- | --------- | --------- |
AVG | 0.25 | 0.11 | 0.17 |
------ | --------- | --------- | --------- |
ID | col_D | col_E | col_F |
1 | 0.23 | 0.18 | 0.20 |
2 | 0.24 | 0.14 | 0.17 |
3 | 0.23 | 0.10 | 0.13 |
------ | --------- | --------- | --------- |
AVG | 0.23 | 0.14 | 0.17 |
------ | --------- | --------- | --------- |
How can I modify my code to achieve my request? Thank you!
Upvotes: 2
Views: 634
Reputation: 26708
Something like this should work for you:
import re
import pandas as pd
import numpy as np
data = []
headers = []
averages = []
ids = []
with open('/content/text.txt') as file:
for l in file.readlines():
s = re.findall(r'\S+', l.rstrip().split(',')[0])
if '-' not in s[0]:
if 'ID' in s[0]: headers.append(s)
elif 'AVG' in s[0]: averages.append(s)
else:
data.append(s[1:])
ids.append(s[0])
data = np.hstack([data[:len(data)//len(headers)], data[len(data)//len(headers):]])
data = np.concatenate([np.expand_dims(np.transpose(sorted(set(ids))), axis=-1), data], axis=1)
columns = sorted(set(np.concatenate(headers)))
df = pd.DataFrame(data, columns=columns).astype({'ID': 'int64'})
main_df = df.set_index('ID')
Table 1:
col_A col_B col_C col_D col_E col_F
ID
1 0.26 0.11 0.18 0.23 0.18 0.20
2 0.27 0.12 0.17 0.24 0.14 0.17
3 0.21 0.10 0.15 0.23 0.10 0.13
Table 2:
average_df = pd.DataFrame(np.expand_dims(list(map(float, [d for d in np.concatenate(averages) if 'AVG' not in d])), axis=0), columns=columns[1:])
average_df = average_df.rename_axis('AVG')
col_A col_B col_C col_D col_E col_F
AVG
0 0.25 0.11 0.17 0.23 0.14 0.17
Upvotes: 1
Reputation: 626893
You can use
import re, itertools
import pandas as pd
data = []
colnames = []
avg = []
start = True
is_next_avg = False
id_start_label = ''
with open('test.txt', 'r') as file:
for line in file:
line = line.strip()
if start and line.replace('-', '').strip():
id_start_label = line.split()[0]
colnames.extend(line.split()[1:])
start = False
elif '---' in line and not line.replace('-', '').strip():
if start:
continue
is_next_avg = not is_next_avg
elif is_next_avg:
avg.extend(line.split()[1:])
is_next_avg = not is_next_avg
start = not start
else:
data.append(line.split())
colnames.insert(0, id_start_label)
data = sorted(data, key=lambda x: x[0])
new_data = []
for key, g in itertools.groupby(data, key=lambda x: x[0]):
entry = [key]
entry.extend( [i for x in g for i in x[1:]] )
new_data.append(entry)
df = pd.DataFrame(new_data, columns=colnames)
df2 = pd.DataFrame(avg, columns=['AVG'])
df2 = df2.transpose()
df2.columns = df.columns[1:]
Output:
>>> df
ID col_A col_B col_C col_D col_E col_F
0 1 0.26 0.11 0.18 0.23 0.18 0.20
1 2 0.27 0.12 0.17 0.24 0.14 0.17
2 3 0.21 0.10 0.15 0.23 0.10 0.13
>>> df2
col_A col_B col_C col_D col_E col_F
AVG 0.25 0.11 0.17 0.23 0.14 0.17
Upvotes: 1