python regex to read text file and split row to column

Question

My text file format is like below

ID  col_A   col_B   col_C
1   0.26    0.11    0.18
2   0.27    0.12    0.17
3   0.21    0.10    0.15
----------------------------
AVG 0.25    0.11    0.17
----------------------------
ID  col_D   col_E   col_F
1   0.23    0.18    0.20
2   0.24    0.14    0.17
3   0.23    0.10    0.13
----------------------------
AVG 0.23    0.14    0.17
----------------------------

I'm attempting to use python and regex to export two separate csv files with the format like below

Table 1

ID	col_A	col_B	col_C	col_D	col_E	col_F
1	0.26	0.11	0.18	0.23	0.18	0.20
2	0.27	0.12	0.17	0.24	0.14	0.17
3	0.21	0.10	0.15	0.23	0.10	0.13

Table 2

	col_A	col_B	col_C	col_D	col_E	col_F
AVG	0.25	0.11	0.17	0.23	0.14	0.17

Here's my code:

import re
import pandas as pd


with open('test.txt') as file:
        lines = file.readlines()
        regex = r'\A(?P\S+)\s*(?P\S+)\s*(?P\S+)\s*(?P\S+)'
        data = []
        
        for line in lines:
            m = re.search(regex, line)
            if m != None:
                data.append([m.group(1),m.group(2),m.group(3),m.group(4)])
        
        df = pd.DataFrame(data)
        
df.to_csv('test.csv', index = False)

My code would result in a strange format like

0	1	2	3
ID	col_A	col_B	col_C
1	0.26	0.11	0.18
2	0.27	0.12	0.17
3	0.21	0.10	0.15
------	---------	---------	---------
AVG	0.25	0.11	0.17
------	---------	---------	---------
ID	col_D	col_E	col_F
1	0.23	0.18	0.20
2	0.24	0.14	0.17
3	0.23	0.10	0.13
------	---------	---------	---------
AVG	0.23	0.14	0.17
------	---------	---------	---------

How can I modify my code to achieve my request? Thank you!

Wiktor Stribiżew · Accepted Answer

You can use

import re, itertools
import pandas as pd

data = []
colnames = []
avg = []
start = True
is_next_avg = False
id_start_label = ''

with open('test.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if start and line.replace('-', '').strip():
            id_start_label = line.split()[0]
            colnames.extend(line.split()[1:])
            start = False
        elif '---' in line and not line.replace('-', '').strip():
            if start:
                continue
            is_next_avg = not is_next_avg
        elif is_next_avg:
            avg.extend(line.split()[1:])
            is_next_avg = not is_next_avg
            start = not start
        else:
            data.append(line.split())

colnames.insert(0, id_start_label)
data = sorted(data, key=lambda x: x[0])
new_data = []
for key, g in itertools.groupby(data, key=lambda x: x[0]):
    entry = [key]
    entry.extend( [i for x in g for i in x[1:]] )
    new_data.append(entry)

df = pd.DataFrame(new_data, columns=colnames)
df2 = pd.DataFrame(avg, columns=['AVG'])
df2 = df2.transpose()
df2.columns = df.columns[1:]

Output:

>>> df
  ID col_A col_B col_C col_D col_E col_F
0  1  0.26  0.11  0.18  0.23  0.18  0.20
1  2  0.27  0.12  0.17  0.24  0.14  0.17
2  3  0.21  0.10  0.15  0.23  0.10  0.13

>>> df2
    col_A col_B col_C col_D col_E col_F
AVG  0.25  0.11  0.17  0.23  0.14  0.17

python regex to read text file and split row to column

Answers (2)

Related Questions