Reputation: 702
This is a small example of a bigger data.
I have a text file like this one below.
Code: 44N
______________
Unit: m
Color: red
Length - Width - Height -
31 - 8 - 6 -
32 - 4 - 3 -
35 - 5 - 6 -
----------------------------------------
Code: 40N
______________
Unit: m
Color: blue
Length - Width - Height -
32 - 3 - 2 -
37 - 2 - 8 -
33 - 1 - 6 -
31 - 5 - 8 -
----------------------------------------
Code: 38N
I would like to get the lines containing the text that starts with " Length" until the line that starts with "----------------------------------------". I would like to do this for every time it happens and then convert each of this new data in a dataframe... maybe adding it to a list of dataframes.
At this example, I should have two dataframes like these ones:
Length Width Height
31 8 6
32 4 3
35 5 6
Length Width Height
32 3 2
37 2 8
33 1 6
31 5 8
I already tried something, but it only saves one text to the list and not both of them. And then I don't know how to convert them to a dataframe.
file = open('test.txt', 'r')
file_str = file.read()
well_list = []
def find_between(data, first, last):
start = data.index(first)
end = data.index(last, start)
return data[start:end]
well_list.append(find_between(file_str, " Length", "----------------------------------------" ))
Anyone could help me?
Upvotes: 0
Views: 74
Reputation: 331
Hey that shows how parsing data can be tricky. Use .split()
method off strings to do the job. Here is a way to do it.
import pandas as pd
import numpy as np
with open('test.txt', 'r') as f:
text = f.read()
data_start = 'Length - Width - Height'
data_end = '----------------------------------------'
# split the text in sections containing the data
sections = text.split(data_start)[1:]
# number of columns in the dataframe
col_names = data_start.split('-')
num_col = len(col_names)
for s in sections:
# remove everything after '------'
s = s.split(data_end)[0]
# now keep the numbers only
data = s.split()
# change string to int and discard '-'
data = [int(d) for d in data if d!='-']
# reshape the data (num_rows, num_cols)
data = np.array(data).reshape((int(len(data)/num_col), num_col))
df = pd.DataFrame(data, columns=col_names)
print(df)
Upvotes: 1