user026
user026

Reputation: 702

How to separate text between specific strings and then convert it to dataframe?

This is a small example of a bigger data.

I have a text file like this one below.

Code: 44N
______________
Unit: m
Color: red

    Length  - Width -  Height  -

    31      -   8   -    6     -
    32      -   4   -    3     -
    35      -   5   -    6     -
----------------------------------------

Code: 40N
______________
Unit: m
Color: blue

    Length  - Width -  Height  -

    32      -   3   -    2     -
    37      -   2   -    8     -
    33      -   1   -    6     -
    31      -   5   -    8     -
----------------------------------------

Code: 38N

I would like to get the lines containing the text that starts with " Length" until the line that starts with "----------------------------------------". I would like to do this for every time it happens and then convert each of this new data in a dataframe... maybe adding it to a list of dataframes.

At this example, I should have two dataframes like these ones:

Length  Width   Height  
  31      8       6     
  32      4       3     
  35      5       6  

Length  Width   Height  
  32      3       2     
  37      2       8     
  33      1       6  
  31      5       8

I already tried something, but it only saves one text to the list and not both of them. And then I don't know how to convert them to a dataframe.

file = open('test.txt', 'r')
file_str = file.read()

well_list = []
def find_between(data, first, last):
    start = data.index(first)
    end = data.index(last, start)
    return data[start:end]

well_list.append(find_between(file_str, "    Length", "----------------------------------------" ))

Anyone could help me?

Upvotes: 0

Views: 74

Answers (1)

Sura-da
Sura-da

Reputation: 331

Hey that shows how parsing data can be tricky. Use .split() method off strings to do the job. Here is a way to do it.

import pandas as pd
import numpy as np

with open('test.txt', 'r') as f:
    text = f.read()

data_start = 'Length  - Width -  Height'
data_end = '----------------------------------------'

# split the text in sections containing the data
sections = text.split(data_start)[1:]

# number of columns in the dataframe
col_names = data_start.split('-')
num_col = len(col_names)

for s in sections:
    # remove everything after '------'
    s = s.split(data_end)[0]
    # now keep the numbers only
    data = s.split()
    # change string to int and discard '-'
    data = [int(d) for d in data if d!='-']
    # reshape the data (num_rows, num_cols)
    data = np.array(data).reshape((int(len(data)/num_col), num_col))
    df = pd.DataFrame(data, columns=col_names)
    print(df)

Upvotes: 1

Related Questions