Makay Murray
Makay Murray

Reputation: 11

Converting irregular text file to orderly dataframe

I understand there will be no "prefab" option for what I'm trying to do.

I have a series of text files like this that are combined using a combination of grep and sed from other tools.

Example file "stacking-IVT7.dat" and it's content

./stacking_t13/ALL-stacking-13.dat   #this is a line in the file, for disambiguation 
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked

I want to read them into a pd.dataframe and construct it like this:

Interaction IVT7
13-vs-11DG  0.1178 
13-vs-12DT  0.0679
13-vs-14DG  0.8299

You can see I'll bee 'pulling' selectively from the file for the left column name and from the file name for the column header. This seems like a combination problem for pd.read_csv() and re.findall()

I don't know where to begin.. or how to combine these two functions in a meaningful way.

edit: I've googled and read a fair amount on pd.read_csv(). But it doenst seem to be built do do what I want in the slightest.
I can get it to import structured (csv like) text files successfully and have written a script here that works a treat. https://github.com/PwnusMaximus/md_scripts/blob/0ad82d6dbc096af4422ea625c29f4c0b0bfb4b95/analysis/combine-hbond-avg.py

I also know (rather grossly) how to rip this file apart using sed to get it mostly cleaned up how I want. (this is very inefficient i know)

 sed -i '/Undefined/d' *.dat 
 sed -i 's/stacked//g' *.dat 
 sed -i 's/*[0-9]\+[A-Z]\+*/[0-9]\+[A-Z]\+/' *.dat 

however on the nature of getting pd.read_csv() to actually import this file im at a loss and havent been able to get it to parse in anyway other than

df_final = pd.read_csv('super-duper-stacking-IVT7.dat', header=None)

edit2 clarified the file content vs file name above

Upvotes: 1

Views: 206

Answers (1)

pho
pho

Reputation: 25489

You have correctly recognized that there is no ready-made solution for what you're trying to do. You're going to have to read the file line-by-line, and build a data structure containing the information you want.

You can use a regular expression to extract e.g. the 11DG part of stacking-count-11DG.dat

Consider the following:

import re
import pandas as pd

# This regex captures anything after stacking-count- and before .dat
interaction_regex = re.compile(r"stacking-count-(.*?).dat") 

all_data = [] # Empty list to hold all data

current_interaction = ""

with open("stacking-IVT7.dat") as f:
    for line in f:
        line = line.strip() # Strip the line
        if not line: continue # Ig the line is empty, move to the next line

        # If the line begins and ends with arrows, it is a filename so try to extract the interaction from it
        if line.startswith("==>") and line.endswith("<=="):
            inter = interaction_regex.findall(line)
            if not inter: continue                     # if inter is empty, go to the next line
            current_interaction = f"13-vs-{inter[0]}"  # if not, set the currently active interaction

        # If the line doesn't begin and end with arrows, try to extract data from it
        # But only if current_interaction is not empty
        elif current_interaction:                      
            file_row = line.split()        # Split the line on whitespace
            if file_row[1] == "stacked":   
                # If the second element of the row is "stacked", 
                # Create a tuple containing the current_interaction and the number in this line
                df_row = (current_interaction, float(file_row[0])) 
                all_data.append(df_row) # Append the tuple to our list
                
                
df = pd.DataFrame(all_data, columns=["Interaction", "IVT7"])  # Create a dataframe using the data we read

Which gives the following dataframe:

  Interaction    IVT7
0  13-vs-11DG  0.1178
1  13-vs-12DT  0.0679
2  13-vs-14DG  0.8299

Upvotes: 1

Related Questions