Converting irregular text file to orderly dataframe

Question

I understand there will be no "prefab" option for what I'm trying to do.

I have a series of text files like the one below, assembled from the output of other tools using grep and sed.

Example file "stacking-IVT7.dat" and its contents:

./stacking_t13/ALL-stacking-13.dat   #this is a line in the file, for disambiguation 
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked

I want to read them into a pandas DataFrame and lay it out like this:

Interaction IVT7
13-vs-11DG  0.1178 
13-vs-12DT  0.0679
13-vs-14DG  0.8299

You can see I'll be pulling selectively from the file contents for the row labels in the left column, and from the file name for the column header. This seems like a problem for a combination of pd.read_csv() and re.findall().

I don't know where to begin, or how to combine these two functions in a meaningful way.
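
As a rough illustration of the "pulling" described above, here is a minimal sketch of the regex side on its own. The helper name column_name and both patterns are assumptions based only on the sample file shown, not a tested solution for every file.

```python
import re
from pathlib import Path

# Assumed patterns, guessed from the sample file in the question.
HEADER_RE = re.compile(r"==> stacking-count-(\w+)\.dat <==")
STACKED_RE = re.compile(r"^\s*([\d.]+)\s+stacked\s*$", re.MULTILINE)


def column_name(path):
    """Hypothetical helper: take the text after the last hyphen in the file stem."""
    return Path(path).stem.rsplit("-", 1)[-1]


sample = """./stacking_t13/ALL-stacking-13.dat
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
"""

residues = HEADER_RE.findall(sample)  # labels taken from the "==> ... <==" lines
values = STACKED_RE.findall(sample)   # numbers (as strings) from the "stacked" lines
print(column_name("stacking-IVT7.dat"), residues, values)
```

re.findall() returns strings, so the values would still need converting with float() before going into a DataFrame.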

Edit: I've googled and read a fair amount about pd.read_csv(), but it doesn't seem to be built to do what I want in the slightest.
I can get it to import structured (CSV-like) text files successfully, and have written a script that works a treat: https://github.com/PwnusMaximus/md_scripts/blob/0ad82d6dbc096af4422ea625c29f4c0b0bfb4b95/analysis/combine-hbond-avg.py

I also know (rather crudely) how to rip this file apart using sed to get it mostly cleaned up the way I want (this is very inefficient, I know):

 sed -i '/Undefined/d' *.dat 
 sed -i 's/stacked//g' *.dat 
 sed -i 's/.*-\([0-9]\+[A-Z]\+\)\.dat.*/\1/' *.dat

However, when it comes to getting pd.read_csv() to actually import this file, I'm at a loss, and I haven't been able to get it to parse in any way other than:

df_final = pd.read_csv('super-duper-stacking-IVT7.dat', header=None)
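
For comparison, one way to skip pd.read_csv() entirely is to walk the file line by line and build the frame from a plain dict. This is only a sketch under assumptions: the function names parse_stacking and to_frame are hypothetical, the reference number ("13") is assumed to come from the first line of the file, and each "==> ... <==" header is assumed to be followed by one "stacked" line.

```python
import re


def parse_stacking(text):
    """Return (reference, {partner: stacked_fraction}) for one file's text."""
    reference = None  # e.g. "13", taken from the first path-like line
    partner = None    # label from the most recent "==> ... <==" header
    stacked = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop any trailing '#' comment
        if not line:
            continue
        header = re.fullmatch(r"==>\s*stacking-count-(\w+)\.dat\s*<==", line)
        if header:
            partner = header.group(1)
            continue
        fields = line.split()
        if len(fields) == 2 and fields[1] == "stacked" and partner is not None:
            stacked[partner] = float(fields[0])
        elif reference is None:
            first = re.search(r"-(\d+)\.dat$", line)
            if first:
                reference = first.group(1)
    return reference, stacked


def to_frame(text, column):
    """Build a one-column DataFrame indexed by '<reference>-vs-<partner>'."""
    import pandas as pd  # imported here so the parsing step runs without pandas

    reference, stacked = parse_stacking(text)
    index = pd.Index([f"{reference}-vs-{p}" for p in stacked], name="Interaction")
    return pd.DataFrame({column: list(stacked.values())}, index=index)


sample = """./stacking_t13/ALL-stacking-13.dat
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked
"""

reference, stacked = parse_stacking(sample)
```

Calling to_frame(sample, "IVT7") would then give a single IVT7 column; frames built this way from several files could be joined side by side with pd.concat(..., axis=1), assuming they share the same interaction labels.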

Edit 2: clarified the file content vs. file name above.
