Reputation: 11
I understand there will be no "prefab" option for what I'm trying to do.
I have a series of text files like this that are combined using a combination of grep and sed from other tools.
Example file "stacking-IVT7.dat" and it's content
./stacking_t13/ALL-stacking-13.dat #this is a line in the file, for disambiguation
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked
I want to read them into a pd.dataframe and construct it like this:
Interaction IVT7
13-vs-11DG 0.1178
13-vs-12DT 0.0679
13-vs-14DG 0.8299
You can see I'll bee 'pulling' selectively from the file for the left column name and from the file name for the column header. This seems like a combination problem for pd.read_csv() and re.findall()
I don't know where to begin.. or how to combine these two functions in a meaningful way.
edit:
I've googled and read a fair amount on pd.read_csv(). But it doenst seem to be built do do what I want in the slightest.
I can get it to import structured (csv like) text files successfully and have written a script here that works a treat. https://github.com/PwnusMaximus/md_scripts/blob/0ad82d6dbc096af4422ea625c29f4c0b0bfb4b95/analysis/combine-hbond-avg.py
I also know (rather grossly) how to rip this file apart using sed to get it mostly cleaned up how I want. (this is very inefficient i know)
sed -i '/Undefined/d' *.dat
sed -i 's/stacked//g' *.dat
sed -i 's/*[0-9]\+[A-Z]\+*/[0-9]\+[A-Z]\+/' *.dat
however on the nature of getting pd.read_csv() to actually import this file im at a loss and havent been able to get it to parse in anyway other than
df_final = pd.read_csv('super-duper-stacking-IVT7.dat', header=None)
edit2 clarified the file content vs file name above
Upvotes: 1
Views: 206
Reputation: 25489
You have correctly recognized that there is no ready-made solution for what you're trying to do. You're going to have to read the file line-by-line, and build a data structure containing the information you want.
You can use a regular expression to extract e.g. the 11DG
part of stacking-count-11DG.dat
Consider the following:
import re
import pandas as pd
# This regex captures anything after stacking-count- and before .dat
interaction_regex = re.compile(r"stacking-count-(.*?).dat")
all_data = [] # Empty list to hold all data
current_interaction = ""
with open("stacking-IVT7.dat") as f:
for line in f:
line = line.strip() # Strip the line
if not line: continue # Ig the line is empty, move to the next line
# If the line begins and ends with arrows, it is a filename so try to extract the interaction from it
if line.startswith("==>") and line.endswith("<=="):
inter = interaction_regex.findall(line)
if not inter: continue # if inter is empty, go to the next line
current_interaction = f"13-vs-{inter[0]}" # if not, set the currently active interaction
# If the line doesn't begin and end with arrows, try to extract data from it
# But only if current_interaction is not empty
elif current_interaction:
file_row = line.split() # Split the line on whitespace
if file_row[1] == "stacked":
# If the second element of the row is "stacked",
# Create a tuple containing the current_interaction and the number in this line
df_row = (current_interaction, float(file_row[0]))
all_data.append(df_row) # Append the tuple to our list
df = pd.DataFrame(all_data, columns=["Interaction", "IVT7"]) # Create a dataframe using the data we read
Which gives the following dataframe:
Interaction IVT7
0 13-vs-11DG 0.1178
1 13-vs-12DT 0.0679
2 13-vs-14DG 0.8299
Upvotes: 1