Reputation: 1252
Thought I'd just update the start of this question for people who come across it in future. Regex was not the optimal solution for my particular problem, and trying to regex complicated, separated patterns (my logic from the start) in one go wasn't ideal.
The answer to the question as stated would be to try separate regexes I think, and 'filter out' the stuff needed.
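For anyone landing here who does want the regex route, this is a minimal sketch of the 'separate regexes' idea: one pattern for the hit ID at the start of the row, and a second one anchored at the end of the row for the numeric columns, so the free-text description in between never has to be matched. The names HIT_RE, METRICS_RE and parse_row are made up for illustration, and the patterns only assume the column layout visible in the sample table further down, so treat it as a starting point rather than a tested parser.

```python
import re

# Hypothetical names; patterns are assumptions based on the sample table rows.
# Row number, then a PDB-style hit ID such as 3izo_F.
HIT_RE = re.compile(r'^\s*\d+\s+(\w{4}_\w)\s')

# Prob, E-value, P-value, Score are captured. The remaining columns
# (SS, Cols, query range, template range and length) are matched but not
# captured, purely to anchor the pattern at the end of the line.
METRICS_RE = re.compile(
    r'\s(\d{1,3}\.\d)\s+(\S+)\s+(\S+)\s+(-?\d+\.\d)'
    r'\s+-?\d+\.\d\s+\d+\s+\d+-\d+\s+\d+-\d+\s*\(\d+\)\s*$'
)

def parse_row(line):
    """Return (hit_id, prob, evalue, pvalue, score), or None for non-table lines."""
    hit = HIT_RE.search(line)
    metrics = METRICS_RE.search(line)
    if not (hit and metrics):
        return None
    prob, evalue, pvalue, score = (float(x) for x in metrics.groups())
    return hit.group(1), prob, evalue, pvalue, score

sample = "  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)"
result = parse_row(sample)
print(result)
```

Because the second pattern counts columns back from the end of the line, numbers inside the hit description cannot be mistaken for the probability. Lines that are not table rows simply return None, which gives the 'filter out' behaviour.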
My file could be worked on with the pandas.read_fwf() solution for optimal results, so I chose that as the full answer.
I'm sure this has been asked somewhere before but I can't find a question that is exactly trying to do what I want - so my apologies in advance.
TLDR How would you regex for several different patterns in a line that are not located next to each other, or properly delimited? Am I wrong to be trying to do this in one move?
I have some strings in a pretty verbose file (see end of post) that I want to pull out. I want multiple bits of information from different columns within a line (though they are not properly delimited).
I know I can get this into match.group(), which would be perfect (because I intend to use each element I pull out later in isolation), except I can't figure out how to match several substrings that are physically separated from one another in the string (unless trying to do this in one go is just wrong?).
I can extract the table part that I want with some simple regex no problem:
#!/usr/bin/python
import re
import sys

hhresult_file = sys.argv[1]  # The result file (shown at the end of the post)
regex = re.compile(r'\s*\d{1,2}\s\w{4}_\w\s.*')  # Will match the whole line (my first shot at the problem)

def main():
    with open(hhresult_file, 'r') as result_fasta:
        lines = result_fasta.readlines()
    for line in lines:
        match = regex.search(line)
        if match:
            print(match.group())

if __name__ == '__main__':
    main()
But I'm also trying to pull out the columns which read "Hit" "Prob" "E-Value" "P-Value".
I think I can synthesise the required regexes for each individual field (there are some nuances, like the switch between exponent (E) notation and plain floats, for example).
What I don't know how to do is 'disregard' regions of the string? Specifically, I can't get the 'Hit' (= 3izo_F) and then the 'Prob' field because of the hit description in the intervening space.
I was trying to go about it with grouped regexes, but without being physically adjacent it doesn't work (something like these, though there may be errors in them):
regex = re.compile(r'''
    (\w{4}_\w)                          # Match the hit
    (\d{1,3}\.\d)                       # Match the probability score
    (\d\.?\d?|\d\.?\d?E-\d\d|\d\.\d*)   # E value as float/E-
    (\d\.?\d?|\d\.?\d?E-\d\d|\d\.\d*)   # Match SI or float P value
    (\d+\.\d+)                          # Match the score
    ''', re.VERBOSE)
The file in question:
Query PAU_03380 PAU_03380 hypothetical protein 3919442:3920968 reverse MW:51681
Match_columns 508
No_of_seqs 1 out of 1
Neff 1.0
Searched_HMMs 37488
Date Mon May 23 20:23:54 2016
Command hhsearch -cpu 10 -i /home/wms_joe/PVCs/PVC_operons/prot_all/PAU_03380.faa -d /home/wms_joe/Applications/HHSuite/databases/pdb70/pdb70_hhm.ffdata -B 5 -Z 5 -E 1E-03 -nocons -nopred -nodssp

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
  3 1ocy_A Bacteriophage T4 short   97.6 1.8E-07 4.7E-12   80.4   0.0   85  323-418    10-122 (198)
  4 1v1h_A Fibritin, fiber protein  96.1 0.00011   3E-09   60.4   0.0   30  167-198     2-31 (103)
  5 1v1h_A Fibritin, fiber protein  95.9 0.00019 5.1E-09   59.1   0.0   10  168-177    41-50 (103)
  6 1pdi_A Short tail fiber protei  95.6 0.00041 1.1E-08   63.3   0.0   26  323-348    90-116 (278)
  7 2xgf_A Long tail fiber protein  94.1   0.005 1.3E-07   55.1   0.0   31  318-348    22-52 (242)
  8 1h6w_A Bacteriophage T4 short   84.7    0.25 6.7E-06   47.1   0.0   27  323-349   255-282 (312)
  9 1qiu_A Adenovirus fibre; fibre  79.9    0.54 1.4E-05   44.4   0.0   24   92-115     7-30 (264)
 10 3s6x_A Outer capsid protein si  72.0     1.3 3.4E-05   43.6   0.0   69  106-191    44-112 (325)
No 1
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=98.13 E-value=2.7e-09 Score=107.58 Aligned_cols=65 Identities=22% Similarity=0.362 Sum_probs=42.7
Q PAU_03380 93 PLILKDDVLSVDLGSGLTNETNGICVGQGDGITVNTSNVAVKQGNGISVTSSGGVAVKVSANKGLSVD 160 (508)
||-+.++-|.++....|+...+++.+--+++++|+.....++....++++ .+++++++. .||.++
T 3izo_F 104 PLTVTSEALTVAAAAPLMVAGNTLTMQSQAPLTVHDSKLSIATQGPLTVS-EGKLALQTS--GPLTTT 168 (581)
Confidence 55555556666666667777777777777777777776777777777764 566666554 355554
No 2
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=97.60 E-value=1.3e-07 Score=95.57 Aligned_cols=156 Identities=19% Similarity=0.323 Sum_probs=85.6
Q PAU_03380 156 GLSVDSSGVAVKVNTDKGISVDGNGVAVKVNTSKGISVDNTGVAVIANASKGISVDGSGV--------------AVIANT 221 (508)
.|.+..++-.+.+++..|+.|.++.+.+|+ ..++.+++.|- +-.+...|+.++...- .+..+.
T 3izo_F 210 PLHVTDDLNTLTVATGPGVTINNTSLQTKV--TGALGFDSQGN-MQLNVAGGLRIDSQNRRLILDVSYPFDAQNQLNLRL 286 (581)
Confidence 344544434556666667777666655443 23333333221 1111222333332211 234445
It goes on a bit but is just more of the above 2 alignments.
UPDATE 1
Just to provide an example of what I'd ideally like at the end:
Given the line in the 'short table':
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
I'd like to get either a delimited string, or separate match.group() values, for:
The PDB Hit ID == 3izo_F
Each of the first 4 metrics (as separate groups ideally, but I could deal with that after the fact) == 98.1, 2.7E-09, 7.3E-14, 107.6
Such a shame this program doesn't just provide a proper tabular output :(
Upvotes: 1
Views: 179
Reputation: 18995
It is possible to use pandas.read_fwf to read the tabular portion, but because your table headers are malformed (i.e. sometimes a space is part of a variable name, as in Query HMM, and sometimes it separates variable names, as in SS and Cols) you are going to have to specify the column widths.
I like to use a template row to do this.
from io import StringIO
yourTemplate= \
"""
---|-------------------------------|----|-------|-------|------|-----|----|---------|--------------|
 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
"""
yourPattern = StringIO(yourTemplate).readlines()[1]
colBreaks = [i for i, ch in enumerate(yourPattern) if ch == '|']
yourWidths = [j-i for i, j in zip( ([0]+colBreaks)[:-1], colBreaks ) ]
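To make the template trick concrete, here is a toy version with a three-column pattern. The short pattern and row below are invented for illustration and are not taken from the real file. It shows what the break positions and widths come out as, and how those widths carve up a row:

```python
# Toy pattern: each '|' marks where one column ends and the next begins.
pattern = "---|-------|-----|"
row     = "  1 3izo_F   98.1"

# Index of every '|' in the pattern line
col_breaks = [i for i, ch in enumerate(pattern) if ch == '|']

# Width of each column = distance between consecutive break positions
widths = [end - start for start, end in zip([0] + col_breaks[:-1], col_breaks)]

# Slice the row using the widths, the same way a fixed-width reader would
fields, start = [], 0
for width in widths:
    fields.append(row[start:start + width].strip())
    start += width

print(col_breaks)
print(widths)
print(fields)
```

The point of drawing the pattern by eye above a real row is that you never have to count characters by hand: you only have to make sure each '|' sits on a column of whitespace.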
Then we can go back to your file.
yourText= \
"""Neff 1.0
Searched_HMMs 37488
Date Mon May 23 20:23:54 2016
Command hhsearch -cpu 10 -i /home/wms_joe/PVCs/PVC_operons/prot_all/PAU_03380.faa -d /home/wms_joe/Applications/HHSuite/databases/pdb70/pdb70_hhm.ffdata -B 5 -Z 5 -E 1E-03 -nocons -nopred -nodssp

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
  3 1ocy_A Bacteriophage T4 short   97.6 1.8E-07 4.7E-12   80.4   0.0   85  323-418    10-122 (198)
  4 1v1h_A Fibritin, fiber protein  96.1 0.00011   3E-09   60.4   0.0   30  167-198     2-31 (103)
  5 1v1h_A Fibritin, fiber protein  95.9 0.00019 5.1E-09   59.1   0.0   10  168-177    41-50 (103)
  6 1pdi_A Short tail fiber protei  95.6 0.00041 1.1E-08   63.3   0.0   26  323-348    90-116 (278)
  7 2xgf_A Long tail fiber protein  94.1   0.005 1.3E-07   55.1   0.0   31  318-348    22-52 (242)
  8 1h6w_A Bacteriophage T4 short   84.7    0.25 6.7E-06   47.1   0.0   27  323-349   255-282 (312)
  9 1qiu_A Adenovirus fibre; fibre  79.9    0.54 1.4E-05   44.4   0.0   24   92-115     7-30 (264)
 10 3s6x_A Outer capsid protein si  72.0     1.3 3.4E-05   43.6   0.0   69  106-191    44-112 (325)
No 1
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=98.13 E-value=2.7e-09 Score=107.58 Aligned_cols=65 Identities=22% Similarity=0.362 Sum_probs=42.7
Q PAU_03380 93 PLILKDDVLSVDLGSGLTNETNGICVGQGDGITVNTSNVAVKQGNGISVTSSGGVAVKVSANKGLSVD 160 (508)
||-+.++-|.++....|+...+++.+--+++++|+.....++....++++ .+++++++. .||.++
T 3izo_F 104 PLTVTSEALTVAAAAPLMVAGNTLTMQSQAPLTVHDSKLSIATQGPLTVS-EGKLALQTS--GPLTTT 168 (581)
Confidence 55555556666666667777777777777777777776777777777764 566666554 355554
No 2
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=97.60 E-value=1.3e-07 Score=95.57 Aligned_cols=156 Identities=19% Similarity=0.323 Sum_probs=85.6
Q PAU_03380 156 GLSVDSSGVAVKVNTDKGISVDGNGVAVKVNTSKGISVDNTGVAVIANASKGISVDGSGV--------------AVIANT 221 (508)
.|.+..++-.+.+++..|+.|.++.+.+|+ ..++.+++.|- +-.+...|+.++...- .+..+.
T 3izo_F 210 PLHVTDDLNTLTVATGPGVTINNTSLQTKV--TGALGFDSQGN-MQLNVAGGLRIDSQNRRLILDVSYPFDAQNQLNLRL 286 (581)
Confidence 344544434556666667777666655443 23333333221 1111222333332211 234445
"""
We note that to get to the tabular portion (starting with the header) we need to skip 5 rows, then keep 10 rows.
import pandas as pd
yourData = pd.read_fwf(StringIO(yourText), skiprows=5, nrows=10, header=0, widths = yourWidths)
print(yourData.dtypes)
print(yourData)
This should give you what you want, in tabular form:
No                int64
Hit              object
Prob            float64
E-value         float64
P-value         float64
Score           float64
SS              float64
Cols              int64
Query HMM        object
Template HMM     object
dtype: object
   No                             Hit  Prob       E-value       P-value  \
0   1  3izo_F Fiber; pentameric pento  98.1  2.700000e-09  7.300000e-14
1   2  3izo_F Fiber; pentameric pento  97.6  1.300000e-07  3.400000e-12
2   3   1ocy_A Bacteriophage T4 short  97.6  1.800000e-07  4.700000e-12
3   4  1v1h_A Fibritin, fiber protein  96.1  1.100000e-04  3.000000e-09
4   5  1v1h_A Fibritin, fiber protein  95.9  1.900000e-04  5.100000e-09
5   6  1pdi_A Short tail fiber protei  95.6  4.100000e-04  1.100000e-08
6   7  2xgf_A Long tail fiber protein  94.1  5.000000e-03  1.300000e-07
7   8   1h6w_A Bacteriophage T4 short  84.7  2.500000e-01  6.700000e-06
8   9  1qiu_A Adenovirus fibre; fibre  79.9  5.400000e-01  1.400000e-05
9  10  3s6x_A Outer capsid protein si  72.0  1.300000e+00  3.400000e-05

   Score   SS  Cols  Query HMM   Template HMM
0  107.6  0.0    65     93-160  104-168 (581)
1   95.6  0.0   156    156-317  210-388 (581)
2   80.4  0.0    85    323-418   10-122 (198)
3   60.4  0.0    30    167-198     2-31 (103)
4   59.1  0.0    10    168-177    41-50 (103)
5   63.3  0.0    26    323-348   90-116 (278)
6   55.1  0.0    31    318-348    22-52 (242)
7   47.1  0.0    27    323-349  255-282 (312)
8   44.4  0.0    24     92-115     7-30 (264)
9   43.6  0.0    69    106-191   44-112 (325)
The pandas syntax to access these values is quite straightforward, as in yourData.loc[3,'Prob'].
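A few common access patterns, sketched on a small hand-built stand-in for yourData (only the first four rows and three columns, typed in from the output above). The variable names prob_row3, pdb_ids and confident are just for illustration:

```python
import pandas as pd

# Hand-built stand-in for yourData: first four rows, three columns only.
yourData = pd.DataFrame({
    'Hit': ['3izo_F Fiber; pentameric pento',
            '3izo_F Fiber; pentameric pento',
            '1ocy_A Bacteriophage T4 short',
            '1v1h_A Fibritin, fiber protein'],
    'Prob': [98.1, 97.6, 97.6, 96.1],
    'E-value': [2.7e-09, 1.3e-07, 1.8e-07, 0.00011],
})

# A single cell, by row label and column name
prob_row3 = yourData.loc[3, 'Prob']

# The PDB ID is the first whitespace-separated token of the Hit column
pdb_ids = yourData['Hit'].str.split().str[0]

# Rows whose probability is above 97
confident = yourData[yourData['Prob'] > 97]

print(prob_row3)
print(pdb_ids.tolist())
print(confident)
```

Splitting the Hit column this way is what recovers the bare ID (3izo_F and so on) that the question asks for, since the fixed-width column holds the ID and the truncated description together.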
Upvotes: 1
Reputation: 5304
You have two parts in your data file. One is a compact table:
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
  3 1ocy_A Bacteriophage T4 short   97.6 1.8E-07 4.7E-12   80.4   0.0   85
Its fields have fixed positions, so instead of regular expressions you can use simple substring slices:
line[4:34] for hit
line[36:40] for prob
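As a rough sketch of the slicing approach, assuming the row keeps its original fixed-width spacing. Only the [4:34] and [36:40] positions come from above; the slices for E-value, P-value and Score are my own reading of the sample row and should be checked against the real file:

```python
# One table row with its fixed-width spacing intact
line = "  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)"

hit = line[4:34].strip()      # ID plus truncated description
pdb_id = hit.split()[0]       # just the ID, e.g. 3izo_F
prob = float(line[36:40])

# Assumed positions for the remaining metrics (not given in the answer)
evalue = float(line[41:48])
pvalue = float(line[49:56])
score = float(line[57:63])

print(pdb_id, prob, evalue, pvalue, score)
```

float() accepts both the plain (0.00011) and the exponent (2.7E-09) forms, and ignores surrounding spaces, so the switch between the two notations needs no special handling.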
But that table has a trimmed hit field. If you want its full content, you have to parse the second part of the file, and multiline regular expressions are a good choice for that. This one finds the hit, probability and E-value; feel free to expand it.
re.compile(r"No \d*\n>([^\n]*)\nProbab=([\d\.e\-]*).*E-value=([\d\.e\-]*).*", re.MULTILINE)
But that part of the file does not contain P-value. So it seems that you will have to combine these methods.
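Here is a small sketch of that regex in use, run over two alignment headers copied from the question's file. The variable names detail_re, sample and hits are only for illustration, and the loop assumes every block has both a Probab and an E-value field on the same line:

```python
import re

# The regex from above, unchanged
detail_re = re.compile(
    r"No \d*\n>([^\n]*)\nProbab=([\d\.e\-]*).*E-value=([\d\.e\-]*).*",
    re.MULTILINE,
)

# Two alignment headers from the detailed part of the file
sample = """No 1
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=98.13 E-value=2.7e-09 Score=107.58 Aligned_cols=65 Identities=22% Similarity=0.362 Sum_probs=42.7
No 2
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=97.60 E-value=1.3e-07 Score=95.57 Aligned_cols=156 Identities=19% Similarity=0.323 Sum_probs=85.6
"""

hits = []
for m in detail_re.finditer(sample):
    description, probab, evalue = m.groups()
    pdb_id = description.split(None, 1)[0]   # first token of the '>' line
    hits.append((pdb_id, float(probab), float(evalue)))

for hit in hits:
    print(hit)
```

To combine the two methods, the (ID, probability) pairs collected here could be matched up with the rows of the compact table, which is where the P-value lives.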
Upvotes: 1