Reputation: 6080
I have a big text file and I would like to extract only the numbers that follow certain phrases/words.
There are dozens of lines in this huge text file in the following format:
Best CV Model for car: 15778 is order:2 threshold: 0 with AUC of : 0.7185 gene aau_roc: 0.466281
One solution is to just look at the number after each of "for car: X", "is order: X", "threshold: X", "AUC of : Y" and "gene aau_roc: X".
At the end I would like to have 15778, 2, 0, 0.7185, 0.466281 for each line.
Upvotes: 5
Views: 3597
Reputation: 104712
Since you've already tagged your question with regex, I suspect you're already close to a solution. You can write a regular expression pattern that will match all the numbers on your line. Something like:
pattern = r"for car: (\d+) is order:(\d+) threshold: (\d+) with AUC of : ([0-9.]+) gene aau_roc: ([0-9.]+)"
Note, I've made this to exactly match your example string, including some odd spacing around the : characters in a few places. Double check that it actually works with your real data.
To use this to do a search of your text file, I'd use re.finditer to search over the whole text. It returns an iterator of match objects, and each match's groups() gives the captured numbers as strings:
import re
for match in re.finditer(pattern, text):
    model, order, threshold, auc, aau_roc = match.groups()
    do_stuff()
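As a minimal end-to-end sketch of this approach: the sample text below is made up from the line in the question, and the conversion to int/float is an assumption about what you want to do with the values. Note that re.finditer yields match objects, so the captured strings are read through match.groups().

```python
import re

# Pattern from the answer; it mirrors the sample line's spacing exactly.
pattern = r"for car: (\d+) is order:(\d+) threshold: (\d+) with AUC of : ([0-9.]+) gene aau_roc: ([0-9.]+)"

# Hypothetical stand-in for the contents of the text file.
text = (
    "Best CV Model for car: 15778 is order:2 threshold: 0 "
    "with AUC of : 0.7185 gene aau_roc: 0.466281\n"
    "some unrelated line\n"
)

rows = []
for match in re.finditer(pattern, text):
    # groups() returns the five captured substrings in order.
    model, order, threshold, auc, aau_roc = match.groups()
    rows.append((int(model), int(order), int(threshold), float(auc), float(aau_roc)))

print(rows)  # [(15778, 2, 0, 0.7185, 0.466281)]
```

Lines that do not fit the pattern are simply skipped, which is why the unrelated line produces no row.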
Upvotes: 2
Reputation: 39698
re.search(r'(?<=for car: )\d+', the_line).group()
Repeat this for each of the other labels you need (using [0-9.]+ for the decimal values), and store the results in your desired output.
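A rough sketch of that repetition, assuming the labels are spelled and spaced exactly as in the sample line (the labels list and variable names here are illustrative, not from the question). Python lookbehinds must be fixed-width, which a literal label satisfies.

```python
import re

line = ("Best CV Model for car: 15778 is order:2 threshold: 0 "
        "with AUC of : 0.7185 gene aau_roc: 0.466281")

# Labels copied from the sample line, odd spacing included (assumed stable).
labels = ["for car: ", "is order:", "threshold: ", "AUC of : ", "aau_roc: "]

values = []
for label in labels:
    # Lookbehind: match digits/dots only where they directly follow the label.
    m = re.search(r"(?<=" + re.escape(label) + r")[0-9.]+", line)
    values.append(m.group(0) if m else None)

print(values)  # ['15778', '2', '0', '0.7185', '0.466281']
```

A label that is missing from a line yields None instead of raising, so incomplete lines can be detected afterwards.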
Upvotes: 0
Reputation: 45542
>>> if line.startswith('Best CV Model'):
...     re.findall(r'\d+\.{0,1}\d*', line)
...
['15778', '2', '0', '0.7185', '0.466281']
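To apply the same idea to a whole file, something like the following sketch could work. The helper name extract_numbers is hypothetical, and io.StringIO stands in for an opened file so the example is self-contained.

```python
import io
import re

# Same idea as above: an integer, optionally followed by a decimal part.
number = re.compile(r"\d+\.?\d*")

def extract_numbers(lines):
    """Yield a list of floats for each line starting with 'Best CV Model'."""
    for line in lines:
        if line.startswith("Best CV Model"):
            yield [float(tok) for tok in number.findall(line)]

# Stand-in for a real file object such as open("results.txt") (hypothetical name).
sample = io.StringIO(
    "header line 123\n"
    "Best CV Model for car: 15778 is order:2 threshold: 0 "
    "with AUC of : 0.7185 gene aau_roc: 0.466281\n"
)

results = list(extract_numbers(sample))
print(results)  # [[15778.0, 2.0, 0.0, 0.7185, 0.466281]]
```

This grabs every number on a matching line, so it only works as long as the labels themselves contain no digits.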
Upvotes: 4