Reputation: 31
How to extract the text between two known words in a string, with the condition that the text between these words can be (i) 1 character, (ii) 1 word, (iii) 2 words, etc.?
Sample Text:
text = ("MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO")
I have a string like the one above that has multiple occurrences of the two known words 'MNOTES' and 'GEO'; however, the text between them can be anything and any number of words.
Sometimes I want to extract only the text that has a single character between the two known words, sometimes the text that has 2 words between them, sometimes the text that has 6 words between them, and so on. How can I extract the text subject to such a condition?
Upvotes: 3
Views: 4610
Reputation: 27515
Use re.findall.
import re
re.findall('MNOTES(.*?)GEO', text)
This results in:
[' - ', ' 20 231-0005 ', ' SOME REVISION MNOTES CASUAL C ', ' F232322500 MNOTES HELP PAGES ', ' SHEET 1 OF 3 ', ' CASUAL E. ', ' SITPOPE/TIN AY ', ' R ', ' 22+0436/T.SKI/11-AUG-1986 ', ' 231-0045 ']
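Note that the third and fourth results above still contain a second MNOTES, because the lazy match starts at the first MNOTES it finds and the matches keep their surrounding spaces. A minimal sketch, assuming the goal is the text after the nearest MNOTES and that the padding is unwanted: a negative lookahead keeps the capture from crossing another MNOTES, and strip() removes the spaces. The names pattern and segments are illustrative only.

```python
import re

text = (
    "MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO "
    "MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO "
    "MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO "
    "MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO "
    "MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO "
    "MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO"
)

# (?:(?!MNOTES).)*? consumes characters lazily, but refuses to step onto
# the start of another "MNOTES", so each capture belongs to the nearest one.
pattern = re.compile(r'MNOTES((?:(?!MNOTES).)*?)GEO')

# Strip the spaces that surround each captured segment.
segments = [s.strip() for s in pattern.findall(text)]
print(segments)
# ['-', '20 231-0005', 'CASUAL C', 'HELP PAGES', 'SHEET 1 OF 3', 'CASUAL E.',
#  'SITPOPE/TIN AY', 'R', '22+0436/T.SKI/11-AUG-1986', '231-0045']
```

With this variant, an MNOTES that is followed by another MNOTES before any GEO (such as "MNOTES SOME REVISION") produces no match of its own.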
Edit
To get a specific number of characters, the following will work (a raw string is used so that \s is not treated as an invalid string escape):
re.findall(r'MNOTES\s?(.{1})\s?GEO', text)
Results in
['-', 'R']
To get only results that are 6 to 8 characters long:
re.findall(r'MNOTES\s?(.{6,8})\s?GEO', text)
Results:
['- GEO ', 'CASUAL C', 'R GEO ', '231-0045']
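Since the question asks about word counts rather than character counts, one possible approach is to extract every segment first and filter afterwards. This is a sketch rather than part of the original answer: the helper name by_word_count is hypothetical, and a "word" is assumed to be any whitespace-separated token.

```python
import re

text = (
    "MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO "
    "MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO "
    "MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO "
    "MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO "
    "MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO "
    "MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO"
)

def by_word_count(text, n):
    """Return the segments between MNOTES and GEO that contain exactly n words."""
    # Same lazy pattern as in the answer; each segment is stripped of padding.
    segments = (s.strip() for s in re.findall(r'MNOTES(.*?)GEO', text))
    # str.split() with no arguments splits on any run of whitespace.
    return [s for s in segments if len(s.split()) == n]

print(by_word_count(text, 1))  # ['-', 'R', '22+0436/T.SKI/11-AUG-1986', '231-0045']
print(by_word_count(text, 2))  # ['20 231-0005', 'CASUAL E.', 'SITPOPE/TIN AY']

# A single character is the special case of a one-word segment of length 1.
print([s for s in by_word_count(text, 1) if len(s) == 1])  # ['-', 'R']
```

Filtering after extraction avoids the '- GEO ' style results shown above, where a fixed-width `.{6,8}` capture swallows one of the delimiter words.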
Upvotes: 4