Reputation: 107
I have this text_string list of strings. I want to extract all the text_i_want items as one string.
text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
I want to match everything captured by the regex below (everything after "aaaaaa aaaaaaa', '"), up to the second-to-last occurrence of the following pattern: ', '
import re
ans = re.compile(r'aaaaaa\s+aaaaaaa\',\s+\'(.*)', flags = re.DOTALL | re.MULTILINE)
ans_text = ans.search(str(text_string)).group(1)
returns
'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want'
So far, my regex successfully matches the start of the string but does not stop where I want it to (at the second-to-last pattern). I have no idea how to translate "until the second-to-last occurrence of ', '" into regex. Any help is appreciated.
Also, I want to do this with re because I have hundreds of these lists, which are all identical EXCEPT for the number of times 'text_i_want.' occurs.
Example list, where the solution code should also work:
text_list2 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
UPDATED QUESTION:
Some lists have a different ending than I described. The good news is that they contain a specific string that makes them identifiable.
import re
text_list3 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'ttttttt tttttt', 'text_i_dont_want', 'text_i_dont_want', 'text_i_dont_want']
rx = re.compile(r'aaaaaa\s+aaaaaaa$')
# Indices of the items that end with the start marker
ans = [i for (i, x) in enumerate(text_list3) if rx.search(x)]
pattern = 'ttttttt tttttt'
if ans:
    ans_text = text_list3[ans[0]+1:-2]
    if pattern in text_list3:
        ans_text = text_list3[ans[0]+1:-3]
Upvotes: 1
Views: 125
Reputation: 626689
You can find the item in text_string (which is not a string but a list) that ends with the aaaaaa aaaaaaa string, and then use slicing to get all the items from the next one up to the one you need:
text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
start = [i for (i, x) in enumerate( text_string ) if x.endswith('aaaaaa aaaaaaa')][0]
print( text_string[start+1:-2] )
# => ['text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.']
See the Python demo
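Since the question asks for the result as one string, the sliced items can be joined afterwards. This is a minimal sketch, not part of the original answer; the single-space separator and the variable name wanted are assumptions, so swap in whatever separator fits your data:

```python
text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']

# Index of the first item that ends with the start marker
start = [i for (i, x) in enumerate(text_string) if x.endswith('aaaaaa aaaaaaa')][0]

# Take everything after that item except the last two, then join.
# The ' ' separator is an assumption; use '\n' or '' if that suits better.
wanted = ' '.join(text_string[start + 1:-2])
print(wanted)
```

Note that [0] raises an IndexError if no item ends with the marker, so guard it if some lists may lack it.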
If you prefer to check for the trailing aaaaaa aaaaaaa with re, you can use:
import re
text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
rx = re.compile(r'aaaaaa\s+aaaaaaa$')
start = [i for (i, x) in enumerate( text_string ) if rx.search(x)]
if start:
    print( text_string[start[0]+1:-2] )
See this Python demo.
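For the lists from the updated question, the same idea might be wrapped in a small helper. This is only a sketch under assumptions: the function name extract_wanted and the tail parameter are hypothetical, and it assumes the wanted items stop just before the 'ttttttt tttttt' item when that item is present, and two items before the end otherwise:

```python
import re

START_RX = re.compile(r'aaaaaa\s+aaaaaaa$')
END_MARKER = 'ttttttt tttttt'  # marker item taken from text_list3 in the question


def extract_wanted(items, tail=2):
    """Return the items between the start marker and the unwanted ending.

    Hypothetical helper: `tail` is the number of trailing items to drop
    when END_MARKER is not in the list.
    """
    # Index of the first item ending with the start marker, or None
    start = next((i for i, x in enumerate(items) if START_RX.search(x)), None)
    if start is None:
        return []
    if END_MARKER in items:
        end = items.index(END_MARKER)  # stop right before the marker item
    else:
        end = len(items) - tail        # otherwise drop the last `tail` items
    return items[start + 1:end]


text_list2 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
text_list3 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'ttttttt tttttt', 'text_i_dont_want', 'text_i_dont_want', 'text_i_dont_want']

print(extract_wanted(text_list2))  # ['text_i_want', 'text_i_want.']
print(extract_wanted(text_list3))  # ['text_i_want', 'text_i_want.']
```

Locating the marker with items.index avoids hard-coding a second negative offset, so the number of unwanted items after the marker can vary.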
Upvotes: 1