id345678

Reputation: 107

Regex: match until the second-to-last occurrence of a pattern

I have this text_string list of strings. I want to extract all the text_i_want items as one string.

text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']

I want to match everything captured by the regex below [everything after "aaaaaa aaaaaaa', '"], up to the second-to-last occurrence of the following pattern: ', '

ans = re.compile(r'aaaaaa\s+aaaaaaa\',\s+\'(.*)', flags = re.DOTALL | re.MULTILINE)
ans_text = ans.search(str(text_string)).group(1)

returns

'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want'

So far, my regex successfully matches the start of the string but does not stop where I want it to (at the second-to-last occurrence of the pattern). I have no idea how to translate

until the second-to-last occurrence of ', '

into regex. Any help is appreciated.

Also, I want to do this with re because I have hundreds of these lists, which all have the same structure EXCEPT for the number of times 'text_i_want.' occurs.

Example list, where the solution code should also work:

text_list2 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']

UPDATED QUESTION:

Some lists have a different ending than I described. The good news is that they contain a specific string that identifies them.

text_list3 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'ttttttt tttttt', 'text_i_dont_want', 'text_i_dont_want', 'text_i_dont_want']

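    import re

    # Match the item that ends with the start marker
    rx = re.compile(r'aaaaaa\s+aaaaaaa$')
    ans = [i for (i, x) in enumerate(text_list3) if rx.search(x)]
    pattern = 'ttttttt tttttt'
    if ans:
        # Default: drop the last two items
        ans_text = text_list3[ans[0]+1:-2]
        if pattern in text_list3:
            # Lists containing the marker: drop the last three items instead
            ans_text = text_list3[ans[0]+1:-3]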

Upvotes: 1

Views: 125

Answers (1)

Wiktor Stribiżew

Reputation: 626689

You can find the item in text_string (which is a list, not a string) that ends with aaaaaa aaaaaaa, and then use slicing to take all the items from the next one up to, but not including, the last two:

text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
start = [i for (i, x) in enumerate( text_string ) if x.endswith('aaaaaa aaaaaaa')][0]
print( text_string[start+1:-2] )
# => ['text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.']

See the Python demo
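Since the question asks for a single string rather than a list, the slice can be joined afterwards. The following is a minimal sketch, not part of the original answer: the single-space separator is an assumption (use '\n' or '' if that suits your data better), and next() with a default is used so that a list without the marker does not raise an IndexError.

```python
text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']

# Index of the first item that ends with the start marker, or None if there is none
start = next((i for i, x in enumerate(text_string) if x.endswith('aaaaaa aaaaaaa')), None)

if start is not None:
    # Items after the start marker, excluding the last two
    wanted = text_string[start + 1:-2]
    # Assumption: the items are joined with a single space
    ans_text = ' '.join(wanted)
else:
    ans_text = ''

print(repr(ans_text))
# => 'text_i_want text_i_want\ntext_i_want. text_i_want.'
```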

If you prefer to check for the aaaaaa aaaaaaa ending with re, you can use

import re
text_string = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want\ntext_i_want.', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
rx = re.compile(r'aaaaaa\s+aaaaaaa$')
start = [i for (i, x) in enumerate( text_string ) if rx.search(x)]
if start:
    print( text_string[start[0]+1:-2] )

See this Python demo.
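For the updated question, where some lists contain the 'ttttttt tttttt' item, one possible approach is to cut at that item when it is present and fall back to dropping the last two items otherwise. This is only a sketch under stated assumptions: extract_wanted, START_RX, END_MARKER and the tail parameter are hypothetical names introduced here, and the marker item itself is assumed to be unwanted (use end + 1 if you want to keep it).

```python
import re

# Start marker: an item ending with "aaaaaa aaaaaaa"
START_RX = re.compile(r'aaaaaa\s+aaaaaaa$')
# End marker taken from the updated question
END_MARKER = 'ttttttt tttttt'

def extract_wanted(items, tail=2):
    """Return the items after the start marker, up to the end marker
    if present, otherwise excluding the last `tail` items."""
    start = next((i for i, x in enumerate(items) if START_RX.search(x)), None)
    if start is None:
        return []  # no start marker found
    if END_MARKER in items:
        end = items.index(END_MARKER)  # stop just before the marker item
    else:
        end = len(items) - tail        # drop the trailing unwanted items
    return items[start + 1:end]

text_list2 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'text_i_dont_want\text_i_dont_want', 'number_i_dont_want']
text_list3 = ['text_i_dont_want aaaaaa aaaaaaa', 'text_i_want', 'text_i_want.', 'ttttttt tttttt', 'text_i_dont_want', 'text_i_dont_want', 'text_i_dont_want']

print(extract_wanted(text_list2))
# => ['text_i_want', 'text_i_want.']
print(extract_wanted(text_list3))
# => ['text_i_want', 'text_i_want.']
```

Cutting at the marker's index avoids hard-coding a second negative offset, so the number of unwanted items after the marker can vary.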

Upvotes: 1
