Reputation: 557
I'm working with a string that looks something like this (I saved it from an error message):
"['This is one' 'How is two' 'Why is three'\n 'When is four'] not in index"
From this string I would like to extract the substrings into a list like this:
['This is one', 'How is two', 'Why is three', 'When is four']
What I have done so far is to extract the part between the brackets (assuming the string is named s):
start = s.index("[") + len("[")
end = s.index("]")
s = s[start:end].replace("\\n", "")
This gives me the output:
'This is one' 'How is two' 'Why is three' 'When is four'
Now I just need to put them into a list, and this is where I'm having problems. I've tried this:
s = s.split("'")
But it gave me the output:
['', 'This is one', ' ', 'How is two', ' ', 'Why is three', ' ', 'When is four', '']
I also tried:
s = s.split("'")
s = ' '.join(s).split()
Which gave me the output:
['This', 'is', 'one', 'How', 'is', 'two', 'Why', 'is', 'three', 'When', 'is', 'four']
I've also tried the same thing with .split(" "), which gave me some odd whitespace-only entries. Finally, I tried list(filter(...)), but it only removes the completely empty strings, not the strings in the list that contain just whitespace.
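For reference, the whitespace-only pieces left by split("'") can be dropped by testing each piece after stripping it. This is only a minimal sketch, assuming s already holds the extracted text shown above; the names parts and piece are illustrative, and an item that is itself only whitespace would also be dropped.

```python
# Assumed input: the text already extracted from between the brackets.
s = "'This is one' 'How is two' 'Why is three' 'When is four'"

# Splitting on the quote leaves '' and ' ' between the real items;
# piece.strip() is empty (falsy) for both, so they are filtered out.
parts = [piece for piece in s.split("'") if piece.strip()]
print(parts)
```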
Upvotes: 1
Views: 56
Reputation: 520878
One approach would be to first extract the term in square brackets, then use re.findall to find all single-quoted terms.
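import re

inp = "['This is one' 'How is two' 'Why is three'\n 'When is four'] not in index"
# DOTALL lets . also match the newline inside the brackets
srch = re.search(r'\[(.*)\]', inp, flags=re.DOTALL)
if srch:
    # Non-greedy match captures each single-quoted term separately
    matches = re.findall(r'\'(.*?)\'', srch.group(1))
    print(matches)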
Output:
['This is one', 'How is two', 'Why is three', 'When is four']
Note carefully that the call to re.search uses re.DOTALL mode. This is required because the content in square brackets actually contains a newline, and by default the dot does not match newline characters.
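To illustrate the point about re.DOTALL, the following sketch runs the same search with and without the flag on the sample string. The variable names are illustrative only. It also shows one possible alternative, a negated character class, which matches newlines without needing the flag; that variant is an addition here, not part of the approach above.

```python
import re

inp = "['This is one' 'How is two' 'Why is three'\n 'When is four'] not in index"

# Without DOTALL, . stops at the newline, so the closing ] is never reached.
without_dotall = re.search(r'\[(.*)\]', inp)

# With DOTALL, . also matches the newline and the whole bracketed part is captured.
with_dotall = re.search(r'\[(.*)\]', inp, flags=re.DOTALL)

# Alternative sketch: [^\]]* matches any character except ], newlines included.
negated_class = re.search(r'\[([^\]]*)\]', inp)

print(without_dotall)
print(re.findall(r"'(.*?)'", with_dotall.group(1)))
print(re.findall(r"'(.*?)'", negated_class.group(1)))
```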
Upvotes: 2