Reputation: 135
I've encountered a problem while trying to parse a complicated string. The string is really long and full of patterns, but let's focus only on the part I need to extract.
A substring from the huge string is:
... [span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" ...
Now I want to extract the review body, i.e. the text that follows [/span] and ends before [div. The pattern is [span class=...]title[/span] desired text [div ...], and it repeats throughout the whole string.
How exactly do I extract this specific text from the whole string and write each match on its own line?
Upvotes: 0
Views: 557
Reputation: 365717
From your comments ("im having trouble to solve, the original [, ] are <, >"), it's pretty clear that what you have is HTML.
Do not try to parse HTML with regex.
What you want here is an HTML parser. For example:
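from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(huge_string, 'html.parser')
# class is a reserved word in Python, so bs4 spells the keyword class_
for span in soup.find_all('span', class_='review-title'):
    # The review body is the text node right after the title span
    text = span.next_sibling
    print(text)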
Even if what you have is HTML escaped in some way (backslash-escaped quotes, angle brackets turned into square brackets, etc.), you still don't want to parse it with regex. In that case, at most, you might want to use a regex as the preprocessor to turn it back into HTML to feed to an HTML parser.
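As a rough sketch of that preprocessing step, assuming the escaping is exactly what the question shows (square brackets for tags, backslash-escaped quotes) and that square brackets never appear in the review text itself: the names TAG_RE, to_html, ReviewBodyParser and extract_reviews are made up for illustration, and the standard library's html.parser stands in for BeautifulSoup only so the snippet runs without extra installs.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper: matches [tag ...] or [/tag] so it can be turned back into <...>
TAG_RE = re.compile(r'\[(/?[A-Za-z][^\[\]]*)\]')

def to_html(raw):
    """Undo the escaping: \\" becomes ", [tag] becomes <tag>."""
    unquoted = raw.replace('\\"', '"')
    return TAG_RE.sub(r'<\1>', unquoted)

class ReviewBodyParser(HTMLParser):
    """Collects the text between a review-title span and the next tag."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_title = False
        self._collecting = False
        self._parts = []

    def flush(self):
        # Store whatever text was gathered since the last title span closed
        text = ''.join(self._parts).strip()
        if text:
            self.reviews.append(text)
        self._parts = []
        self._collecting = False

    def handle_starttag(self, tag, attrs):
        if self._collecting:
            self.flush()  # any new tag ends the current review body
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'review-title' in classes:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'span' and self._in_title:
            self._in_title = False
            self._collecting = True

    def handle_data(self, data):
        if self._collecting:
            self._parts.append(data)

def extract_reviews(raw):
    parser = ReviewBodyParser()
    parser.feed(to_html(raw))
    parser.close()
    if parser._collecting:
        parser.flush()  # trailing text with no tag after it
    return parser.reviews

raw = (r'[span class=\"review-title\"]Wont open[/span] I have the GS5. '
       r'[div class=\"review-link\"]more[/div] '
       r'[span class=\"review-title\"]Great[/span] Love it. '
       r'[div class=\"review-link\"]more[/div]')

for review in extract_reviews(raw):
    print(review)
```

The regex here only undoes the escaping; deciding where a review starts and ends is left to the parser, which is the division of labour suggested above.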
Upvotes: 1
Reputation: 626804
This pattern should fetch you the string, just grab the Group 1 value:
r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'
Or a more generic one that does not check for class="review-title":
r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'
Sample code at IDEONE:
import re
p = re.compile(r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b')
test_str = "[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" "
print(p.search(test_str).group(1))
Output:
I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server.
EDIT: Since the [s and ]s are in fact <s and >s, here is an updated regex and code:
import re
p = re.compile(r'<span\b[^>]*>[^<]*</span>\s*([^<]*)<div\b')
test_str = "<span class=\"review-title\">Wont open</span> I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. <div class=\"review-link\" "
print([x.group(1) for x in p.finditer(test_str)])
A more specific regex to account for the class attribute:
p = re.compile(r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b')
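To show how the class-specific pattern behaves when the string holds several reviews, here is a small Python 3 sketch. The sample string is invented for illustration (it mixes plain and backslash-escaped quotes and includes one span that is not a review title), and REVIEW_RE and extract_reviews_regex are hypothetical names.

```python
import re

# Same pattern as above, written as a plain raw string for Python 3
REVIEW_RE = re.compile(
    r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b'
)

def extract_reviews_regex(text):
    """Return the Group 1 value of every match, with surrounding whitespace removed."""
    return [m.group(1).strip() for m in REVIEW_RE.finditer(text)]

# Made-up sample: two review-title spans and one unrelated span
sample = (
    '<span class="review-title">Wont open</span> The game wont open. '
    '<div class="review-link">x</div>'
    '<span class="other">skip</span> Not a review. <div></div>'
    '<span class=\\"review-title\\">Great</span> Works fine. '
    '<div class=\\"review-link\\">'
)

# One review per line, as the question asks
for review in extract_reviews_regex(sample):
    print(review)
```

The span with class="other" is skipped because [^>]* cannot run past the end of its own tag, so the review-title check has to succeed inside that tag.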
Upvotes: 2
Reputation: 4887
It seems that you need just this regex:
(?<=\[/span\])[\s\S]*?(?=\[div)
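For what it's worth, a minimal sketch of applying that pattern in Python 3, on an invented two-review sample in the bracket form from the question (BETWEEN_RE and bodies are illustrative names). Note that it grabs the text after every [/span], not only the ones that close a review-title span, so it relies on the input containing no other spans.

```python
import re

# Lookbehind for the closing [/span], lazy match, lookahead for the next [div
BETWEEN_RE = re.compile(r'(?<=\[/span\])[\s\S]*?(?=\[div)')

# Made-up sample with two reviews
sample = (
    '[span class="review-title"]Wont open[/span] The game wont open. '
    '[div class="review-link"]x[/div]'
    '[span class="review-title"]Great[/span] Works fine. '
    '[div class="review-link"]'
)

# findall returns the whole match because the pattern has no capture groups
bodies = [m.strip() for m in BETWEEN_RE.findall(sample)]

for body in bodies:
    print(body)
```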
Upvotes: 0