Reputation: 23
I'm trying to parse a string to find all of the characters between the two delimiters <code> and </code>.
I have attempted using regular expressions, but I can't seem to understand what is going on.
My attempt:
import re
re.findall('<code>(.*?)</code>', processed_df['question'][2])
where processed_df['question'][2] is the following string (it is one continuous string; I split it across several lines for readability):
'<code>for x in finallist:\n matchinfo =
requests.get("https://api.opendota.com/api/matches/{}".format(x)).json()
["match_id"]\n print(matchinfo)\n</code>'
I have tested with this test_string:
test_string = '<code> this is a test </code>'
and it seems to work.
I have a feeling it has to do with special characters in the text between <code> and </code>, but I don't know how to fix it. Thank you for the help!
Upvotes: 0
Views: 59
Reputation: 1326
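I think the issue is the newline character \n: by default, . matches any character except a newline, so the pattern cannot span multiple lines. Pass the re.DOTALL flag so that . matches newlines too, for example: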
import re
regex = r"<code>(.*?)</code>"  # non-greedy, so each <code> block is captured separately
test_str = ("<code>for x in finallist:\\n matchinfo = \n"
" requests.get(\"https://api.opendota.com/api/matches/{}\".format(x)).json() \n"
" [\"match_id\"]\\n print(matchinfo)\\n</code>\n")
re.findall(regex, test_str, re.DOTALL)
['for x in finallist:\\n matchinfo = \n requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() \n ["match_id"]\\n print(matchinfo)\\n']
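To see why the flag matters, here is a minimal sketch that runs the same pattern with and without re.DOTALL. The sample string and variable names are made up for illustration and are not taken from the question's data frame.

```python
import re

# Hypothetical sample: two <code> blocks, the first spanning several lines.
sample = "<code>for x in items:\n    print(x)\n</code> text <code>y = 1</code>"

pattern = r"<code>(.*?)</code>"

# Without DOTALL, "." stops at a newline, so the multi-line block is skipped.
without_flag = re.findall(pattern, sample)

# With DOTALL, "." also matches "\n", so both blocks are captured.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

The non-greedy .*? keeps each block separate; a greedy .* combined with DOTALL would run from the first <code> to the last </code> in the string.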
Upvotes: 2
Reputation: 1801
You might be better off with an HTML parser than a regex:
import lxml.html
html_snippet = """
...
<p>Some stuff</p>
...
<code>for x in finallist:\n matchinfo =
requests.get("https://api.opendota.com/api/matches/{}".format(x)).json()
["match_id"]\n print(matchinfo)\n</code>
...
And some Stuff
...
another code block <br />
<code>
print('Hello world')
</code>
"""
dom = lxml.html.fromstring(html_snippet)
codes = dom.xpath('//code')
for code in codes:
    print(code.text)
>>>> for x in finallist:
>>>> matchinfo =
>>>> requests.get("https://api.opendota.com/api/matches/{}".format(x)).json()
>>>> ["match_id"]
>>>> print(matchinfo)
>>>> print('Hello world')
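If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative rather than the lxml approach above; the CodeCollector class and the sample snippet are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class CodeCollector(HTMLParser):
    """Collects the text found inside each <code>...</code> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished code blocks, in document order
        self._depth = 0    # greater than 0 while inside a <code> element
        self._parts = []   # text pieces of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # The outermost </code> closes the block: store its text.
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <code> element.
        if self._depth > 0:
            self._parts.append(data)


# Hypothetical sample input with two code blocks.
snippet = "<p>Some stuff</p><code>a = 1\nb = 2\n</code> text <code>print('Hello world')</code>"

collector = CodeCollector()
collector.feed(snippet)
collector.close()
print(collector.blocks)
```

Newlines inside the blocks need no special handling here, because the parser works on tags rather than on character patterns.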
Upvotes: 3
Reputation: 427
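The question doesn't explicitly say it needs regular expressions. With that said, I would avoid them here and simply slice the string. This works as long as the string, once stripped, starts with <code> and ends with </code>, e.g.: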
test_str = '''
<code>asldkfj
asdlkfjas
asdlkf
for i in range(asdlkf):
print("Hey")
if i == 8:
print(i)
</code>
'''
start = len('<code>')
end = len('</code>')
new_str = test_str.strip()[start:-end] # Should have everything in between <code></code>
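If there may be other text before or after the tags, a variation using str.find can locate them first. This is only a sketch: the helper name between and the sample string are assumptions for illustration, and it returns just the first block.

```python
def between(text, start_tag="<code>", end_tag="</code>"):
    """Return the text between the first start_tag and the next end_tag, or None."""
    start = text.find(start_tag)
    if start == -1:
        return None  # no opening tag found
    start += len(start_tag)
    end = text.find(end_tag, start)
    if end == -1:
        return None  # opening tag without a matching closing tag
    return text[start:end]


# Hypothetical sample with text around the code block.
sample = "intro <code>print('Hey')\n</code> outro"
print(between(sample))
```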
Upvotes: 1