Reputation: 29
I have a long string that I got from web scraping with Python. I want to get output in a form like {'XXXXXXXX':'AAAAAAAA','YYYYYYYY':'BBBBBBBB'}
and hopefully put everything in a DataFrame.
This is a sample of the very long string:
\\n display:block\\u0022\\u003E\\n div class= span_6\\u0022\\u003E\\n li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n span\\u003E1. XXXXXXXX\\/span\\u003E\\n strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n \\/li\\u003E\\n li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n span\\u003E2. YYYYYYYY\\/span\\u003E\\n strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n
#Blockquoting for clarity:
\n display:block\u0022\u003E\n
div class= span_6\u0022\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E1. XXXXXXXX\/span\u003E\n
strong class=\u0022floatright\u0022\u003EAAAAAAAA\/strong\u003E\n
\/li\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E2. YYYYYYYY\/span\u003E\n
strong class=\u0022floatright\u0022\u003EBBBBBBBB\/strong\u003E\n
I'm trying to do this:
#s = the string
pattern = "u003E\|(.*?)\|\\/strong"
substring = re.search(pattern, s).group(1)
print(substring)
but it's failing. What's the best way to do this?
Edit: Expected output is two lists:
list1 = ['XXXXXXXX','YYYYYYYY']
list2 = ['AAAAAAAA','BBBBBBBB']
Upvotes: 0
Views: 135
Reputation: 626748
You can use a solution like
import re
s = '\\n display:block\\u0022\\u003E\\n div class= span_6\\u0022\\u003E\\n li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n span\\u003E1. XXXXXXXX\\/span\\u003E\\n strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n \\/li\\u003E\\n li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n span\\u003E2. YYYYYYYY\\/span\\u003E\\n strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n'
unescaped_s = s.encode('latin-1', 'backslashreplace').decode('unicode-escape')
pattern = r">\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong"
substrings = re.findall(pattern, unescaped_s)
print(dict(substrings))
See the online Python demo. First, the string is unescaped; then the regex is applied to the unescaped version of the input string.
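To make the unescaping step concrete, here is a minimal sketch that applies the same encode/decode round trip to one short fragment of the sample. The variable names `fragment` and `unescaped` are only illustrative. Note that `\/` is not an escape sequence Python recognizes, so it is left as a backslash followed by a slash, which is why the regex matches `\\/`; recent Python versions may emit a DeprecationWarning for it.

```python
# A short fragment of the scraped string; the raw literal keeps the
# backslashes as literal characters, as in the scraped text.
fragment = r'span\u003E1. XXXXXXXX\/span\u003E\n'

# Same round trip as in the answer: latin-1 bytes, then unicode-escape
# decoding turns \u003E into ">" and \n into a real newline.
unescaped = fragment.encode('latin-1', 'backslashreplace').decode('unicode-escape')

# repr() makes the remaining backslash and the newline visible.
print(repr(unescaped))
```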
The regex is
>\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong
Details:
> - a > char
\d+ - one or more digits
\. - a dot
\s* - zero or more whitespaces
([^<>]*) - Group 1: zero or more chars other than < and >
\\/span> - \/span> text
\s* - zero or more whitespaces
[^>]*> - any zero or more chars other than > and then a > char
([^<>]*) - Group 2: zero or more chars other than < and >
\\/strong - \/strong text.
Upvotes: 2
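Since the question's edit asks for two lists rather than a dict, the pairs returned by re.findall can be split directly. This is a sketch under the assumption that the real string follows the same layout as the sample; the shortened `s` and the names `pairs`, `list1`, and `list2` are illustrative.

```python
import re

# Shortened sample in the same escaped form as the scraped string.
s = (
    r'li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n'
    r' span\u003E1. XXXXXXXX\/span\u003E\n'
    r' strong class=\u0022floatright\u0022\u003EAAAAAAAA\/strong\u003E\n'
    r' \/li\u003E\n'
    r' li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n'
    r' span\u003E2. YYYYYYYY\/span\u003E\n'
    r' strong class=\u0022floatright\u0022\u003EBBBBBBBB\/strong\u003E\n'
)

# Unescape first, then match, exactly as in the answer.
unescaped_s = s.encode('latin-1', 'backslashreplace').decode('unicode-escape')
pattern = r">\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong"

# findall returns one (group 1, group 2) tuple per match.
pairs = re.findall(pattern, unescaped_s)

# Split the tuples into the two lists from the question's edit.
list1 = [name for name, _ in pairs]
list2 = [value for _, value in pairs]

print(list1)
print(list2)
```

If pandas is installed, pd.DataFrame(pairs, columns=['name', 'value']) should turn the same pairs into a two-column DataFrame; the column names here are made up for illustration.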