nununu

Reputation: 29

How to extract substrings from a string with whitespace in Python using regex?

I have a long string that I got from web scraping with Python. I want to get output in a form like {'XXXXXXXX':'AAAAAAAA','YYYYYYYY':'BBBBBBBB'} and, ideally, put everything in a dataframe.

This is a sample of the very long string:

\\n    display:block\\u0022\\u003E\\n                                  div class= span_6\\u0022\\u003E\\n                                     li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E1. XXXXXXXX\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n       \\/li\\u003E\\n                                                        li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E2. YYYYYYYY\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n

Broken into lines for clarity:

\n display:block\u0022\u003E\n
div class= span_6\u0022\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E1. XXXXXXXX\/span\u003E\n
strong class=\u0022floatright\u0022\u003EAAAAAAAA\/strong\u003E\n
\/li\u003E\n
li class=\u0022borderbottom padleft pad20 nomargin\u0022\u003E\n
span\u003E2. YYYYYYYY\/span\u003E\n
strong class=\u0022floatright\u0022\u003EBBBBBBBB\/strong\u003E\n

I'm trying to do this:

#s = the string 
pattern = "u003E\|(.*?)\|\\/strong"
substring = re.search(pattern, s).group(1) 
print(substring)

but it's failing. What's the best way to do this?

Edit: Expected output is two lists:

list1 = ['XXXXXXXX','YYYYYYYY']
list2 = ['AAAAAAAA','BBBBBBBB']

Upvotes: 0

Views: 135

Answers (1)

Wiktor Stribiżew

Reputation: 626748

You can use a solution like this:

import re
s = '\\n    display:block\\u0022\\u003E\\n                                  div class= span_6\\u0022\\u003E\\n                                     li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E1. XXXXXXXX\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EAAAAAAAA\\/strong\\u003E\\n       \\/li\\u003E\\n                                                        li class=\\u0022borderbottom padleft pad20 nomargin\\u0022\\u003E\\n   span\\u003E2. YYYYYYYY\\/span\\u003E\\n                                strong class=\\u0022floatright\\u0022\\u003EBBBBBBBB\\/strong\\u003E\\n'
unescaped_s = s.encode('latin-1', 'backslashreplace').decode('unicode-escape')
pattern = r">\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong"
substrings = re.findall(pattern, unescaped_s)
print(dict(substrings))

First, the string is unescaped; then the regex is applied to the unescaped version of the input.
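To see what the unescaping step does on its own, here is a minimal sketch on a shortened, hypothetical fragment of the input (the `sample` value is an assumption, not the full scraped string). Note that `\/` is not a recognized escape sequence, so it survives the decoding unchanged, which is why the regex matches a literal backslash with `\\/`. Recent Python versions may emit a DeprecationWarning for such sequences, so the sketch silences it.

```python
import warnings

# Hypothetical shortened fragment; the raw string keeps the backslashes literal,
# just like the scraped text.
sample = r'span\u003E1. XXXXXXXX\/span\u003E\n'

with warnings.catch_warnings():
    # '\/' is not a valid escape; ignore the warning some Python versions raise for it.
    warnings.simplefilter("ignore", DeprecationWarning)
    unescaped = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')

# \u003E is now '>', \n is a real newline, and '\/' is left as-is.
print(repr(unescaped))
```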

The regex is

>\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong

Details:

  • > - a > char
  • \d+ - one or more digits
  • \. - a dot
  • \s* - zero or more whitespaces
  • ([^<>]*) - Group 1: zero or more chars other than < and >
  • \\/span> - \/span> text
  • \s* - zero or more whitespaces
  • [^>]*> - any zero or more chars other than > and then a > char
  • ([^<>]*) - Group 2: zero or more chars other than < and >
  • \\/strong - \/strong text.
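Since the question's edit asks for two lists rather than a dict, the pairs returned by `re.findall` can be split as in the sketch below. The input here is a shortened stand-in for the scraped string (an assumption for illustration), and the names `pairs`, `list1`, and `list2` are only illustrative.

```python
import re
import warnings

# Shortened stand-in for the scraped string, with the same escaping as in the question.
s = (r'span\u003E1. XXXXXXXX\/span\u003E\n    '
     r'strong class=\u0022floatright\u0022\u003EAAAAAAAA\/strong\u003E\n    '
     r'\/li\u003E\n    '
     r'li class=\u0022borderbottom\u0022\u003E\n   '
     r'span\u003E2. YYYYYYYY\/span\u003E\n    '
     r'strong class=\u0022floatright\u0022\u003EBBBBBBBB\/strong\u003E\n')

with warnings.catch_warnings():
    # '\/' is not a valid escape; ignore the warning some Python versions raise for it.
    warnings.simplefilter("ignore", DeprecationWarning)
    unescaped_s = s.encode('latin-1', 'backslashreplace').decode('unicode-escape')

pattern = r">\d+\.\s*([^<>]*)\\/span>\s*[^>]*>([^<>]*)\\/strong"
pairs = re.findall(pattern, unescaped_s)  # list of (group 1, group 2) tuples

# Split the pairs into the two lists asked for in the question.
list1 = [label for label, _ in pairs]
list2 = [value for _, value in pairs]

print(list1)
print(list2)
```

If pandas is available, the same `pairs` list could probably be passed straight to `pandas.DataFrame(pairs, columns=[...])` with column names of your choosing to get the dataframe mentioned in the question.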

Upvotes: 2
