Reputation: 337
I am building a web scraper to get data using Selenium and BS4. I get the html after the page has fully loaded using Selenium. I then create a BeautifulSoup object from the page_source of the page opened in Selenium. After that I start parsing the html to find specific elements on the page. I need to use regex to parse some strings. However, some non-breaking spaces (not all) are treated as '\xa0' characters. This presents a problem with Python's regex searches (which don't treat these as spaces).
For example, I have the following string:
import re
import unicodedata
testString ="JM Wing\xa0- ...\xa0Transactions of the Royal Society A\xa0..., 2008 - royalsocietypublishing.org"
I've tried several of the following solutions, per what I've found online, but none of them seem to work. (Note that the original html.page_source returns a bytes object, so I tried to use decode.)
testString = testString.replace(u'\xa0', u' ').encode('utf-8') #turns string into bytes
testString = testString.decode(encoding="utf-8",errors="ignore") #must be a bytes object
testString = unicodedata.normalize("NFKD", testString)
No matter what I try, I can't seem to get rid of the '\xa0' characters and the regex won't see these as spaces. Any idea how I might resolve this? I'd like my soup object to be in utf-8, as I'm potentially dealing with characters from multiple languages. But I really need to turn these into spaces so that I can use regex to parse strings that aren't marked up semantically in html.
EDIT: I can run the following and get the string I want, but the capturing parentheses don't appear to get just the portion I want (and so I get the "- ...\xa0" and "," before and after the string).
foundString = re.search('-.*([a-zA-Z]*),',testString)[0]
This returns "...\xa0Transactions of the Royal Society A\xa0...,". Ideally, I'd like to get only the words "Transactions...Society", which is what the parentheses should indicate. Curiously, I can only get this result with re.search(). In contrast, re.findall() just returns an empty string.
Upvotes: 0
Views: 1287
Reputation: 3441
This should work:
testString = re.sub(r'\s', ' ', testString)
\xa0 is treated as whitespace in the \s set, so I believe the above snippet should solve the problem.
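A quick demonstration on the string from the question (the variable name follows the question; the substitution itself is plain re):

```python
import re

testString = "JM Wing\xa0- ...\xa0Transactions of the Royal Society A\xa0..., 2008 - royalsocietypublishing.org"

# In Python 3, \s in a str pattern matches Unicode whitespace by default,
# and that includes the non-breaking space \xa0.
cleaned = re.sub(r'\s', ' ', testString)

print('\xa0' in cleaned)  # False: every non-breaking space is now a plain space
```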
Upvotes: 1
Reputation: 337
I sort of figured out a solution. Since the string that first comes across is html, the spaces in question are actually &nbsp; entities. So, after scraping the html and before turning it into soup, I use the following code to replace the &nbsp; and then convert it to a byte string.

html = html.replace("&nbsp;", ' ').encode('utf-8')
This seems to get rid of all instances of \xa0 thereafter.
The curious problem is that the capturing parentheses in the regex still aren't functioning, and if I use re.findall I get an empty string.
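For reference, a minimal demonstration of why the parentheses come back empty: the greedy .* consumes everything up to the comma, so the group ([a-zA-Z]*) is left matching the empty string, and re.findall (which returns group contents, not whole matches) therefore yields empty strings. The second pattern below is only one possible fix, anchored on the "..." markers; it is a sketch, not necessarily the final pattern this scraper needs.

```python
import re

# The string after the non-breaking spaces have been replaced
testString = "JM Wing - ... Transactions of the Royal Society A ..., 2008 - royalsocietypublishing.org"

# Original pattern: greedy '.*' runs to the comma, so the group captures nothing.
# m[0] is the whole match (group 0), which is why re.search "seemed" to work.
m = re.search(r'-.*([a-zA-Z]*),', testString)
print(repr(m.group(1)))  # '' -- the group captured the empty string

# One possible fix: capture lazily between the two '...' markers.
m2 = re.search(r'\.\.\.\s*(.*?)\s*\.\.\.', testString)
print(m2.group(1))  # Transactions of the Royal Society A
```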
Upvotes: 1