staypuffinpc

Reputation: 337

treat '\xa0' as a space in regex in python

I am building a web scraper to get data using Selenium and BS4. I get the html after the page has fully loaded using Selenium. I then create a BeautifulSoup object from the page_source of the page opened in Selenium. After that I start parsing the html to find specific elements on the page. I need to use regex to parse some strings. However, some (not all) non-breaking spaces come through as '\xa0' characters. This presents a problem with python's regex searches, which don't seem to treat these as spaces.

For example, I have the following string:

import re
import unicodedata
testString ="JM Wing\xa0- ...\xa0Transactions of the Royal Society A\xa0..., 2008 - royalsocietypublishing.org"

I've tried several of the following solutions, per what I've found online, but none of them seem to work. (Note that the original html.page_source returns a bytes object, so I tried to use decode.)

testString = testString.replace(u'\xa0', u' ').encode('utf-8') #turns string into bytes
testString = testString.decode(encoding="utf-8",errors="ignore") #must be a bytes object
testString = unicodedata.normalize("NFKD", testString)

No matter what I try, I can't seem to get rid of the '\xa0' characters and the regex won't see these as spaces. Any idea how I might resolve this? I'd like my soup object to be in utf-8, as I'm potentially dealing with characters from multiple languages. But I really need to turn these into spaces so that I can use regex to parse strings that aren't marked up semantically in html.

EDIT: I can run the following and get the string I want, but the capturing parentheses don't appear to get just the portion I want (and so I get the "-...\xa0" and "," before and after the string).

foundString = re.search('-.*([a-zA-Z]*),',testString)[0]

This returns "...\xa0Transactions of the Royal Society A\xa0...,". Ideally, I'd like to get only the words "Transactions...Society", which is what the parentheses should indicate. Curiously, I can only get this result with re.search(). In contrast, re.findall() just returns an empty string.

Upvotes: 0

Views: 1287

Answers (2)

ifedapo olarewaju

Reputation: 3441

This should work:

testString = re.sub(r'\s',  ' ', testString)

\xa0 is treated as whitespace by the \s class (for str patterns in Python 3), so I believe the above snippet should solve the problem.
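
As a quick check of the claim above, here is a minimal sketch using the test string from the question. It assumes Python 3 and a str (not bytes) pattern; the variable name normalized is just for illustration.

```python
import re

# Test string taken from the question.
test_string = "JM Wing\xa0- ...\xa0Transactions of the Royal Society A\xa0..., 2008 - royalsocietypublishing.org"

# With a str pattern in Python 3, \s matches Unicode whitespace,
# which includes the non-breaking space U+00A0.
normalized = re.sub(r"\s", " ", test_string)

print("\xa0" in test_string)  # the original still contains non-breaking spaces
print("\xa0" in normalized)   # the substituted copy should not
```

Note that this relies on Unicode matching: with a bytes pattern or the re.ASCII flag, \s only covers ASCII whitespace and would leave \xa0 untouched.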

Upvotes: 1

staypuffinpc

Reputation: 337

I sort of figured out a solution. Since the string that first comes across is html, the spaces in question are actually &nbsp; entities. So, after scraping the html and before turning it into soup, I use the following code to replace each &nbsp; with a regular space and then convert the result to a byte string.

html = html.replace("&nbsp;", ' ').encode('utf-8')

This seems to get rid of all instances of \xa0 thereafter.

The curious problem is that the capturing parentheses in the regex still aren't functioning, and if I use re.findall I get an empty string.
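
One likely explanation for the capture-group behaviour, sketched here rather than stated as definitive: indexing a match with [0] returns the whole match, not the group, and the greedy .* consumes everything up to the comma, so ([a-zA-Z]*) is left matching an empty string. re.findall returns the group contents when a pattern has one group, hence the empty string. The name fixed_pattern and the pattern itself are hypothetical; it assumes the title always sits between two "..." runs, as in the example string.

```python
import re

# Test string taken from the question.
test_string = "JM Wing\xa0- ...\xa0Transactions of the Royal Society A\xa0..., 2008 - royalsocietypublishing.org"

# Pattern from the question: the greedy .* takes everything before the comma,
# so the capture group can only match the empty string.
original = re.search('-.*([a-zA-Z]*),', test_string)
whole_match = original[0]  # index 0 is the entire match
group_one = original[1]    # index 1 is the first capture group

# Hypothetical replacement: skip the dash and the leading "...", then lazily
# capture up to the trailing "...,". \s also matches \xa0 for str patterns.
fixed_pattern = r'-\s*\.{3}\s*(.*?)\s*\.{3},'
title = re.search(fixed_pattern, test_string).group(1)
titles = re.findall(fixed_pattern, test_string)

print(repr(whole_match))
print(repr(group_one))
print(title)
print(titles)
```

With a pattern like this, re.search(...).group(1) and re.findall should agree, because both report what the parentheses captured.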

Upvotes: 1
