Reputation: 10071
I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:
Sapphire RX460 OC 2/4GB
Notice two groups of two whitespaces preceeding the string literal and between OC
and 2
.
Python provides trim as described in How do I trim whitespace with Python? But that won't handle the two spaces between OC
and 2
, which I need collapsed into a single space.
I've tried using normalize-space()
from XPath while extracting data with my scrapy Selector and that works but the assignment verbose with strong rightward drift:
product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()
Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.
product_title = product.css('h3')
.xpath('normalize-space((text()))')
.extract_first()
Upvotes: 14
Views: 14677
Reputation: 11524
The accepted answer is the right way to normalize the whitespace. This is an answer to your secondary question about formatting.
You also asked about how to format Python code across multiple lines without throwing an indentation error. You can do that in Python using parentheses. Here's what the example code from your question would look like formatted across several lines for readability.
product_title = (
product.css("h3")
.xpath("normalize-space((text()))")
.extract_first()
)
Note that these parentheses don't create a tuple because there is no comma. The outer parentheses are just for formatting purposes.
The multi-line code above is exactly equivalent to chaining all the method calls together on one line.
product_title = product.css("h3").xpath("normalize-space((text()))").extract_first()
Upvotes: 2
Reputation: 34677
Instead of using regex's for this, a more efficient solution is to use the join/split option, observe:
>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC 2/4GB'.split()))).timeit()
0.7263979911804199
>>> def f():
return re.sub(" +", ' ', " Sapphire RX460 OC 2/4GB").split()
>>> timeit.Timer(f).timeit()
4.163465976715088
Upvotes: 4
Reputation: 146540
You can use a function like below with regular expression to scan for continuous spaces and replace them by 1 space
import re
def clean_data(data):
return re.sub(" {2,}", " ", data.strip())
product_title = clean(product.css('h3::text').extract_first())
And then improve clean function anyway you like it
Upvotes: 0