vhs

Reputation: 10071

Normalize whitespace with Python

I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

  Sapphire RX460 OC  2/4GB

Notice the two runs of two spaces: one preceding the text and one between OC and 2.

Python provides str.strip(), as described in How do I trim whitespace with Python?, but that only removes leading and trailing whitespace. It won't handle the two spaces between OC and 2, which I need collapsed into a single space.

I've tried using normalize-space() from XPath while extracting data with my Scrapy Selector. That works, but the assignment is verbose, with strong rightward drift:

product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()

Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.

product_title = product.css('h3')
    .xpath('normalize-space((text()))')
    .extract_first()

Upvotes: 14

Views: 14677

Answers (4)

Christian Long

Reputation: 11524

The accepted answer is the right way to normalize the whitespace. This is an answer to your secondary question about formatting.

You also asked about how to format Python code across multiple lines without throwing an indentation error. You can do that in Python using parentheses. Here's what the example code from your question would look like formatted across several lines for readability.

product_title = (
    product.css("h3")
    .xpath("normalize-space((text()))")
    .extract_first()
)

Note that these parentheses don't create a tuple because there is no comma. The outer parentheses are just for formatting purposes.

The multi-line code above is exactly equivalent to chaining all the method calls together on one line.

product_title = product.css("h3").xpath("normalize-space((text()))").extract_first()
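The same formatting trick works for any chained expression, not only Scrapy selectors. Here is a minimal sketch that uses plain str methods so it runs without Scrapy; the variable names are made up for illustration.

```python
raw = "  Sapphire RX460 OC  2/4GB"

# Parentheses let the method chain span several lines without backslashes.
wrapped = (
    raw
    .strip()
    .replace("  ", " ")
    .upper()
)

# The same chain written on a single line.
one_line = raw.strip().replace("  ", " ").upper()

assert wrapped == one_line
assert isinstance(wrapped, str)  # still a str, not a tuple
```

Inside the parentheses Python ignores line breaks and indentation, which is why each call can start on its own line.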

Upvotes: 2

hd1

Reputation: 34677

Instead of using a regex for this, a more efficient solution is to use split and join. Observe:

>>> import re, timeit
>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC  2/4GB'.split()))).timeit()
0.7263979911804199

>>> def f():
        return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()

>>> timeit.Timer(f).timeit()
4.163465976715088
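If you want to repeat the measurement yourself, here is a rough sketch of a like-for-like comparison. It assumes the intent was for both versions to take the same input and return the same normalized string, so the regex version ends with strip() rather than split(). The function names are hypothetical, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

s = "  Sapphire RX460 OC  2/4GB"

def with_split_join(text=s):
    # Split on any run of whitespace, then rejoin with single spaces.
    return " ".join(text.split())

def with_regex(text=s):
    # Collapse runs of spaces, then trim the ends.
    return re.sub(r" +", " ", text).strip()

# Check that both approaches agree before timing them.
assert with_split_join() == with_regex()

t_split = timeit.timeit(with_split_join, number=100_000)
t_regex = timeit.timeit(with_regex, number=100_000)
print(f"split/join: {t_split:.3f}s  regex: {t_regex:.3f}s")
```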

Upvotes: 4

Tom Karzes

Reputation: 24052

You can use:

" ".join(s.split())

where s is your string.
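As a small illustration on the string from the question, wrapped in a helper whose name (normalize_space) is just a placeholder:

```python
def normalize_space(s):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    # split() with no arguments splits on any whitespace run and
    # discards leading and trailing whitespace.
    return " ".join(s.split())

print(normalize_space("  Sapphire RX460 OC  2/4GB"))  # Sapphire RX460 OC 2/4GB
print(normalize_space("a\t b\n\nc"))                  # a b c
```

Because split() treats tabs and newlines as whitespace too, this also flattens multi-line text into a single line.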

Upvotes: 38

Tarun Lalwani

Reputation: 146540

You can use a function like the one below, which uses a regular expression to find runs of consecutive spaces and replace them with a single space:

import re

def clean_data(data):
    return re.sub(" {2,}", " ", data.strip())

product_title = clean_data(product.css('h3::text').extract_first())

You can then extend clean_data however you like.
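One possible extension, as a sketch only: extract_first() returns None when the selector matches nothing, and scraped text often contains tabs or newlines as well as spaces. The variant below (the name clean_text is hypothetical) guards against None and collapses any whitespace run using \s+.

```python
import re

def clean_text(data):
    # Pass None through so a missing element doesn't raise AttributeError.
    if data is None:
        return None
    # \s+ matches runs of spaces, tabs, and newlines alike.
    return re.sub(r"\s+", " ", data).strip()

print(clean_text("  Sapphire RX460 OC  2/4GB"))  # Sapphire RX460 OC 2/4GB
print(clean_text(None))                          # None
```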

Upvotes: 0
