SIM

Reputation: 22440

Unable to remove spaces between scraped text

I've written a Python script to scrape some text out of a few HTML elements. The script parses the text, but the results contain runs of whitespace and newlines between the pieces. How can I fix this? Any help will be highly appreciated.

These are the HTML elements the text should be scraped from:

html="""
<div class="postal-address">
        <p>11525 23 AVE</p>


        <p>EDMONTON,
        AB
        ,
        T6J 4T3
        </p>

        <p><a rel="nofollow" href="mailto:[email protected]">[email protected]</a></p>
        <p><a rel="nofollow" href="http://www.something.org" target="_blank">Visit our Web Site</a></p>
    </div>
"""

This is the script I'm using:

from lxml.html import fromstring

root = fromstring(html)
address = [item.text for item in root.cssselect(".postal-address p")]
print(address)

The result I'm getting:

11525 23 AVE, EDMONTON,\n        AB\n        ,\n        T6J 4T3\n

Expected result:

11525 23 AVE EDMONTON, AB, T6J 4T3

I tried applying .strip() and .replace("\n","") to item.text in the line [item.text for item in root.cssselect(".postal-address p")], but it raised an error about a NoneType object.

By the way, I'd prefer a solution that doesn't use regex. Thanks in advance.

Upvotes: 1

Views: 1163

Answers (3)

Andersson

Reputation: 52665

Try the solution below and let me know if you run into any issues:

address = [" ".join(item.text.split()).replace(" ,", ",") for item in root.cssselect(".postal-address p") if item.text]

Output:

['11525 23 AVE', 'EDMONTON, AB, T6J 4T3']
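
The "if item.text" guard matters because lxml sets .text to None for an element that has no text before its first child, which is the case for the two paragraphs that only wrap a link. That is likely the source of the NoneType error. Below is a minimal sketch of the same cleanup as a helper. The name clean is hypothetical, and the raw list is hand-written to stand in for the scraped values rather than produced by lxml:

```python
def clean(text):
    """Collapse whitespace runs to single spaces and turn ' ,' into ','."""
    if text is None:  # element had no leading text, e.g. <p><a>...</a></p>
        return ""
    return " ".join(text.split()).replace(" ,", ",")


# Hand-written stand-in for [item.text for item in ...] on the question's HTML.
raw = [
    "11525 23 AVE",
    "EDMONTON,\n        AB\n        ,\n        T6J 4T3\n        ",
    None,
    None,
]

# Drop entries that end up empty, then join the rest with single spaces.
cleaned = [c for c in map(clean, raw) if c]
print(" ".join(cleaned))  # 11525 23 AVE EDMONTON, AB, T6J 4T3
```

Calling str.split() with no arguments splits on any run of whitespace, including newlines, so no regex is needed.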

Upvotes: 1

PM 2Ring

Reputation: 55479

  1. Split the source string on commas.
  2. Strip off any leading or trailing whitespace from each string in the resulting list.
  3. Join the strings using ', ' as the separator.

Like this:

src = '11525 23 AVE, EDMONTON,\n        AB\n        ,\n        T6J 4T3\n'
print(', '.join([s.strip() for s in src.split(',')]))

Output:

11525 23 AVE, EDMONTON, AB, T6J 4T3

If you already have a list of strings, this is even easier:

address = [
    '11525 23 AVE', 
    ' EDMONTON', 
    '\n        AB\n        ', 
    '\n        T6J 4T3\n'
]

print(', '.join([s.strip() for s in address]))
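
If the list comes straight from a scraper, it may also contain None or whitespace-only entries. Here is a small sketch of the same strip-and-join idea that skips those. The function name join_parts and the extra entries in the second call are assumptions for illustration:

```python
def join_parts(parts, sep=", "):
    """Strip each string, skip None and empty results, then join with sep."""
    stripped = (p.strip() for p in parts if p is not None)
    return sep.join(s for s in stripped if s)


address = [
    '11525 23 AVE',
    ' EDMONTON',
    '\n        AB\n        ',
    '\n        T6J 4T3\n',
]

print(join_parts(address))                  # 11525 23 AVE, EDMONTON, AB, T6J 4T3
print(join_parts(address + [None, '   ']))  # same output: extras are skipped
```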

Upvotes: 0

Brént Russęll

Reputation: 784

When you do .replace("\n","") I think you have to escape the backslash. This can be confusing, and without trying it I cannot tell you how many backslashes you need, but try one of these:

.replace("\\n","")
.replace("\\\n","")
.replace("\\\\n","")

What happens when you use single quotes?
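
For reference, here is a quick way to check what each of these literals actually contains, assuming standard Python string escaping. The sample string is made up for illustration:

```python
sample = "AB\n        ,"  # made-up string containing one real newline

# How many characters each literal actually holds.
lengths = {
    "newline": len("\n"),               # 1: the newline character itself
    "backslash_n": len("\\n"),          # 2: a backslash, then the letter n
    "backslash_newline": len("\\\n"),   # 2: a backslash, then a newline
    "two_backslashes_n": len("\\\\n"),  # 3: two backslashes, then n
}
print(lengths)

# Only the plain "\n" form matches the newline in the sample.
print(repr(sample.replace("\n", "")))
print(repr(sample.replace("\\n", "")))  # unchanged: no literal backslash present

# Quote style makes no difference to escape handling.
print('\n' == "\n")
```

In this check, only the unescaped "\n" removes the newline. The escaped variants look for a literal backslash, which scraped text normally does not contain, and single quotes behave the same as double quotes.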

Upvotes: 0
