Reputation: 22440
I've written a script in Python to scrape some text out of some HTML elements. The script can parse it now. However, the results look weird, with a bunch of whitespace between them. How can I fix it? Any help will be highly appreciated.
These are the HTML elements the text should be scraped from:
html="""
<div class="postal-address">
<p>11525 23 AVE</p>
<p>EDMONTON,
AB
,
T6J 4T3
</p>
<p><a rel="nofollow" href="mailto:[email protected]">[email protected]</a></p>
<p><a rel="nofollow" href="http://www.something.org" target="_blank">Visit our Web Site</a></p>
</div>
"""
This is the script I'm trying with:
from lxml.html import fromstring
root = fromstring(html)
address = [item.text for item in root.cssselect(".postal-address p")]
print(address)
Result I'm getting:
11525 23 AVE, EDMONTON,\n AB\n ,\n T6J 4T3\n
Expected result:
11525 23 AVE EDMONTON, AB, T6J 4T3
I tried to apply .strip() and .replace("\n","") in this line, [item.text for item in root.cssselect(".postal-address p")], but it threw an error about a NoneType object. By the way, I do not want any solution that relies on regex. Thanks in advance.
Upvotes: 1
Views: 1163
Reputation: 52665
Try the solution below and let me know if you run into any issues:
address = [" ".join(item.text.split()).replace(" ,", ",") for item in root.cssselect(".postal-address p") if item.text]
Output:
['11525 23 AVE', 'EDMONTON, AB, T6J 4T3']
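To see why this works without lxml in the way, here is a minimal sketch on plain strings. The texts list is an assumption about what item.text yields for the four p elements: the last two are None because their text sits inside the nested a tag, which is what caused the NoneType error. The normalize helper is a hypothetical name used only for illustration.

```python
def normalize(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines and repeated spaces collapse into single spaces.
    collapsed = " ".join(text.split())
    # Remove the space left in front of a comma that stood on its own line.
    return collapsed.replace(" ,", ",")

# Assumed values of item.text for the four <p> elements (illustrative only).
texts = ["11525 23 AVE", "EDMONTON,\n AB\n ,\n T6J 4T3\n", None, None]

# "if t" skips None (and empty strings) before any string method is called.
address = [normalize(t) for t in texts if t]
print(address)

# Joining with a space gives the single-line form from the question.
one_line = " ".join(address)
print(one_line)
```

The filter has to run before .split() or .strip() is called, which is why the condition sits at the end of the list comprehension.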
Upvotes: 1
Reputation: 55479
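You can split the string on commas, strip the whitespace from each piece, and then join the pieces using ', ' as the separator. Like this: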
src = '11525 23 AVE, EDMONTON,\n AB\n ,\n T6J 4T3\n'
print(', '.join([s.strip() for s in src.split(',')]))
output
11525 23 AVE, EDMONTON, AB, T6J 4T3
If you already have a list of strings, this is even easier:
address = [
'11525 23 AVE',
' EDMONTON',
'\n AB\n ',
'\n T6J 4T3\n'
]
print(', '.join([s.strip() for s in address]))
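If the list comes straight from the scraper, it may also contain None or whitespace-only entries. The sketch below is one possible way to guard against that; join_clean is a hypothetical helper name and the extra entries in parts are made up for illustration.

```python
def join_clean(parts, sep=", "):
    # Drop None first so .strip() is only ever called on real strings.
    stripped = (p.strip() for p in parts if p is not None)
    # Drop pieces that are empty once the whitespace is gone.
    return sep.join(p for p in stripped if p)

# Same list as above, plus a None and a blank entry (assumed inputs).
parts = ['11525 23 AVE', ' EDMONTON', '\n AB\n ', '\n T6J 4T3\n', None, '  ']

result = join_clean(parts)
print(result)
```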
Upvotes: 0
Reputation: 784
When you do .replace("\n","") I think you have to escape the backslash. This can be confusing, and without trying it I cannot tell you how many backslashes you need, but try one of these:
.replace("\\n","")
.replace("\\\n","")
.replace("\\\\n","")
What happens when you use single quotes?
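Which form matches depends on whether the text holds a real newline character or the two characters backslash and n. A quick check on made-up sample strings (not taken from the scraped page) may help decide:

```python
text = "AB\n ,"      # contains one real newline character
literal = r"AB\n ,"  # raw string: a backslash followed by the letter n

# "\n" in a normal string literal is the newline character itself.
removed_newline = text.replace("\n", "")

# "\\n" is a backslash followed by n, so it does not match a real newline...
unchanged = text.replace("\\n", "")

# ...but it does match the two-character sequence in the raw string.
removed_literal = literal.replace("\\n", "")

# Single and double quotes produce identical strings in Python.
same_quotes = ('\n' == "\n")

print(repr(removed_newline), repr(unchanged), repr(removed_literal), same_quotes)
```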
Upvotes: 0