Reputation: 196
I am scraping a site for a set of links. The links all sit in a single HTML div, separated by <br /> tags,
but when I read the URLs from that div they come back as one single string,
and I am unable to separate them into a list.
I am scraping the links with the code below:
links = soup.find('div', id='dle-content').find('div', class_='full').find(
'div', class_='full-news').find('div', class_='quote').text
This is the HTML from the site:
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
Output I get from the code above:
https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/
Output I want:
[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]
Upvotes: 0
Views: 67
Reputation: 164
Another way to achieve the desired output:
from bs4 import BeautifulSoup
html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# Keep only text nodes that are non-empty after stripping whitespace
print([i.strip() for i in soup.find("div", class_="quote").strings if i.strip()])
Output:
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']
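As a possible shortcut for the same idea, BeautifulSoup tags also expose a stripped_strings generator, which yields each text node with surrounding whitespace removed and skips nodes that are only whitespace. A minimal sketch, assuming the sample markup from the question is representative of the real page (the names html, quote and links are only illustrative):

```python
from bs4 import BeautifulSoup

# Same sample markup as in the question (assumed representative of the real page).
html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
quote = soup.find("div", class_="quote")

# stripped_strings yields each non-empty text node with whitespace trimmed.
links = list(quote.stripped_strings)
print(links)
```

This avoids the manual strip-and-filter step, at the cost of relying on every text node in the div being a link.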
Upvotes: 0
Reputation: 28630
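Split the string on whitespace. This assumes the text returned by .text still contains the newlines around each URL, as it does for the HTML in the question; a string with no whitespace at all would come back from split() as a single element. split() with no arguments never returns empty strings, so no extra filtering is needed:
output = '\nhttps://example.com/asd.html\n\nhttps://example.net/abc\n\nhttps://example.org/v/kjg/\n\n'
new_output = output.split()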
Output:
print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']
Upvotes: 0
Reputation: 2260
You could fix it with string manipulation: put a space in front of every 'http', then split on whitespace:
new_output = ' http'.join(output.split('http')).split()
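Splitting on the bare substring 'http' would also cut a URL that happens to contain 'http' somewhere in its path. A slightly stricter sketch, assuming every link starts with http:// or https:// and using only the standard re module (Python 3.7 or later is needed to split on an empty match; the names output, parts and links are illustrative):

```python
import re

output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'

# Split at every position that is immediately followed by a URL scheme.
# The lookahead matches an empty string, so no characters are consumed.
parts = re.split(r'(?=https?://)', output)

# The split at position 0 leaves an empty first element; drop empties.
links = [p for p in parts if p]
print(links)
```

A link that embeds another full URL (for example in a query string) would still be cut, so this is only a fallback for when the original HTML is not available.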
Upvotes: 0
Reputation: 20052
Try this:
from bs4 import BeautifulSoup
sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""
# find_all returns every matching div, so the result is one list of links per div
divs = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([div.get_text().split() for div in divs])
Output:
[['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']]
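The result above is a list of lists because find_all returns every matching div and the text of each one is split separately. If a single flat list is wanted, as in the question, one option is to chain the per-div results together. A minimal sketch using only the standard library, where texts stands in for the strings the divs would return (the values are made up for illustration):

```python
from itertools import chain

# Hypothetical stand-ins for the text of two matching divs.
texts = [
    "\nhttps://example.com/asd.html\n\nhttps://example.net/abc\n",
    "\nhttps://example.org/v/kjg/\n\n",
]

# Split each text on whitespace, then flatten the per-div lists into one.
links = list(chain.from_iterable(text.split() for text in texts))
print(links)
```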
Upvotes: 2