Rahul Gurujala

Reputation: 196

URLs separation with bs4 and Python

I am scraping a site for a set of links. The links all sit in a single HTML div, separated by <br /> tags, but when I try to get the URLs from that div they come back as a single string.

I am unable to separate them into a list. This is the code I use to scrape the links:

links = soup.find('div', id='dle-content').find('div', class_='full').find(
            'div', class_='full-news').find('div', class_='quote').text

This is the HTML from the site:

<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>

The output I get from the code above:

https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/

The output I want:

[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]

Upvotes: 0

Views: 67

Answers (4)

MITHU

Reputation: 164

Another way to achieve the desired output:

from bs4 import BeautifulSoup

html = """
    <div class="quote">
    <!--QuoteEBegin-->
    https://example.com/asd.html
    <br>
    https://example.net/abc
    <br>
    https://example.org/v/kjg/
    <br>
    <br>
    <!--QuoteEEnd-->
    </div>
"""

soup = BeautifulSoup(html, "html.parser")
# Skip whitespace-only strings (e.g. "\n" plus indentation) instead of only an exact "\n"
print([i.strip() for i in soup.find("div", class_="quote").strings if i.strip()])

Output:

['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

Upvotes: 0

chitown88

Reputation: 28630

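Split the string on whitespace, and then use a list comprehension to drop any empty entries:

# .text keeps the newlines around each <br>, so the scraped string contains
# whitespace between the URLs (shortened here for illustration)
output = '\nhttps://example.com/asd.html\n\nhttps://example.net/abc\n\nhttps://example.org/v/kjg/\n\n'
split_output = output.split()  # splits on any run of whitespace
new_output = [x for x in split_output if x != '']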

Output:

print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

Upvotes: 0

dzang

Reputation: 2260

If the URLs really are concatenated with nothing between them, you could fix it with string manipulation:

new_output = ' http'.join(output.split('http')).split()
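
To make the one-liner easier to follow, here is a minimal sketch of it. It assumes `output` holds the concatenated string shown in the question. The helper name `split_urls` is hypothetical and only there for illustration:

```python
def split_urls(output):
    """Separate concatenated URLs by re-inserting a space before each 'http'."""
    # split('http') drops every 'http' prefix; joining the pieces with ' http'
    # restores the prefix with a leading space, so split() can separate them.
    return ' http'.join(output.split('http')).split()


# The concatenated string from the question
output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'
new_output = split_urls(output)
print(new_output)
```

One caveat with this approach: a URL that contains `http` somewhere after its scheme (for example in a redirect parameter) would be split at that point too.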

Upvotes: 0

baduker

Reputation: 20052

Try this:

from bs4 import BeautifulSoup

sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

soup = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([i.getText().split() for i in soup])

Output:

[['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']]
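
Note that the result is a list of lists, one inner list per matching div, because `find_all` returns every match. If a single flat list is wanted, as in the question, one possible sketch is to flatten it afterwards. The names `nested` and `flat` are assumptions for illustration, and `nested` simply repeats the output above:

```python
from itertools import chain

# One inner list per div.quote, as printed above
nested = [['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']]

# chain.from_iterable concatenates the inner lists in order
flat = list(chain.from_iterable(nested))
print(flat)
```

If the page only ever has one such div, indexing the first inner list would give the same flat list.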

Upvotes: 2
