Nick Johnson

Reputation: 95

Web Scraping a List in Python Using BeautifulSoup

I am new to Python and trying to learn how to use BeautifulSoup to scrape a webpage. For starters, I was just using yahoo.com's HTML code:

view-source:https://www.yahoo.com/

I wanted to scrape the list of links that starts on line 577 and ends on line 633 of the source, get each link's URL and title, and put them in a table in Python.

So far, I have the following:

import requests  # needed for requests.get() below
from bs4 import BeautifulSoup

myURL = "http://www.yahoo.com"
myPage = requests.get(myURL)

yahoo = BeautifulSoup(myPage.content)

print(yahoo.prettify())

YahooList = yahoo.find('ul', class_="Pos(r) Miw(1000px) Pstart(9px) Lh(1.7) Reader-open_Op(0) mini-header_Op(0)")
print(YahooList)

I am unsure how to proceed from here. All the examples I am finding are for scraping tables, and I am not finding much on how to scrape a list.

Does anyone have any suggestions?

Thanks, Nick

Upvotes: 1

Views: 6804

Answers (1)

Remi Guan

Reputation: 22272

If you only need to scrape specific lines, you need to get those lines before you parse them. I'd suggest using str.splitlines() and a list slice to get them.

For example:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.yahoo.com')
>>> print('\n'.join(r.text.splitlines()[575:634]))

The output is:

<li class="D(b)">
    <a href="https://www.yahoo.com/politics/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:politics;t5:politics;cpos:9;" tabindex="1">Politics</a>
</li>

<li class="D(b)">
    <a href="https://www.yahoo.com/celebrity/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:celebrity;t5:celebrity;cpos:10;" tabindex="1">Celebrity</a>
</li>

...

<li class="D(b)">
    <a href="https://www.yahoo.com/travel/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:travel;t5:travel;cpos:22;" tabindex="1">Travel</a>
</li>

<li class="D(b)">
    <a href="https://www.yahoo.com/autos/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:autos;t5:autos;cpos:23;" tabindex="1">Autos</a>
</li>
  • r.text.splitlines() splits the HTML source code into lines and gives a list.

  • [575:634] is a list slice that gives lines 576 to 634 (list indices start at 0, and the end index is excluded). I included one extra line on each side of your range because without them the output would be:

        <a href="https://www.yahoo.com/politics/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:politics;t5:politics;cpos:9;" tabindex="1">Politics</a>
    </li>
    
    <li class="D(b)">
        <a href="https://www.yahoo.com/celebrity/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:celebrity;t5:celebrity;cpos:10;" tabindex="1">Celebrity</a>
    </li>
    
    ...
    
    <li class="D(b)">
        <a href="https://www.yahoo.com/travel/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:travel;t5:travel;cpos:22;" tabindex="1">Travel</a>
    </li>
    
    <li class="D(b)">
        <a href="https://www.yahoo.com/autos/" class="D(b) Fz(13px) C($topbarMenu) Py(3px) Td(n) Td(u):h"   data-ylk="slk:autos;t5:autos;cpos:23;" tabindex="1">Autos</a>
    

    That isn't a valid HTML code block.

  • '\n'.join() joins the list with \n and gives a single string, which is what you want.
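The three steps above can be tried without any network access. The following is only a minimal sketch: the six-line string is a made-up stand-in for r.text, and the slice [2:5] plays the role of [575:634].

```python
# A made-up stand-in for r.text: six short lines instead of the real page source.
source = "line1\nline2\nline3\nline4\nline5\nline6"

lines = source.splitlines()   # list of lines; index 0 holds line 1
chunk = lines[2:5]            # indices 2, 3 and 4, i.e. lines 3, 4 and 5
fragment = "\n".join(chunk)   # back to a single string

print(fragment)
```

The slice start is one less than the first line number you want, and the slice end equals the last line number, because the end index is excluded.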


Once we have the specific lines:

>>> soup = BeautifulSoup('\n'.join(r.text.splitlines()[575:634]), 'html.parser')
>>> for i in soup.find_all('a'):
...     print(i.get('href'))
...
https://www.yahoo.com/politics/
https://www.yahoo.com/celebrity/
https://www.yahoo.com/movies/
https://www.yahoo.com/music/
https://www.yahoo.com/tv/
https://www.yahoo.com/health/
https://www.yahoo.com/style/
https://www.yahoo.com/beauty/
https://www.yahoo.com/food/
https://www.yahoo.com/parenting/
https://www.yahoo.com/makers/
https://www.yahoo.com/tech/
https://shopping.yahoo.com/
https://www.yahoo.com/travel/
https://www.yahoo.com/autos/

soup.find_all('a') finds all the <a> tags in the string (the HTML code block) we have, and gives a list of those tags.

Then we loop over that list with a for loop and use i.get('href') to get the href attribute (the link you want) of each <a> tag.
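Since Yahoo's page changes over time, the line numbers may no longer match. Here is a small sketch of the same find_all() and get('href') steps that runs against an inline fragment instead. The fragment is a trimmed, hypothetical copy of the output shown earlier (most attributes removed), not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the output shown earlier (attributes trimmed).
html = """
<li class="D(b)">
    <a href="https://www.yahoo.com/politics/" tabindex="1">Politics</a>
</li>
<li class="D(b)">
    <a href="https://www.yahoo.com/celebrity/" tabindex="1">Celebrity</a>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):      # every <a> tag, in document order
    links.append(a.get("href"))   # the tag's href attribute, or None if absent

print(links)
```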


You can also use a list comprehension to put the result into a list, rather than print it out:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.yahoo.com')
soup = BeautifulSoup('\n'.join(r.text.splitlines()[575:634]), 'html.parser')

l = [i.get('href') for i in soup.find_all('a')]

l is the list you're looking for.


If you also want to get the titles of these links, you can use i.text. However, there's no built-in table object in Python; I think you mean a dict:

>>> d = {i.text: i.get('href') for i in soup.find_all('a')}
>>> from pprint import pprint
>>> pprint(d)
{'Autos': 'https://www.yahoo.com/autos/',
 'Beauty': 'https://www.yahoo.com/beauty/',
 'Celebrity': 'https://www.yahoo.com/celebrity/',
 'Food': 'https://www.yahoo.com/food/',
 'Health': 'https://www.yahoo.com/health/',
 'Makers': 'https://www.yahoo.com/makers/',
 'Movies': 'https://www.yahoo.com/movies/',
 'Music': 'https://www.yahoo.com/music/',
 'Parenting': 'https://www.yahoo.com/parenting/',
 'Politics': 'https://www.yahoo.com/politics/',
 'Shopping': 'https://shopping.yahoo.com/',
 'Style': 'https://www.yahoo.com/style/',
 'TV': 'https://www.yahoo.com/tv/',
 'Tech': 'https://www.yahoo.com/tech/',
 'Travel': 'https://www.yahoo.com/travel/'}
>>> d['TV']
'https://www.yahoo.com/tv/'
>>> d['Food']
'https://www.yahoo.com/food/'

So you can use {i.text: i.get('href') for i in soup.find_all('a')} to build the dict you want.

In this dict, i.text (the title) is the key, for example 'TV' and 'Food'.

i.get('href') is the value (the link), for example 'https://www.yahoo.com/tv/' and 'https://www.yahoo.com/food/'.

You can look up a value with d[key], as in my code above.
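If by "table" you meant something you can print as rows and columns, one possible approach is to turn the dict into a list of (title, url) tuples and format each row. This is only a sketch: the three entries are copied from the dict above, and the names rows, width and table_lines are my own.

```python
# Three entries copied from the dict shown above, used as sample data.
d = {
    "Food": "https://www.yahoo.com/food/",
    "TV": "https://www.yahoo.com/tv/",
    "Tech": "https://www.yahoo.com/tech/",
}

rows = sorted(d.items())                        # list of (title, url) tuples
width = max(len(title) for title, _ in rows)    # width of the title column

# Left-align each title in a column of that width, then append the URL.
table_lines = [f"{title:<{width}}  {url}" for title, url in rows]
print("\n".join(table_lines))
```

Keep in mind that a dict holds one value per key, so if two links shared the same title, only the last one would survive in d.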

Upvotes: 1
