maurobio

Reputation: 1577

Problem parsing page from Wikipedia with BeautifulSoup

I have a very simple test script that fetches an article from Wikipedia and grabs the first paragraph of text on the page (i.e. the summary).

Here it is:

from bs4 import BeautifulSoup
import urllib2

url = "https://en.wikipedia.org/wiki/Vicia_faba" 
print url
source = urllib2.urlopen(url)
soup = BeautifulSoup(source, 'lxml')
print soup
summary = soup.find('p').getText()
print summary

The summary comes back empty, although the page is fetched successfully and passed correctly to BeautifulSoup.

This looks like quite a simple problem, but I could not make any further progress. BeautifulSoup is full of tricks, but unfortunately I am not privy to many of them!

Thanks in advance for any hints or suggestions.

Upvotes: 0

Views: 80

Answers (1)

DirtyBit

Reputation: 16772

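I changed a few things in your code: it now uses urllib.request instead of urllib2, calls print() as a function, and loops over every <p> tag instead of taking only the first one.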

Python 3.x:

from bs4 import BeautifulSoup
import urllib.request

url = "https://en.wikipedia.org/wiki/Vicia_faba"
print(url)

# Fetch the page; use a separate name so the URL string is not shadowed
with urllib.request.urlopen(url) as response:
    source = response.read()

soup = BeautifulSoup(source, 'lxml')

# Print the text of every <p> tag on the page
for para_tag in soup.find_all('p'):
    print(para_tag.text)

OUTPUT:

Faba sativa Moench.

Vicia faba, also known in the culinary sense as the broad bean, fava bean, or faba bean is a species of flowering plant in the pea and bean family Fabaceae. It is of uncertain origin[1]:160 and widely cultivated as a crop for human consumption. It is also used as a cover crop, the bell bean, which has smaller beans. Varieties with smaller, harder seeds that are fed to horses or other animals are called field bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers., is a variety recognized as an accepted name.[2]

Some people suffer from favism, a hemolytic response to the consumption of broad beans, a condition linked to G6PDD. Otherwise the beans, with the outer seed coat removed, can be eaten raw or cooked. In young plants, the outer seed coat can be eaten, and in very young plants, the seed pod can be eaten.

Vicia faba is a stiffly erect plant 0.5 to 1.8 metres (1.6 to 5.9 ft) tall, with stems that are square in cross-section. The leaves are 10 to 25 centimetres (3.9 to 9.8 in) long, pinnate with 2–7 leaflets, and colored a distinct glaucous (Latin: glaucus) grey-green color. Unlike most other vetches, the leaves do not have tendrils for climbing over other vegetation.

The flowers are 1 to 2.5 centimetres (0.39 to 0.98 in) long with five petals; the standard petals are white, the wing petals are white with a black spot (true black, not deep purple or blue as is the case in many "black" colorings)[3] and the keel petals are white. Crimson-flowered broad beans also exist, which were recently saved from extinction.[4] The flowers have a strong sweet scent which is attractive to bees and other pollinators.[5]

(output continues ...)

EDIT:

You need to look at how the article's HTML is structured. The body text sits inside an outer div with the class mw-parser-output, so grab that div first and then search for the <p> tags within it.

Something like:

# The article body lives in the div with class "mw-parser-output"
container = soup.find("div", attrs={'class': 'mw-parser-output'})

# Print only the paragraphs that start the summary
for p in container.find_all("p"):
    if 'Vicia faba, ' in p.text or 'Some people suffer ' in p.text:
        print(p.text)

OUTPUT:

Vicia faba, also known in the culinary sense as the broad bean, fava bean, or faba bean is a species of flowering plant in the pea and bean family Fabaceae. It is of uncertain origin[1]:160 and widely cultivated as a crop for human consumption. It is also used as a cover crop, the bell bean, which has smaller beans. Varieties with smaller, harder seeds that are fed to horses or other animals are called field bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers., is a variety recognized as an accepted name.[2]

Some people suffer from favism, a hemolytic response to the consumption of broad beans, a condition linked to G6PDD. Otherwise the beans, with the outer seed coat removed, can be eaten raw or cooked. In young plants, the outer seed coat can be eaten, and in very young plants, the seed pod can be eaten.
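Matching on hard-coded phrases only works for this one article. A more general variant is to take the first paragraph inside the container that actually contains text. This is only a sketch: the helper name first_nonempty_paragraph and the sample markup are made up for illustration, and it assumes that the original soup.find('p') returned nothing because the first <p> on the page was empty, which seems likely but is not confirmed above. The built-in 'html.parser' is used here so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup


def first_nonempty_paragraph(html, parser="html.parser"):
    """Return the text of the first <p> in the article body that has visible text.

    Hypothetical helper for illustration; returns None if no such paragraph exists.
    """
    soup = BeautifulSoup(html, parser)
    # Assumption: the article body is the div with class "mw-parser-output".
    # Fall back to the whole document if that div is missing.
    container = soup.find("div", class_="mw-parser-output") or soup
    for p in container.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip paragraphs that are empty or whitespace only
            return text
    return None


# Made-up sample markup standing in for the fetched page source.
sample = """
<div class="mw-parser-output">
  <p class="mw-empty-elt">
  </p>
  <p><b>Vicia faba</b> is a species of flowering plant.</p>
  <p>Some people suffer from favism.</p>
</div>
"""

summary = first_nonempty_paragraph(sample)
print(summary)
```

With the real page you would pass the fetched source instead of sample, for example first_nonempty_paragraph(source, 'lxml').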

Upvotes: 2
