rishran

Reputation: 660

Extracting data from a Wikipedia page

This question might be really specific. I am trying to extract the number of employees from the Wikipedia pages of companies such as https://en.wikipedia.org/wiki/3M.

I tried the Wikipedia Python API and some regex queries, but I couldn't find anything solid that generalizes to any company (ignoring exceptions).

Also, because the table row does not have an id or a class, I cannot access the value directly. The source is as follows:

<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference"><a href="#cite_note-FY-1">[1]</a></sup></td>
</tr>

So even though I know the class of the table (infobox vcard), I couldn't figure out a way to scrape this information using BeautifulSoup.

Is there a way to extract this information? It is present in the summary table on the right at the beginning of the page.

Upvotes: 0

Views: 2727

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77454

Why reinvent the wheel?

DBpedia has this information in RDF triples.

See e.g. http://dbpedia.org/page/3M
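
As a rough sketch of how those triples could be queried, the snippet below builds a request URL for a SPARQL endpoint. Several things here are assumptions rather than anything stated above: the endpoint address https://dbpedia.org/sparql, the property name dbo:numberOfEmployees, the format parameter, and the helper name employees_query_url (hypothetical). It only constructs the URL and does not send the request.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint for DBpedia (not given in the answer above).
SPARQL_ENDPOINT = "https://dbpedia.org/sparql"


def employees_query_url(resource):
    """Build a GET URL asking for the employee count of a DBpedia resource.

    `resource` is the last path segment of the DBpedia page, e.g. "3M".
    The property dbo:numberOfEmployees is an assumption about the ontology.
    """
    query = (
        "PREFIX dbo: <http://dbpedia.org/ontology/> "
        "SELECT ?employees WHERE { "
        f"<http://dbpedia.org/resource/{resource}> dbo:numberOfEmployees ?employees "
        "}"
    )
    # The "format" parameter is assumed to select JSON results.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{SPARQL_ENDPOINT}?{urlencode(params)}"


print(employees_query_url("3M"))
```

The resulting URL could then be fetched with requests.get, and the response inspected to see whether the property is actually populated for a given company.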

Upvotes: 0

larsks

Reputation: 311397

Using lxml.etree instead of BeautifulSoup, you can get what you want with an XPath expression:

>>> from lxml import etree
>>> import requests
>>> r = requests.get('https://en.wikipedia.org/wiki/3M')
>>> doc = etree.fromstring(r.text)
>>> e = doc.xpath('//table[@class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td')
>>> e[0].text
'89,800 (2015)'

Let's take a closer look at that expression:

//table[@class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td

That says:

Find all table elements whose class attribute is exactly infobox vcard. Inside those elements, look for tr elements that have a child th element with a child div element whose text is "Number of employees". Inside that tr element, select the td elements. Indexing the result with e[0] then picks the first matching td, and .text returns the text that precedes its first child element (here, the footnote sup).
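
The same label-based lookup (find the row whose header matches, then read its cell) can also be sketched with only the standard library's html.parser, in case lxml is not available. This is a minimal illustration run against the row quoted in the question, not a tested scraper for live pages. The names InfoboxRowParser, infobox_value, and ROW are hypothetical, and the sketch does not check that the row sits inside the infobox table.

```python
from html.parser import HTMLParser


class InfoboxRowParser(HTMLParser):
    """Collect the text of the <td> in the row whose <th> matches a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.current = None   # "th", "td", or None: the cell being read
        self.skip_depth = 0   # > 0 while inside a <sup> footnote marker
        self.header = []      # text pieces of the current row's <th>
        self.cell = []        # text pieces of the current row's <td>
        self.value = None     # set once the first matching row closes

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.header, self.cell = [], []
        elif tag in ("th", "td"):
            self.current = tag
        elif tag == "sup":
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.current = None
        elif tag == "sup" and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "tr" and self.value is None:
            if "".join(self.header).strip() == self.label:
                self.value = "".join(self.cell).strip()

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore footnote markers such as "[1]"
        if self.current == "th":
            self.header.append(data)
        elif self.current == "td":
            self.cell.append(data)


def infobox_value(html, label):
    """Return the cell text for the first row labelled `label`, or None."""
    parser = InfoboxRowParser(label)
    parser.feed(html)
    parser.close()
    return parser.value


# The table row quoted in the question.
ROW = """<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference"><a href="#cite_note-FY-1">[1]</a></sup></td>
</tr>"""

print(infobox_value(ROW, "Number of employees"))
```

Because the match is on the stripped header text rather than on the exact element nesting, this variant does not depend on the div inside the th. Nested tables or rows with several td cells would need extra handling.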

Upvotes: 2
