Reputation: 660
This question might be really specific. I am trying to extract the number of employees from the Wikipedia pages of companies such as https://en.wikipedia.org/wiki/3M.
I tried using the Wikipedia Python API and some regex queries. However, I couldn't find anything solid that would generalize to any company (ignoring exceptions).
Also, because the table row does not have an id or a class I cannot directly access the value. Following is the source:
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference"><a href="#cite_note-FY-1">[1]</a></sup></td>
</tr>
So even though I know the class of the table (infobox vcard), I couldn't figure out a way to scrape this information using BeautifulSoup.
Is there a way to extract this information? It is present in the summary table on the right at the beginning of the page.
Upvotes: 0
Views: 2727
Reputation: 77454
Why reinvent the wheel? DBpedia has this information in RDF triples.
See e.g. http://dbpedia.org/page/3M
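As a rough sketch of how those triples could be queried, the snippet below builds a SPARQL request for the public DBpedia endpoint using only the standard library. The endpoint URL, the property name dbo:numberOfEmployees, and the helper names (build_query, build_url, fetch_employees) are assumptions for illustration, not something stated in this answer; check the DBpedia page for the property a given company actually uses.

```python
import json
import urllib.parse
import urllib.request

# Assumed public SPARQL endpoint for DBpedia.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query(resource):
    """Return a SPARQL query for the employee count of a DBpedia resource.

    The property dbo:numberOfEmployees is an assumption; some resources
    may expose the value under a different property or not at all.
    """
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?employees WHERE {\n"
        f"  <http://dbpedia.org/resource/{resource}> dbo:numberOfEmployees ?employees .\n"
        "}"
    )


def build_url(resource):
    """Encode the query as a GET request asking for JSON results."""
    params = urllib.parse.urlencode(
        {
            "query": build_query(resource),
            "format": "application/sparql-results+json",
        }
    )
    return f"{ENDPOINT}?{params}"


def fetch_employees(resource):
    """Run the query over the network and return the bound values as strings."""
    with urllib.request.urlopen(build_url(resource), timeout=30) as resp:
        data = json.load(resp)
    return [b["employees"]["value"] for b in data["results"]["bindings"]]


# Example (performs a network request, so it is left commented out):
# print(fetch_employees("3M"))
```

The query building is kept separate from the network call so the request can be inspected before it is sent.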
Upvotes: 0
Reputation: 311397
Using lxml.etree instead of BeautifulSoup, you can get what you want with an XPath expression:
>>> from lxml import etree
>>> import requests
>>> r = requests.get('https://en.wikipedia.org/wiki/3M')
>>> doc = etree.fromstring(r.text)
>>> e = doc.xpath('//table[@class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td')
>>> e[0].text
'89,800 (2015)'
Let's take a closer look at that expression:
//table[@class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td
That says:
Find all table elements that have the attribute class set to "infobox vcard", and inside those elements look for tr elements that have a child th element that has a child div element that contains the text "Number of employees", and inside that tr element, select the td element (e[0] above picks the first match).
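To try the expression without hitting the network, here is a minimal sketch that runs a slightly looser variant of it against the table row quoted in the question. It assumes lxml.html as a more forgiving parser for real-world HTML, matches the class with contains(), and uses //tr so that an intervening tbody element would not break the match; those choices and the helper name employees_from_infobox are illustrative assumptions, not part of the original answer.

```python
from lxml import html

# The row from the question, wrapped in a table with the infobox class.
SAMPLE = """
<table class="infobox vcard">
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference"><a href="#cite_note-FY-1">[1]</a></sup></td>
</tr>
</table>
"""


def employees_from_infobox(page_source):
    """Return the text of the 'Number of employees' cell, or None if absent."""
    doc = html.fromstring(page_source)
    cells = doc.xpath(
        '//table[contains(@class, "infobox")]'
        '//tr[th[contains(normalize-space(.), "Number of employees")]]/td'
    )
    if not cells or cells[0].text is None:
        return None
    # .text is only the text before the first child element (the <sup>),
    # so the footnote marker "[1]" is not included.
    return cells[0].text.strip()


result = employees_from_infobox(SAMPLE)
print(result)  # 89,800 (2015)
```

For a live page, the same function could be given the body of a requests.get(...) response; returning None lets the caller handle pages whose infobox has no such row.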
Upvotes: 2