Reputation:
I have this script made in Python 3:
response = simple_get("https://en.wikipedia.org/wiki/Mathematics")
result = {}
result["url"] = url
if response is not None:
html = BeautifulSoup(response, 'html.parser')
title = html.select("#firstHeading")[0].text
As you can see I can get the title from the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." to the contents table...
Upvotes: 21
Views: 13241
Reputation: 44
I use this: Via 'idx' I can determine which paragraph I want to read.
from from bs4 import BeautifulSoup
import requests
res = requests.get("https://de.wikipedia.org/wiki/Pferde")
soup = BeautifulSoup(res.text, 'html.parser')
for idx, item in enumerate(soup.find_all("p")):
if idx == 1:
break
print(item.text)
Upvotes: -1
Reputation: 2411
To get a proper way using function, you can just get JSON API offered by Wikipedia :
from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads
def getJSON(page):
params = urlencode({
'format': 'json',
'action': 'parse',
'prop': 'text',
'redirects' : 'true',
'page': page})
API = "https://en.wikipedia.org/w/api.php"
response = urlopen(API + "?" + params)
return response.read().decode('utf-8')
def getRawPage(page):
parsed = loads(getJSON(page))
try:
title = parsed['parse']['title']
content = parsed['parse']['text']['*']
return title, content
except KeyError:
# The page doesn't exist
return None, None
title, content = getRawPage("Mathematics")
You can then parse it with any library you want to extract what you need :)
Upvotes: 3
Reputation: 50328
What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it:
Probably the easiest way in your case is to include the parameter action=render
in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This will give you just the content HTML, and nothing else.
Alternatively, you can also obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.
The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you'd like to have a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this:
(To also get the page content with the same request, use prop=langlinks|categories|text
.)
There are several Python libraries for using the MediaWiki API that can automate some of nitty gritty details of using it, although the feature set they support can vary. That said, just using the API directly from your code without a library in between is perfectly possible, too.
Upvotes: 3
Reputation: 473813
There is a much, much more easy way to get information from wikipedia - Wikipedia API.
There is this Python wrapper, which allows you to do it in a few lines only with zero HTML-parsing:
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')
page = wiki_wiki.page('Mathematics')
print(page.summary)
Prints:
Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)
And, in general, try to avoid screen-scraping if there's a direct API available.
Upvotes: 37
Reputation: 22440
You can get the desired output using lxml
library like following.
import requests
from lxml.html import fromstring
url = "https://en.wikipedia.org/wiki/Mathematics"
res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)
Using BeautifulSoup
:
from bs4 import BeautifulSoup
import requests
res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all("p"):
if item.text.startswith("The history"):break
print(item.text)
Upvotes: 7
Reputation: 84465
Use the library wikipedia
import wikipedia
#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)
Upvotes: 19
Reputation: 28565
select the <p>
tag. There are 52 elements. Not sure if you want the whole thing, but you can iterate through those tags to store it as you may. I just chose to print each of them to show the output.
import bs4
import requests
response = requests.get("https://en.wikipedia.org/wiki/Mathematics")
if response is not None:
html = bs4.BeautifulSoup(response.text, 'html.parser')
title = html.select("#firstHeading")[0].text
paragraphs = html.select("p")
for para in paragraphs:
print (para.text)
# just grab the text up to contents as stated in question
intro = '\n'.join([ para.text for para in paragraphs[0:5]])
print (intro)
Upvotes: 20