user10798111

How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?

I have this script made in Python 3:

url = "https://en.wikipedia.org/wiki/Mathematics"
response = simple_get(url)  # simple_get is defined elsewhere (not shown)
result = {}
result["url"] = url
if response is not None:
    html = BeautifulSoup(response, 'html.parser')
    title = html.select("#firstHeading")[0].text

As you can see, I can get the title of the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." up to the table of contents.

Upvotes: 21

Views: 13241

Answers (7)

northman23

Reputation: 44

I use this. Via 'idx' I can choose which paragraph I want to read (here the second one, at index 1).

from bs4 import BeautifulSoup
import requests

res = requests.get("https://de.wikipedia.org/wiki/Pferde")
soup = BeautifulSoup(res.text, 'html.parser')
for idx, item in enumerate(soup.find_all("p")):
    if idx == 1:
        break
print(item.text)

Upvotes: -1

LaSul

Reputation: 2411

For a cleaner approach wrapped in functions, you can use the JSON API offered by Wikipedia:

from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads


def getJSON(page):
    params = urlencode({
        'format': 'json',
        'action': 'parse',
        'prop': 'text',
        'redirects' : 'true',
        'page': page})
    API = "https://en.wikipedia.org/w/api.php"
    response = urlopen(API + "?" + params)
    return response.read().decode('utf-8')


def getRawPage(page):
    parsed = loads(getJSON(page))
    try:
        title = parsed['parse']['title']
        content = parsed['parse']['text']['*']
        return title, content
    except KeyError:
        # The page doesn't exist
        return None, None

title, content = getRawPage("Mathematics")


You can then parse it with any library you want to extract what you need :)
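
As one possible way to do that parsing step without any extra dependency, here is a minimal sketch built on the standard library's html.parser. The names LeadParagraphs and lead_text are made up for illustration, and treating "everything in a <p> before the first <h2>" as the introduction is an assumption about Wikipedia's markup, not something the API promises.

```python
from html.parser import HTMLParser


class LeadParagraphs(HTMLParser):
    """Collect the text of every <p> that appears before the first <h2>.

    Hypothetical helper: assumes the introduction ends at the first <h2>.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph strings
        self._chunks = None    # text pieces of the <p> being read, or None
        self._skip = False     # True while inside <style> or <script>
        self._done = False     # set once the first <h2> is seen

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip = True
        elif tag == "h2":
            self._done = True
        elif tag == "p" and not self._done:
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip = False
        elif tag == "p" and self._chunks is not None:
            text = "".join(self._chunks).strip()
            if text:  # drop empty placeholder paragraphs
                self.paragraphs.append(text)
            self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None and not self._skip:
            self._chunks.append(data)


def lead_text(html):
    """Return the introduction paragraphs joined by newlines."""
    parser = LeadParagraphs()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```

With the functions above, something like print(lead_text(content)) after the getRawPage call should print the introduction. Footnote markers such as [1] stay in the text, so strip them separately if they get in the way.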

Upvotes: 3

Ilmari Karonen

Reputation: 50328

What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it:

The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you'd like to have a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this:

https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories

(To also get the page content with the same request, use prop=langlinks|categories|text.)

There are several Python libraries for using the MediaWiki API that can automate some of the nitty-gritty details of using it, although the feature set they support can vary. That said, just using the API directly from your code, without a library in between, is perfectly possible too.
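
To illustrate what using the API directly could look like, here is a rough sketch that builds the request URL shown earlier (with format=json instead of xml) and reads the two lists out of a decoded response. The function names parse_url and summarize are hypothetical, and the response layout (names stored under the '*' key) is an assumption based on the API's default legacy JSON format.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def parse_url(page, props=("langlinks", "categories")):
    """Build an action=parse URL; several props are joined with '|'."""
    query = urlencode({
        "format": "json",
        "action": "parse",
        "page": page,
        "prop": "|".join(props),
    })
    return API + "?" + query


def summarize(payload):
    """Pull language codes and category names out of a decoded response.

    Assumes the default (legacy) JSON layout, where names sit under '*'.
    """
    parsed = payload.get("parse", {})
    langs = [link["lang"] for link in parsed.get("langlinks", [])]
    cats = [cat["*"] for cat in parsed.get("categories", [])]
    return langs, cats
```

Fetch the URL with whatever HTTP client you prefer, decode the body with json.loads, and pass the resulting dict to summarize.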

Upvotes: 3

alecxe

Reputation: 473813

There is a much easier way to get information from Wikipedia: the Wikipedia API.

There is this Python wrapper, which lets you do it in just a few lines with zero HTML parsing:

import wikipediaapi

wiki_wiki = wikipediaapi.Wikipedia('en')

page = wiki_wiki.page('Mathematics')
print(page.summary)

Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)

And, in general, try to avoid screen-scraping if there's a direct API available.
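
For a sense of what a direct API call for the introduction might look like without a wrapper, here is a small sketch against the MediaWiki query action with prop=extracts, which returns plain text. The helper names intro_url and intro_from are invented for this example, and the response shape it reads (pages keyed by page ID, each with an "extract" field) is an assumption about the default JSON format.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def intro_url(title):
    """Build a URL asking for the plain-text introduction of one page."""
    query = urlencode({
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the part before the first section
        "explaintext": 1,    # plain text instead of HTML
        "redirects": 1,      # follow redirects to the target page
        "titles": title,
    })
    return API + "?" + query


def intro_from(payload):
    """Return the first extract found in a decoded response, or None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None
```

Missing pages come back without an "extract" field, which is why intro_from falls through to None in that case. When you actually send the request, it is worth setting a descriptive User-Agent header, as Wikimedia asks API clients to do.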

Upvotes: 37

SIM

Reputation: 22440

You can get the desired output using the lxml library, like this:

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Mathematics"

res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)

Using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all("p"):
    if item.text.startswith("The history"):
        break
    print(item.text)

Upvotes: 7

QHarr

Reputation: 84465

Use the wikipedia library:

import wikipedia
#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)

Upvotes: 19

chitown88

Reputation: 28565

Select the <p> tags. There are 52 of them. I'm not sure whether you want the whole thing, but you can iterate through those tags and store the text however you like. I just chose to print each of them to show the output.

import bs4
import requests


response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response.ok:  # requests.get never returns None, so check the HTTP status instead
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)

Upvotes: 20
