thebeancounter

Reputation: 4839

Python: retrieve text from multiple random Wikipedia pages

I am using Python 2.7 with the wikipedia package to retrieve the text of multiple random Wikipedia pages, as explained in the docs.

I use the following code:

import wikipedia

def get_random_pages_summary(pages=0):
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

text = get_random_pages_summary(50)

and get the following error:

  File "/home/user/.local/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Priuralsky" may refer to:
Priuralsky District
Priuralsky (rural locality)

What I am trying to do is get the text from random Wikipedia pages, and I need it to be plain text, without any markup.

I assume the problem is that a random title can match more than one page when it is looked up. When I use the same code to fetch a single Wikipedia page, it works well.

Thanks

Upvotes: 1

Views: 4789

Answers (2)

tell k

Reputation: 615

According to the documentation (http://wikipedia.readthedocs.io/en/latest/quickstart.html), the error carries the list of candidate pages, so you need to look up each candidate again.

import wikipedia

try:
    wikipedia.summary("Priuralsky")
except wikipedia.exceptions.DisambiguationError as e:
    # e.options holds the titles the ambiguous name may refer to.
    for page_name in e.options:
        print(page_name)
        print(wikipedia.page(page_name).summary)

You can improve your code like this.

import wikipedia

def get_page_summaries(page_name):
    # Return [title, summary] pairs: one for a normal page, or one per
    # candidate when the title is ambiguous.
    try:
        return [[page_name, wikipedia.page(page_name).summary]]
    except wikipedia.exceptions.DisambiguationError as e:
        return [[p, wikipedia.page(p).summary] for p in e.options]

def get_random_pages_summary(pages=0):
    ret = []
    page_names = [wikipedia.random(1) for i in range(pages)]
    for p in page_names:
        ret.extend(get_page_summaries(p))
    return ret

text = get_random_pages_summary(50)
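
One caveat with expanding e.options: an option may itself be ambiguous, or may not resolve to a page at all, in which case the second lookup raises as well. Below is a minimal sketch, assuming such options should simply be skipped. The helper summaries_for_options and its parameters are hypothetical names, not part of the wikipedia package; the fetch function and the exception types to skip are passed in by the caller.

```python
def summaries_for_options(options, fetch_summary, skip=(Exception,)):
    """Return [title, summary] pairs for the options that load cleanly.

    options       -- iterable of candidate titles (e.g. e.options)
    fetch_summary -- callable mapping a title to its summary text
    skip          -- exception types that mean "ignore this option"
    """
    results = []
    for title in options:
        try:
            results.append([title, fetch_summary(title)])
        except skip:
            # This candidate could not be loaded; move on to the next one.
            continue
    return results
```

With the wikipedia package this could be called as summaries_for_options(e.options, wikipedia.summary, skip=(wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError)).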

Upvotes: 2

Banana

Reputation: 824

Since you want random articles and are using a Wikipedia API wrapper (rather than pulling the HTML directly with other tools), my suggestion is to catch the DisambiguationError and pick another random article when that happens.

import wikipedia

def random_page():
    # Title of one random page.
    title = wikipedia.random(1)
    try:
        result = wikipedia.page(title).summary
    except wikipedia.exceptions.DisambiguationError:
        # Ambiguous title: draw another random page instead.
        result = random_page()
    return result
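
To collect a fixed number of summaries with this retry idea, a loop with an attempt limit avoids unbounded recursion if many draws in a row fail. This is only a sketch under stated assumptions: collect_summaries, random_wikipedia_summaries, and the default limit of 10 attempts per requested page are hypothetical choices, not part of the wikipedia package. The wikipedia calls it relies on (wikipedia.random, wikipedia.summary, and the two exception classes) are the package's own.

```python
def collect_summaries(count, random_title, fetch_summary,
                      skip=(Exception,), max_attempts=None):
    """Draw random titles until `count` summaries are collected.

    random_title  -- callable returning one random title
    fetch_summary -- callable mapping a title to its summary text
    skip          -- exception types that trigger a redraw
    max_attempts  -- upper bound on draws (assumed default: 10 * count)
    """
    if max_attempts is None:
        max_attempts = 10 * count
    results = []
    attempts = 0
    while len(results) < count and attempts < max_attempts:
        attempts += 1
        title = random_title()
        try:
            results.append([title, fetch_summary(title)])
        except skip:
            # Ambiguous or missing page: discard it and draw again.
            continue
    return results


def random_wikipedia_summaries(count):
    # Imported here so collect_summaries stays usable without the package.
    import wikipedia
    errors = (wikipedia.exceptions.DisambiguationError,
              wikipedia.exceptions.PageError)
    return collect_summaries(count, wikipedia.random, wikipedia.summary,
                             skip=errors)
```

Calling random_wikipedia_summaries(50) would then return up to 50 [title, summary] pairs; it can return fewer if the attempt limit is reached first.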

Upvotes: 4
