Yash Agrawal

Reputation: 464

Crawl a Wikipedia page in Python and store it in a .csv file

This is my Python script:

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon'
source_code = requests.get(url)
print(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
print("hello")
f = open('pokemons.csv', 'w')
for table2 in soup.findAll('table'):
    print("yash")
    for i in table2.findAll('tbody'):
        print("here")
        for link in i.findAll('tr'):
            for x in link.findAll('td'):
                for y in x.findAll('a'):
                    z = y.get('href')
                    print(z)
                    #f.write(link)
f.close()

All I want to do is crawl all the Pokémon names from this wiki page: https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon. The problem is that I can't get into the specific table where all the Pokémon names are stored. In the code above I loop over every table and try to access its "tbody" tag so that I can reach the "tr" tags inside it, but it isn't working that way. What is my mistake?

Upvotes: 0

Views: 1058

Answers (1)

Padraic Cunningham

Reputation: 180441

The tbody is added by the browser, so it is not in the actual source returned by requests; your code can never find anything by searching for it.
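As a quick check of that point, here is a minimal sketch that parses a small hand-written table (the raw_html string is a made-up stand-in for what requests returns, not the real Wikipedia markup) with the same "html.parser" backend. It should find rows but no tbody:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML that requests would return;
# note that the markup contains no <tbody> tag.
raw_html = """
<table class="wikitable sortable">
<tr><th>Name</th></tr>
<tr><td><a href="/wiki/Bulbasaur" title="Bulbasaur">Bulbasaur</a></td></tr>
</table>
"""

soup = BeautifulSoup(raw_html, "html.parser")
table = soup.find("table")

# html.parser keeps the markup as written, so no tbody is inserted
tbody_count = len(table.find_all("tbody"))
row_count = len(table.find_all("tr"))

print("tbody tags:", tbody_count)
print("tr tags:", row_count)
```

Searching for the tr tags directly on the table, rather than going through tbody, is what makes the rows reachable.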

All you need to do is get every anchor with a title attribute in each row:

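with open('pokemons.csv', 'w', encoding='utf-8') as f:
    table = soup.select_one("table.wikitable.sortable")
    # every anchor with a title attribute inside a table cell
    for a in table.select("tr td a[title]"):
        # Python 3: the file handles the UTF-8 encoding, so write the text directly
        f.write(a.text + "\n")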

That gives you 761 names, one per line.

If you were to use find_all and find, it would be like:

# get all rows (assumes Python 3, with table and f defined as in the snippet above)
for tr in table.find_all("tr"):
    # see if there is an anchor inside with a title attribute
    a = tr.find("a", title=True)
    # if there is, write its text
    if a:
        f.write(a.text + "\n")
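Since the goal is a .csv file, the same selector can also be paired with the standard csv module. The following is only a sketch under assumptions: sample_html is invented markup standing in for the downloaded page, and extract_names and write_names are hypothetical helper names, not part of BeautifulSoup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page would come from requests.get(url).text
sample_html = """
<table class="wikitable sortable">
<tr><th>Name</th></tr>
<tr><td><a href="/wiki/Bulbasaur" title="Bulbasaur">Bulbasaur</a></td></tr>
<tr><td><a href="/wiki/Flab%C3%A9b%C3%A9" title="Flabébé">Flabébé</a></td></tr>
<tr><td><a href="#cite_note-1">[1]</a></td></tr>
</table>
"""


def extract_names(html):
    """Return the text of every titled anchor in the first sortable wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable.sortable")
    if table is None:
        # No matching table in the markup
        return []
    return [a.get_text(strip=True) for a in table.select("tr td a[title]")]


def write_names(names, fileobj):
    """Write a header row followed by one name per row."""
    writer = csv.writer(fileobj)
    writer.writerow(["name"])
    for name in names:
        writer.writerow([name])


names = extract_names(sample_html)

# An in-memory buffer keeps the sketch free of side effects
buffer = io.StringIO()
write_names(names, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open('pokemons.csv', 'w', newline='', encoding='utf-8') to write_names. The footnote anchor in the sample has no title attribute, so the a[title] selector skips it.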

Upvotes: 1
