CreamStat

Reputation: 2185

Empty CSV in web scraping - Python

I am trying to create a CSV for the tables that appear in each link. This is the link: http://www.admision.unmsm.edu.pe/admisionsabado/A.html

On that page there are 36 links, so 36 CSV files should be generated. When I run my code, the 36 CSV files are created, but they are all empty. My code is below:

import csv
import urllib2
from bs4 import BeautifulSoup

# Collect the href of every link in each row of the index page.
first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html").read()
soup = BeautifulSoup(first)
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

# Strip the leading "." from each relative href.
l = []
for t in w:
    l.append(t.replace(".", "", 1))

def record(part):
    url = "http://www.admision.unmsm.edu.pe/admisionsabado".format(part)
    u = urllib2.urlopen(url)
    try:
        html = u.read()
    finally:
        u.close()
    soup = BeautifulSoup(html)

    # Collect the text of the links that point to the result pages.
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)

    t = len(c) / 2
    part = part[:-6]
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            u = urllib2.urlopen(url)
            try:
                html = u.read()
            finally:
                u.close()
            soup = BeautifulSoup(html)
            # Write the first six columns of every data row.
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)

With this loop, I run the function above to create the CSV for each link:

for n in l:
    record(n)

EDIT: Following alecxe's advice, I changed the code, and it now works, but only for the first two links; after that I get HTTP Error 404: Not Found. I checked the directory and only two CSV files were created correctly. (A sketch for skipping the pages that return 404 is shown after the code.)

Here's the code:

import csv
import urllib2
from bs4 import BeautifulSoup



def record(part):
    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado".format(part)))
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[1:]:
            c.append(b.text)

    t = (len(links)) / 2
    part = part[:-6]
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            soup = BeautifulSoup(urllib2.urlopen(url))
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)


soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html"))
links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

for link in links:
    record(link)
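
To keep the run from stopping at the first missing page, a minimal sketch (assuming the 404s simply mean that some of the numbered pages do not exist and can be skipped) would wrap each urlopen call in a try/except for urllib2.HTTPError:

import urllib2
from bs4 import BeautifulSoup

def fetch(url):
    # Return the parsed page, or None when the server answers with an HTTP error.
    try:
        return BeautifulSoup(urllib2.urlopen(url))
    except urllib2.HTTPError as e:
        print "Skipping {}: {}".format(url, e)
        return None

Inside the writing loop, a page that comes back as None would then be skipped with continue instead of aborting the whole run.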

Upvotes: 1

Views: 217

Answers (1)

alecxe

Reputation: 474231

soup.find_all('center') finds nothing, so c stays empty, t ends up being 0, and the writing loop never runs; that is why every CSV comes out empty.

Replace:

c=[]
for n in soup.find_all('center'):
    for b in n.find_all('a')[2:]:
        c.append(b.text)

with:

c = [link.text for link in soup.find('table').find_all('a')[2:]]

Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:

soup = BeautifulSoup(urllib2.urlopen(url))

Also, since there is only one link per row, you can simplify the way you get the list of links. Instead of:

w=[]
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

do this:

links = [tr.a["href"] for tr in soup.find_all('tr')]
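
Putting these pieces together, the whole script could look roughly like this (just a sketch; the URL pattern, the [2:] slice and the division by 2 come straight from your code and are not verified against the live site):

import csv
import urllib2
from bs4 import BeautifulSoup

BASE = "http://www.admision.unmsm.edu.pe/admisionsabado"

def record(part):
    # part is expected to end in "0.html", as the original [:-6] assumes.
    soup = BeautifulSoup(urllib2.urlopen(BASE + part))
    page_links = [a.text for a in soup.find('table').find_all('a')[2:]]

    pages = len(page_links) / 2
    part = part[:-6]
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(pages):
            url = "{}{}{}.html".format(BASE, part, i)
            page = BeautifulSoup(urllib2.urlopen(url))
            for tr in page.find_all('tr')[1:]:
                tds = tr.find_all('td')
                writer.writerow([td.text.encode('utf-8') for td in tds[:6]])

soup = BeautifulSoup(urllib2.urlopen(BASE + "/A.html"))
links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

for link in links:
    record(link)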

Also, pay attention to how you are naming variables and code formatting. See:

Upvotes: 1
