Reputation: 2185
I am trying to create a CSV file for each of the tables that appear in each link on this page: http://www.admision.unmsm.edu.pe/admisionsabado/A.html
The page contains 36 links, so 36 CSV files should be generated. When I run my code, the 36 CSV files are created, but they are all empty. My code is below:
import csv
import urllib2
from bs4 import BeautifulSoup

# Collect the href of every link that sits inside a table row of the index page
first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html").read()
soup = BeautifulSoup(first)
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

# Strip the leading "." from each relative href
l = []
for t in w:
    l.append(t.replace(".", "", 1))
def record(part):
    url = "http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)
    u = urllib2.urlopen(url)
    try:
        html = u.read()
    finally:
        u.close()
    soup = BeautifulSoup(html)

    # Collect the anchor texts from the <center> block, skipping the first two
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)
    t = len(c) / 2

    # Derive the CSV file name from the link path
    part = part[:-6]
    name = part.replace("/", "")
    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            u = urllib2.urlopen(url)
            try:
                html = u.read()
            finally:
                u.close()
            soup = BeautifulSoup(html)
            # Write the first six cells of every row except the header
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)
With this for loop, I run the function above to create a CSV per link:
for n in l:
    record(n)
EDIT: Following alecxe's advice, I changed the code, and it works OK for only the first two links. Moreover, I get the message HTTP Error 404: Not Found, and when I check the directory, only two CSV files have been created correctly (see the sketch after the code below for one way to locate the failing URL).
Here's the code:
import csv
import urllib2
from bs4 import BeautifulSoup

def record(part):
    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)))

    # Collect the anchor texts from the <center> block, skipping the first one
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[1:]:
            c.append(b.text)
    t = (len(links)) / 2

    part = part[:-6]
    name = part.replace("/", "")
    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            soup = BeautifulSoup(urllib2.urlopen(url))
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)

soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html"))
links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

for link in links:
    record(link)
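To see exactly which request is returning the 404, one option (not in the original code; the fetch helper below is hypothetical) is to catch urllib2.HTTPError around each urlopen call and print the offending URL:

import urllib2

def fetch(url):
    # Hypothetical helper, not in the original code: return the page body,
    # or None after reporting which URL failed with an HTTP error.
    try:
        return urllib2.urlopen(url).read()
    except urllib2.HTTPError as e:
        print "HTTP Error {} for {}".format(e.code, url)
        return None

A page index past the last real page would then be reported and skipped instead of killing the whole run.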
Upvotes: 1
Views: 217
Reputation: 474231
soup.find_all('center') finds nothing.
Replace:
c = []
for n in soup.find_all('center'):
    for b in n.find_all('a')[2:]:
        c.append(b.text)
with:
c = [link.text for link in soup.find('table').find_all('a')[2:]]
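To illustrate what that comprehension returns, here is a small self-contained check against made-up markup (the HTML below is invented for the example, not taken from the real page):

from bs4 import BeautifulSoup

html = """
<table><tr><td>
  <a href="./prev.html">Prev</a> <a href="./next.html">Next</a>
  <a href="./a0.html">Page 0</a> <a href="./a1.html">Page 1</a>
</td></tr></table>
"""
soup = BeautifulSoup(html)
print [link.text for link in soup.find('table').find_all('a')[2:]]
# -> [u'Page 0', u'Page 1']  (the first two anchors are skipped)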
Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:
soup = BeautifulSoup(urllib2.urlopen(url))
Also, since each row contains only one link, you can simplify the way you are getting the list of links. Instead of:
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])
do this:
links = [tr.a["href"] for tr in soup.find_all('tr')]
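One caveat worth noting (my addition, not from the answer): if any table row lacks an <a> tag, tr.a is None and tr.a["href"] raises a TypeError. A guarded variant of the same comprehension:

links = [tr.a["href"] for tr in soup.find_all('tr') if tr.a is not None]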
Also, pay attention to how you are naming variables, and to your code formatting.
Upvotes: 1