Reputation: 43
First off, the HTML row looks like this:
<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>
I would show the real HTML, but I am sorry to say I don't know how to format it as a block here. Feels shame.
Using BeautifulSoup (Python) or any other recommended screen-scraping/parsing method, I would like to convert about 1200 .htm files in the same directory into CSV format. This will eventually go into an SQL database. Each directory represents a year, and I plan to do at least 5 years.
I have been goofing around with glob as the best way to do this, based on some advice. This is what I have so far, and I am stuck.
import glob
from BeautifulSoup import BeautifulSoup
for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
    # these files go from pl020001.htm to pl021230.htm sequentially
    soup = BeautifulSoup(open(filename["r"]))
    for row in soup.findAll("tr", attrs={ "class" : "evenColor" })
I realize this is ugly, but it's my first time attempting anything like this. This one problem has taken me months to get to this point, after realizing that I don't have to manually go through thousands of files copying and pasting into Excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close, and I need to know what to do next to make those CSV files. Please help, or my monitor finally gets hammer punched.
Upvotes: 1
Views: 6868
Reputation: 71939
You need to import the csv module by adding import csv to the top of your file.
Then you'll need to create a CSV writer outside your loop over the rows, like so:
writer = csv.writer(open("%s.csv" % filename, "wb"))
Then you need to actually pull the data out of the cells of each HTML row inside your loop, similar to
values = [''.join(td.findAll(text=True)).strip() for td in row.findAll("td")]
writer.writerow(values)
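For readers on Python 3, the CSV part of the steps above might look roughly like this. It is only a sketch under assumptions: the helper names csv_path_for and write_rows are made up for illustration, the rows are assumed to be already extracted as lists of strings, and the file is opened in text mode with newline="" (the Python 3 counterpart of the "wb" mode used above).

```python
import csv
import os


def csv_path_for(html_path):
    """Hypothetical helper: map pl020001.htm to pl020001.csv in the same directory."""
    base, _ext = os.path.splitext(html_path)
    return base + ".csv"


def write_rows(rows, csv_path):
    """Write an iterable of rows (each a list of strings) to csv_path."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)  # created once, outside the row loop
        for row in rows:
            writer.writerow(row)
```

Inside the glob loop this would be called once per file, for example write_rows(rows, csv_path_for(filename)), where rows is whatever list of cell values you pulled out of that file.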
Upvotes: 4
Reputation: 834
You don't really explain why you are stuck - what's not working exactly?
The following line may well be your problem:
soup = BeautifulSoup(open(filename["r"]))
It looks to me like this should be:
soup = BeautifulSoup(open(filename, "r"))
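The reason the original line fails is that filename["r"] tries to index the string with another string, which raises a TypeError before open is ever called. A small Python 3 sketch of the error and of the corrected call follows; read_html is a hypothetical helper name, and the utf-8 encoding with errors="replace" is an assumption about the files, not something stated in the question.

```python
def read_html(filename):
    """Hypothetical helper: return the contents of one .htm file as text."""
    # The mode is a second argument to open(), not an index into filename.
    # The encoding is an assumption; adjust it to match the real files.
    with open(filename, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


filename = "pl020001.htm"
try:
    filename["r"]  # same mistake as open(filename["r"])
except TypeError as exc:
    print("TypeError:", exc)
```

Using a with block also closes each file promptly, which is worth having when looping over more than a thousand files.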
The following line:
for row in soup.findAll("tr", attrs={ "class" : "evenColor" })
looks like it will only pick out even rows (assuming your even rows have the class 'evenColor' and odd rows have 'oddColor'). Assuming you want all rows with a class of either evenColor or oddColor, you can use a regular expression to match the class value (this also needs import re at the top of your file):
for row in soup.findAll("tr", attrs={ "class" : re.compile(r"evenColor|oddColor") })
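To see what that pattern accepts, here is a small standard-library sketch. As far as I know, BeautifulSoup applies a compiled pattern as a search, so the unanchored pattern also accepts longer class values that merely contain one of the names. The anchored variant and the class names evenColorBold and header are assumptions added for illustration.

```python
import re

# Pattern from the answer: matches either class name anywhere in the value.
row_class = re.compile(r"evenColor|oddColor")

# Stricter variant (an assumption, not from the answer): whole value only.
exact_row_class = re.compile(r"^(?:evenColor|oddColor)$")

for value in ["evenColor", "oddColor", "evenColorBold", "header"]:
    # Print whether each pattern accepts the class value.
    print(value, bool(row_class.search(value)), bool(exact_row_class.search(value)))
```

If the files only ever use the two class names, both patterns behave the same. The anchored one just guards against similar-looking classes.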
Upvotes: 4
Reputation: 172249
That looks fine, and BeautifulSoup is useful for this (although I personally tend to use lxml). You should be able to take the data you get and make a CSV file out of it using the csv module without any obvious problems...
I think you need to actually tell us what the problem is. "It still doesn't work" is not a problem description.
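For the "take the data you get" step, here is one dependency-free way to sketch it in Python 3. Note that it uses the standard library's html.parser instead of BeautifulSoup or lxml, purely so the example runs without third-party packages. RowExtractor and extract_rows are hypothetical names, and the sketch assumes the rows are well formed (every td is closed, no nested tables).

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect cell text from <tr> elements whose class is in wanted_classes."""

    def __init__(self, wanted_classes=("evenColor", "oddColor")):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            # Start collecting only if the row carries a wanted class.
            self._row = [] if self.wanted.intersection(classes) else None
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    """Return a list of rows (lists of cell strings) found in html_text."""
    parser = RowExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Each list returned by extract_rows can be handed straight to csv.writer's writerow. For messy real-world HTML, BeautifulSoup or lxml will be more forgiving than this sketch.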
Upvotes: 2