Isak

Reputation: 545

Scraping URLs from HTML and saving to CSV using BeautifulSoup

I'm trying to save all hyperlinked urls in an online forum in a CSV file, for a research project.

When I print the scraping result it seems to be working fine, in the sense that it prints all the URLs I want, but I'm unable to write these to separate rows in the CSV.

I'm clearly doing something wrong, but I don't know what! So any help will be greatly appreciated.

Here's the code I've written:

import urllib2
from bs4 import BeautifulSoup
import csv
import re

soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())

urls = []

for url in soup.find_all('a', href=re.compile('viewthread.php')):
        print url['href']

csvfile = open('Ss141.csv', 'wb')
writer = csv.writer(csvfile)

for url in zip(urls):
        writer.writerow([url])

csvfile.close()

Upvotes: 0

Views: 1710

Answers (1)

Martijn Pieters

Reputation: 1121386

You do need to add your matches to the urls list:

for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
    urls.append(url['href'])

and you don't need to use zip() here.
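To see why: called with a single sequence, zip() just wraps each element in a 1-tuple, which adds nothing here (the sample list is hypothetical):

```python
urls = ['viewthread.php?tid=1', 'viewthread.php?tid=2']

# zip() with one argument produces 1-tuples, not the bare strings
print(list(zip(urls)))  # [('viewthread.php?tid=1',), ('viewthread.php?tid=2',)]
```

Iterating over the list directly gives you the strings themselves, which is what writerow() needs.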

Best just write your urls as you find them, instead of collecting them in a list first:

soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())

with open('Ss141.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for url in soup.find_all('a', href=re.compile('viewthread.php')):
        writer.writerow([url['href']])

The with statement will close the file object for you when the block is done.
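For reference, the same extract-and-write loop can be sketched with only the standard library (Python 3's html.parser and csv modules); the inline HTML snippet below is a stand-in for the real forum page, not its actual markup:

```python
import csv
import io
import re
from html.parser import HTMLParser

class ThreadLinkParser(HTMLParser):
    """Collects href values of <a> tags whose URL matches viewthread.php."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if re.search('viewthread.php', href):
                self.hrefs.append(href)

# Stand-in HTML; in practice this would be the downloaded page body.
html = (
    '<a href="viewthread.php?tid=1">Thread 1</a>'
    '<a href="viewthread.php?tid=2">Thread 2</a>'
    '<a href="index.php">Home</a>'
)

parser = ThreadLinkParser()
parser.feed(html)

# Write one URL per row; io.StringIO stands in for the CSV file on disk.
buf = io.StringIO()
writer = csv.writer(buf)
for href in parser.hrefs:
    writer.writerow([href])

print(buf.getvalue())
```

The structure mirrors the accepted answer: filter anchors by a viewthread.php pattern, then write each matching href as its own CSV row.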

Upvotes: 1
