RRR

Reputation: 41

How do I access URLs in an Excel file and scrape information stored in those links using Beautiful Soup?

I am trying to access a set of URLs stored one per row in a file, "ctp_output.csv", scrape the corresponding information from each link, and store it all in a text file. Currently I am only able to extract information by providing a single link directly, as below. I require some guidance.

import csv
import urllib2
from bs4 import BeautifulSoup
url = "http://www.thedrum.com/news/2015/07/29/mankind-must-get-ahead-technical-development-states-phds-mark-holden-following"

# Fetch the page, parse it, and write the text of every <p> tag to a file
soup = BeautifulSoup(urllib2.urlopen(url))
with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')

Upvotes: 0

Views: 2156

Answers (2)

cs95

Reputation: 402263

The next step is to open the csv file and then loop over each line, extracting information for each link. You can do that like this:

import csv

with open('ctp_output.csv', 'rb') as f:
    reader = csv.reader(f)
    for line in reader:
        url = line[0]  # assuming the url is in the first column
        # scraping code from the question goes here
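For instance, here is a self-contained Python 3 sketch of that loop. The sample CSV content and `io.StringIO` stand in for the real `ctp_output.csv` (swap in `open('ctp_output.csv')` when running against the actual file); the per-URL scraping code from the question would go inside the loop.

```python
import csv
import io

# Hypothetical sample standing in for ctp_output.csv:
# one link per row, url in the first column.
sample_csv = "http://example.com/page1\nhttp://example.com/page2\n"

urls = []
with io.StringIO(sample_csv) as f:  # use open('ctp_output.csv') on real data
    reader = csv.reader(f)
    for line in reader:
        urls.append(line[0])  # first column holds the url
        # ... fetch and parse this url with BeautifulSoup here

print(urls)
```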

Upvotes: 1

smaug

Reputation: 936

You can import the csv into a pandas DataFrame using pandas.read_csv(). Note that DataFrame.iterrows() yields (index, row) pairs, so unpack both when iterating:

for index, row in data_frame_name.iterrows():
    url = row['url']  # assuming the column is named 'url'
    # ... use the url to get the information like you did in the question
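As a runnable sketch of that iteration: the DataFrame below is a hypothetical stand-in for `pandas.read_csv('ctp_output.csv')`, and the column name `'url'` is an assumption. The BeautifulSoup fetching from the question would go inside the loop.

```python
import pandas as pd

# Hypothetical frame standing in for pd.read_csv('ctp_output.csv');
# the column name 'url' is an assumption.
df = pd.DataFrame({"url": ["http://example.com/page1",
                           "http://example.com/page2"]})

urls = []
for index, row in df.iterrows():  # iterrows yields (index, Series) pairs
    urls.append(row["url"])
    # ... fetch and parse row["url"] with BeautifulSoup here, as in the question

print(urls)
```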

Upvotes: 0
