RRR

Reputation: 41

How do I access URLs in an Excel file and scrape information stored in those links using Beautiful Soup?

I am trying to access a set of URLs stored one per row in a file, "ctp_output.csv", scrape the corresponding information from each link, and store it all in a text file. Currently I am only able to extract information by providing a single link directly, as below. I require some guidance.

import csv
import urllib2
from bs4 import BeautifulSoup
url = "http://www.thedrum.com/news/2015/07/29/mankind-must-get-ahead-technical-development-states-phds-mark-holden-following"

# Fetch the page, parse it, and write the text of every <p> tag to a file
soup = BeautifulSoup(urllib2.urlopen(url))
with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')

Upvotes: 0

Views: 2156

Answers (2)

cs95

Reputation: 402263

The next step is to open the csv file and then loop over each line, extracting information for each link. You can do that like this:

import csv

with open('ctp_output.csv', 'rb') as f:
    reader = csv.reader(f)
    for line in reader:
        url = line[0]  # assuming the url is in the first column
        # scraping code from the question goes here
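For instance, here is a self-contained Python 3 sketch of that loop. The sample CSV content and `io.StringIO` stand in for the real `ctp_output.csv` (swap in `open('ctp_output.csv')` when running against the actual file); the per-URL scraping code from the question would go inside the loop.

```python
import csv
import io

# Hypothetical sample standing in for ctp_output.csv:
# one link per row, url in the first column.
sample_csv = "http://example.com/page1\nhttp://example.com/page2\n"

urls = []
with io.StringIO(sample_csv) as f:  # use open('ctp_output.csv') on real data
    reader = csv.reader(f)
    for line in reader:
        urls.append(line[0])  # first column holds the url
        # ... fetch and parse this url with BeautifulSoup here

print(urls)
```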

Upvotes: 1

smaug

Reputation: 936

You can import the csv into a pandas DataFrame using pandas.read_csv(). Note that DataFrame.iterrows() yields (index, row) pairs, so unpack both when iterating:

for index, row in data_frame_name.iterrows():
    url = row['url']  # assuming the column is named 'url'
    # ... use the url to get the information like you did in the question
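As a runnable sketch of that iteration: the DataFrame below is a hypothetical stand-in for `pandas.read_csv('ctp_output.csv')`, and the column name `'url'` is an assumption. The BeautifulSoup fetching from the question would go inside the loop.

```python
import pandas as pd

# Hypothetical frame standing in for pd.read_csv('ctp_output.csv');
# the column name 'url' is an assumption.
df = pd.DataFrame({"url": ["http://example.com/page1",
                           "http://example.com/page2"]})

urls = []
for index, row in df.iterrows():  # iterrows yields (index, Series) pairs
    urls.append(row["url"])
    # ... fetch and parse row["url"] with BeautifulSoup here, as in the question

print(urls)
```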

Upvotes: 0
