user2592038

Reputation: 375

Load URLs from a text file using Python

I put 200 URLs in a text file called url.txt, like this:

url_1
url_2
url_3
....
url_n

And I want to go through all of them in Python to get the content of each URL's page (the text). What is the simplest way to go through each URL from this text file? Scrapy? Or should I just write another script?

import urllib
from bs4 import BeautifulSoup as BS

html = urllib.urlopen('url').read()

soup = BS(html)

print soup.find_all('div', {'class': 'drkgry'})[1].get_text()
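
Something like this minimal sketch is what I have in mind (untested; it assumes one URL per line in url.txt, as above, and Python 2 like the snippet):

import urllib
from bs4 import BeautifulSoup as BS

with open('url.txt') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        # same per-URL fetch as above, just inside the loop
        html = urllib.urlopen(url).read()
        soup = BS(html)
        for div in soup.find_all('div', {'class': 'drkgry'}):
            print div.get_text()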

Upvotes: 0

Views: 1493

Answers (2)

R. Max

Reputation: 6710

Scrapy might be overkill for this task unless you want to crawl really fast (due to its async nature), follow links, extract many fields, etc.

A spider for this would look like:

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'myspider'

    def start_requests(self):
        # read one URL per line and schedule a request for each
        with open('urls.txt') as fp:
            for line in fp:
                yield Request(line.strip(), callback=self.parse_website)

    def parse_website(self, response):
        # pull out the text of the target div with an XPath selector
        hxs = HtmlXPathSelector(response)
        print hxs.select('//div[@class="drkgry"]/text()').extract()

You can skip creating a full project. Save it as myspider.py and run scrapy runspider myspider.py with the urls.txt file in the same directory.
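
If you would rather save the extracted text than print it, one variation (this sketch assumes a newer Scrapy release that accepts plain dicts as items and has the response.xpath shortcut) is to yield from the callback and let the feed exporter write the output:

    def parse_website(self, response):
        # yield a dict per page instead of printing (newer Scrapy API)
        yield {
            'url': response.url,
            'text': response.xpath('//div[@class="drkgry"]//text()').extract(),
        }

Then run it as scrapy runspider myspider.py -o pages.json to get a JSON file of the results.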

Upvotes: 1

Brionius

Reputation: 14098

This seems pretty simple - is this what you're looking for?

import urllib2

with open('MyFileOfURLs.txt', 'r') as f:
    urls = []
    for url in f:
        urls.append(url.strip())

html = {}
for url in urls:
    urlFile = urllib2.urlopen(url)
    html[url] = urlFile.read()
    urlFile.close()

print html
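
If you're on Python 3, urllib2 was merged into urllib.request; an equivalent sketch (same hypothetical MyFileOfURLs.txt) would be:

from urllib.request import urlopen

with open('MyFileOfURLs.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

html = {}
for url in urls:
    # read() returns bytes in Python 3, so decode to get text
    with urlopen(url) as urlFile:
        html[url] = urlFile.read().decode('utf-8', errors='replace')

print(html)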

Upvotes: 2
