Reputation: 375
I put 200 URLs in a text file called url.txt, like this:
url_1
url_2
url_3
....
url_n
And I want to go through all of them in Python to get the content of each URL's page (the text). What is the simplest way to go through each URL from this text file? Scrapy? Or just write another script?
import urllib
from bs4 import BeautifulSoup as BS

html = urllib.urlopen('url').read()  # 'url' stands in for one actual URL
soup = BS(html)
print soup.find_all('div', {'class': 'drkgry'})[1].get_text()
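In other words, I want something roughly like this loop, an untested sketch that reads url.txt line by line and reuses the same BeautifulSoup call from above:

import urllib
from bs4 import BeautifulSoup as BS

# Rough sketch: fetch every URL listed in url.txt and print the div text
with open('url.txt') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        html = urllib.urlopen(url).read()
        soup = BS(html)
        print soup.find_all('div', {'class': 'drkgry'})[1].get_text()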
Upvotes: 0
Views: 1493
Reputation: 6710
Scrapy might be overkill for this task, unless you want to crawl really fast (thanks to its async nature), follow links, extract many fields, etc.
A spider for this would look something like:
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'

    def start_requests(self):
        # One request per line of urls.txt
        with open('urls.txt') as fp:
            for line in fp:
                yield Request(line.strip(), callback=self.parse_website)

    def parse_website(self, response):
        # Print the text of every 'drkgry' div on the page
        hxs = HtmlXPathSelector(response)
        print hxs.select('//div[@class="drkgry"]/text()').extract()
You can skip creating a full project. Save it as myspider.py and run scrapy runspider myspider.py, with the urls.txt file in the same directory.
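If you are on a newer Scrapy release (roughly 1.0 or later, which is an assumption about your setup), the same idea can be written with scrapy.Spider and response.xpath, and yielding dicts lets you export the results with scrapy runspider myspider.py -o output.json instead of printing:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # One request per non-empty line of urls.txt
        with open('urls.txt') as fp:
            for line in fp:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse_website)

    def parse_website(self, response):
        # Yield a dict so the extracted text can be exported with -o
        yield {
            'url': response.url,
            'text': response.xpath('//div[@class="drkgry"]/text()').extract(),
        }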
Upvotes: 1
Reputation: 14098
This seems pretty simple - is this what you're looking for?
import urllib2

# Read the URLs (one per line) into a list
with open('MyFileOfURLs.txt', 'r') as f:
    urls = []
    for url in f:
        urls.append(url.strip())

# Fetch each page and store its HTML keyed by URL
html = {}
for url in urls:
    urlFile = urllib2.urlopen(url)
    html[url] = urlFile.read()
    urlFile.close()

print html
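If you then want the div text rather than the raw HTML, you could feed each stored page into BeautifulSoup the same way as in your snippet; a sketch, assuming each page really does have at least two 'drkgry' divs:

from bs4 import BeautifulSoup as BS

# Sketch: pull the text of the second 'drkgry' div out of each stored page
for url, page in html.items():
    soup = BS(page)
    divs = soup.find_all('div', {'class': 'drkgry'})
    if len(divs) > 1:
        print url, divs[1].get_text()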
Upvotes: 2