Reputation: 15
I'm trying to open a txt file with an http link on each line, and then have Python go to each link, find a specific image, and print out a direct link to that image for each page listed in the txt file.
But I have no idea what I'm doing (I started Python a few days ago).
Here's my current code, which does not work:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
txt = open('links.txt').read().splitlines()
page = urlopen(txt)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
Update 1:
Ok, here's what I need a little more specifically. I have a script which prints out a lot of links into a txt file, each link on its own line, e.g.
http://link.com/1
http://link.com/2
etc
etc
What I'm trying to accomplish at the moment is something that opens that text file containing those links, runs the regex I already posted against each page (link.com/1 etc.), and then prints the image links it finds into another text file, which should look something like
http://link.com/1/image.jpg
http://link.com/2/image.jpg
etc.
After that I don't need any help, as I already have a Python script which will download the images from that txt file.
Update 2: Basically, what I need is this script:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
but instead of looking for a specific url in the url variable, it should crawl all the urls in a text file I specify, then print out the results.
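In other words, roughly something like this (just a sketch of what I mean; links.txt and imagelinks.txt are placeholder names):

from urllib2 import urlopen
import re

# one URL per line, skipping blank lines
with open('links.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with open('imagelinks.txt', 'w') as out:
    for url in urls:
        html = urlopen(url).read()
        # same regex as above, run against each page in turn
        image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
        for link in image_links:
            out.write(link + '\n')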
Upvotes: 0
Views: 1275
Reputation: 8144
I would suggest you use a Scrapy spider.
Here is an example:
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider
def NextURL():
    # generator: yield one URL per line of the file, so .next() works below
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()
class YourScrapingSpider(XMLFeedSpider):
    name = "imagespider"
    allowed_domains = []
    url = NextURL()
    start_urls = []

    def start_requests(self):
        start_url = self.url.next()
        request = Request(start_url, dont_filter=True)
        yield request

    def parse(self, response):
        # yield an item for the pipeline, then queue the next URL from the file
        scraped_item = Item()
        yield scraped_item

        next_url = self.url.next()
        yield Request(next_url)
I am creating a spider which will read the URLs from the file, make the requests, and download the images.
For this we have to use ImagesPipeline.
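A minimal sketch of that wiring, assuming the contrib-era module path used above (newer Scrapy versions moved it to scrapy.pipelines.images, and older ones take ITEM_PIPELINES as a list; ImageItem is a name I've made up):

# settings.py -- enable the built-in image-downloading pipeline
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/image/store'  # directory where images get saved

# items.py -- ImagesPipeline expects these two fields on each item
from scrapy.item import Item, Field

class ImageItem(Item):
    image_urls = Field()  # set by your spider: list of image URLs to fetch
    images = Field()      # filled in by the pipeline after downloading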
It will be difficult in the starting stage, but I would suggest you learn about Scrapy. Scrapy is a web crawling framework in Python.
Update:
import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup
class MyOpener(urllib.FancyURLopener):
    # spoof a browser user agent so sites don't reject the request
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    for tag in soup.findAll('img'):
        print (tag)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)

if __name__ == "__main__":
    main()
Output:
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"
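If you only want the direct links rather than the whole tags, the same loop can pull out each tag's src attribute instead. A small variant of process() (image_sources is just an illustrative name, reusing MyOpener and BeautifulSoup from the example above):

def image_sources(url):
    # returns the src attributes as a list, ready to be
    # written to a file as in Update 2 below
    myopener = MyOpener()
    page = myopener.open(url)
    soup = BeautifulSoup(page.read())
    page.close()
    return [tag['src'] for tag in soup.findAll('img') if tag.get('src')]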
Update 2:
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')
Upvotes: 1