Reputation: 15
I'm trying to open a txt file with an http link on each line, and then have Python go to each link, find a specific image, and print out a direct link to that image for each page listed in the txt file.
But I have no idea what I'm doing (I started Python a few days ago).
Here's my current code, which does not work:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
txt = open('links.txt').read().splitlines()
page = urlopen(txt)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
Update 1:
Ok, here's what I need a little more specifically. I have a script which prints out a lot of links into a txt file, each link on its own line, e.g.
http://link.com/1
http://link.com/2
etc
etc
What I'm trying to accomplish at the moment is something that opens that text file containing those links, runs the regex I already posted against each page (link.com/1 etc.), and then prints the image links it finds into another text file, which should look something like
http://link.com/1/image.jpg
http://link.com/2/image.jpg
etc.
After that I don't need any help, as I already have a Python script which will download the images from that txt file.
Update 2: Basically, what I need is this script:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
but instead of looking for a specific url in the url variable, it should crawl all the urls in a text file I specify, then print out the results.
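In other words, roughly something like this (just a sketch of what I mean; links.txt and imagelinks.txt are placeholder names):

from urllib2 import urlopen
import re

# one URL per line, skipping blank lines
with open('links.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with open('imagelinks.txt', 'w') as out:
    for url in urls:
        html = urlopen(url).read()
        # same regex as above, run against each page in turn
        image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
        for link in image_links:
            out.write(link + '\n')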
Upvotes: 0
Views: 1275
Reputation: 8144
I would suggest you use a Scrapy spider.
Here is an example:
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider
def NextURL():
    # generator: yield one URL per line of the file, so .next() works below
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()
class YourScrapingSpider(XMLFeedSpider):
    name = "imagespider"
    allowed_domains = []
    url = NextURL()
    start_urls = []

    def start_requests(self):
        start_url = self.url.next()
        request = Request(start_url, dont_filter=True)
        yield request

    def parse(self, response):
        # yield an item for the pipeline, then queue the next URL from the file
        scraped_item = Item()
        yield scraped_item

        next_url = self.url.next()
        yield Request(next_url)
I am creating a spider which will read the URLs from the file, make the requests, and download the images.
For this we have to use ImagesPipeline.
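A minimal sketch of that wiring, assuming the contrib-era module path used above (newer Scrapy versions moved it to scrapy.pipelines.images, and older ones take ITEM_PIPELINES as a list; ImageItem is a name I've made up):

# settings.py -- enable the built-in image-downloading pipeline
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/image/store'  # directory where images get saved

# items.py -- ImagesPipeline expects these two fields on each item
from scrapy.item import Item, Field

class ImageItem(Item):
    image_urls = Field()  # set by your spider: list of image URLs to fetch
    images = Field()      # filled in by the pipeline after downloading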
It will be difficult in the starting stage, but I would suggest you learn about Scrapy. Scrapy is a web crawling framework in Python.
Update:
import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup
class MyOpener(urllib.FancyURLopener):
    # spoof a browser user agent so sites don't reject the request
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    for tag in soup.findAll('img'):
        print (tag)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)

if __name__ == "__main__":
    main()
Output:
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"
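If you only want the direct links rather than the whole tags, the same loop can pull out each tag's src attribute instead. A small variant of process() (image_sources is just an illustrative name, reusing MyOpener and BeautifulSoup from the example above):

def image_sources(url):
    # returns the src attributes as a list, ready to be
    # written to a file as in Update 2 below
    myopener = MyOpener()
    page = myopener.open(url)
    soup = BeautifulSoup(page.read())
    page.close()
    return [tag['src'] for tag in soup.findAll('img') if tag.get('src')]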
Update 2:
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')
Upvotes: 1