Illuminati

Reputation: 565

Searching for a word in a url error

I have roughly one million URLs in a text file, each with a unique ID and its own search terms. I need to open each URL and search the page for its search terms, recording 1 if a term is present and 0 otherwise.

Input file:

"ID" "URL","SearchTerm1","Searchterm2"
"1","www.google.com","a","b"
"2","www.yahoo.com","f","g"
"3","www.att.net","k"
"4" , "www.facebook.com","cs","ee"

Code Snippet:

import urllib2
import re
import csv 
import datetime 
from BeautifulSoup import BeautifulSoup

with open('txt.txt') as inputFile, open ('results.txt','w+') as proc_seqf:
        header = 'Id' + '\t' + 'URL' +  '\t'  
        for i in range(1,3):
            header += 'Found_Search' + str(i) +  '\t'
        header += '\n'
        proc_seqf.write(header)
        for line in inputFile:
            line=line.split(",")
            url = 'http://' + line[1]
            req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
            html_content = urllib2.urlopen(req).read()
            soup = BeautifulSoup(html_content)
            if line[2][0:1] == '"' and line[2][-1:] == '"':
                 line[2] = line[2][1:-1]
            matches = soup(text=re.compile(line[2]))
            #print soup(text=re.compile(line[2]))
            #print matches
            if len(matches) == 0 or line[2].isspace() == True:
                output_1 =0
            else:
                output_1 =1
            #print output_1
            #print line[2]
            if line[3][0:1] == '"' and line[3][-1:] == '"':
                 line[3] = line[3][1:-1]
            matches = soup(text=re.compile(line[3]))
            if len(matches) == 0 or line[3].isspace() == True:
                output_2 =0
            else:
                output_2 =1
            #print output_2
            #print line[3]

            proc_seqf.write("{}\t{}\t{}\t{}\n".format(line[0],url,output_1, output_2))

Output file:

ID,SearchTerm1,Searchterm2
1,0,1
2,1,0
3,0
4,1,1

Two issues with the code:

  1. when I run around 200 urls at once it gives me urlopen error [Errno 11004] getaddrinfo failed error.

  2. Is there a way to search something which closely matches but not exact match?

Upvotes: 4

Views: 162

Answers (1)

snakecharmerb

Reputation: 55874

when I run around 200 urls at once it gives me urlopen error [Errno 11004] getaddrinfo failed error.

This error message is telling you that the DNS lookup for the server hosting the url has failed.

This is outside the control of your program, but you can decide how to handle the situation.

The simplest approach is to trap the error, log it and carry on:

try:
    html_content = urllib2.urlopen(req).read()
except urllib2.URLError as ex:
    print 'Could not fetch {} because of {}, skipping.'.format(url, ex)
    # skip the rest of the loop
    continue
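
For context, here is a minimal sketch of how that skip-and-continue pattern could sit inside the main loop. It assumes Python 3 (so `urllib.error.URLError` rather than `urllib2.URLError`), and the names `check_rows`, `fetch` and `fake_fetch` are hypothetical, not from the question. The fetcher is passed in as an argument so the loop can be exercised without network access, and a plain substring test stands in for the BeautifulSoup search.

```python
import csv
import io
from urllib.error import URLError


def check_rows(rows, fetch):
    """Return (id, url, flags) per row; flags is None when the fetch failed."""
    results = []
    for row in rows:
        row = [field.strip() for field in row]
        row_id, host, terms = row[0], row[1], row[2:]
        url = 'http://' + host
        try:
            html_content = fetch(url)
        except URLError as ex:
            print('Could not fetch {} because of {}, skipping.'.format(url, ex))
            results.append((row_id, url, None))
            # skip the rest of the loop body for this row
            continue
        # 1 if the term occurs in the page text, else 0 (empty terms count as 0)
        flags = [1 if term and term in html_content else 0 for term in terms]
        results.append((row_id, url, flags))
    return results


def fake_fetch(url):
    # Hypothetical stand-in for urlopen(...).read(), used only for this demo.
    if 'att.net' in url:
        raise URLError('getaddrinfo failed')
    return 'a page that mentions a and g'


sample = io.StringIO('"1","www.google.com","a","b"\n"3","www.att.net","k"\n')
# csv.reader strips the surrounding double quotes from each field
results = check_rows(csv.reader(sample), fake_fetch)
```

Recording `None` for a failed URL, rather than dropping the row, is one possible design choice: it keeps failed lookups distinguishable from pages where no term was found.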

However, it's possible that the error is transient, and that the lookup will work if you try later; for example, perhaps the DNS server is configured to reject incoming requests if it receives too many in too short a space of time.
In this situation, you can write a function to retry after a delay:

import time

class FetchFailedException(Exception):
    pass

def fetch_url(req, retries=5):
    for i in range(1, retries + 1):
        try:
            html_content = urllib2.urlopen(req).read()
        except urllib2.URLError as ex:
            print 'Could not fetch {} because of {}, retrying.'.format(req.get_full_url(), ex)
            # wait a little longer after each failed attempt
            time.sleep(1 * i)
            continue
        else:
            return html_content
    # if we reach here then all lookups have failed
    raise FetchFailedException(req.get_full_url())

# In your main code
try:
    html_content = fetch_url(req)
except FetchFailedException:
    print 'Could not fetch {} after retrying, skipping.'.format(url)
    # skip the rest of the loop
    continue

Is there a way to search something which closely matches but not exact match?

If you want to match a string with an optional trailing dot, use the ? modifier.

From the docs:

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

>>> s = 'Abc In'
>>> m = re.match(r'Abc In.', s)
>>> m is None
True

# Escape the `.` so that it matches a literal dot; `?` then makes that dot optional
>>> m = re.match(r'Abc In\.?', s)
>>> m.group()
'Abc In'
>>> m = re.match(r'Abc In\.?', 'Abc In.')
>>> m.group()
'Abc In.'

Notice the r character preceding the regex patterns. This denotes a raw string. It's good practice to use raw strings in your regex patterns because they make it much easier to handle backslash (\) characters, which are very common in regexes.

So you could construct a regex to match optional trailing dots like this:

matches = soup(text=re.compile(r'{}\.?'.format(line[2])))
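
One caveat when building a pattern from a search term: any regex metacharacters in the term (such as `.` or `+`) would be interpreted as regex syntax rather than literal text. Below is a minimal sketch, assuming Python 3, of escaping the term with `re.escape` before appending the optional trailing dot. `build_pattern` is a hypothetical helper name, not part of the original answer.

```python
import re


def build_pattern(term):
    """Compile a regex matching `term` literally, with an optional trailing dot."""
    # re.escape makes characters such as '.' or '+' match themselves
    return re.compile(re.escape(term) + r'\.?')


pattern = build_pattern('Abc In')
with_dot = pattern.search('Visit Abc In. today').group()
without_dot = pattern.search('Visit Abc In today').group()

# A term containing a dot only matches that literal dot
literal = build_pattern('a.b')
matches_literal = literal.search('a.b') is not None
matches_other = literal.search('axb') is not None
```

The compiled pattern could then be passed to the search in the same way as before, for example `soup(text=build_pattern(line[2]))`.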

Upvotes: 2
