colemen

Reputation: 1

Finding a certain link in a webpage, using BeautifulSoup

from BeautifulSoup import BeautifulSoup
import urllib2
import re


user = raw_input('begin here!: ')
base = ("http://1337x.org/search/")
print (base + user)
add_on = "/0/"
total_link = (base + user + add_on)
# urlopen's second argument is POST data, not a file mode, so it is omitted
html_data = urllib2.urlopen(total_link).read()
soup = BeautifulSoup(html_data)
announce = soup.find('a', attrs={'href': re.compile("^/announcelist")})
print announce

I am attempting to retrieve a torrent link from a page, preferably the first non-sponsored link, and then print it. I am rather new to coding, so as much detail as you can give would be perfect! Thank you so much for the help!

Upvotes: 0

Views: 135

Answers (1)

brandizzi

Reputation: 27090

The problem is in your regular expression. It looks like you are trying to use the ^ character to negate the regex, but that does not work in your situation. The ^ only negates a character set (the characters inside []), and only when it is the first character of the set. For example, [^aeiou] means "any character except a, e, i, o and u".

When ^ appears outside a character set, it matches the beginning of the string (or of each line, in multiline mode). For example, ^aeiou matches text that starts with the string aeiou.
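To make the two meanings of ^ concrete, here is a small sketch that uses only Python's re module. The sample strings are made up for illustration and are not taken from the page in the question.

```python
import re

# Inside a character set, a leading ^ negates the set:
# find every character of "audio" that is not a vowel.
non_vowels = re.findall(r"[^aeiou]", "audio")
print(non_vowels)

# Outside a set, ^ anchors the pattern at the start of the string.
anchored = re.compile(r"^/announcelist")
starts_with = bool(anchored.search("/announcelist/123"))
contains_only = bool(anchored.search("/torrent/announcelist"))
print(starts_with)
print(contains_only)
```

So the original pattern ^/announcelist selects hrefs that begin with /announcelist; it does not exclude them.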

So, how would you negate a regex? Well, the best way I see is to use a negative lookahead, which is a regex that starts with (?! and ends with ). For your case, it is pretty easy:

(?!/announcelist)

So, replace re.compile("^/announcelist") with re.compile("(?!/announcelist)") and it should work; at least it worked here :)
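One caveat worth checking: a bare lookahead behaves differently depending on whether the pattern is applied with match() or search(). The sketch below uses only the re module with hypothetical href values (invented for this example), and it makes no claim about which of the two your BeautifulSoup version calls internally. If the unanchored pattern seems to match everything, anchoring it as ^(?!/announcelist) is the safer form.

```python
import re

# Hypothetical href values, invented for this example.
hrefs = ["/announcelist/42", "/torrent/42/some-name/", "/search/foo/0/"]

unanchored = re.compile(r"(?!/announcelist)")
anchored = re.compile(r"^(?!/announcelist)")

# match() only tries position 0, so the lookahead acts as
# "does not start with /announcelist".
kept_match = [h for h in hrefs if unanchored.match(h)]

# search() tries every position. The unanchored lookahead succeeds
# at position 1 even for "/announcelist/42", so nothing is filtered.
kept_search = [h for h in hrefs if unanchored.search(h)]

# With the ^ anchor, search() gives the same result as match() above.
kept_anchored = [h for h in hrefs if anchored.search(h)]

print(kept_match)
print(kept_search)
print(kept_anchored)
```

The anchored form gives the same result under both match() and search(), which is why it is the more defensive choice when you do not control how the pattern is applied.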

Upvotes: 1
