tareq

Reputation: 1339

Using urllib2 and BeautifulSoup, not receiving the data I see in the browser

I am trying to scrape a website:

http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0

I know that opening the link above directly shows no search results, but when I run the search manually there are results.

The problem is that when I open this link in my browser, I see the page as expected. When I open it with urllib2 and parse it with Beautiful Soup, however, the output says something along the lines of this search not being available.

I am new to this, so I am not quite sure how it works. Do websites have things built in that stop tools like urllib2 and BeautifulSoup from working?

import urllib2
from bs4 import BeautifulSoup

# Fetch the second results page directly (Python 2)
File = urllib2.urlopen("http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0")

Html = File.read()
File.close()

soup = BeautifulSoup(Html)

# Collect the href of every link on the page
lawyerlinks = []

for link in soup.find_all("a"):
    lawyerlinks.append(link.get('href'))

# Keep only the slice where I expect the lawyer links to be
lawyerlinks = lawyerlinks[76:100]

print lawyerlinks

Upvotes: 0

Views: 133

Answers (1)

Claudiu

Reputation: 229311

That's fascinating. Going to the first page of results works, and then clicking "Next" works, and all it does is take you to the URL you posted. But if I visit that URL directly, I get no results.

Note that urllib2.urlopen is indeed behaving exactly like a browser here. If you open a browser directly to that page, you get no results - which is exactly what you get with urlopen.

What you want to do is mimic a browser, visit the first page of results, and then mimic clicking 'next' just like a browser would. The best library I know of for this is mechanize.

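import mechanize

# mechanize.Browser keeps state between requests, like a real browser
br = mechanize.Browser()

# Visit the first page of results
br.open("http://www.gabar.org/membersearchresults.cfm?id=ED162783-9C8E-9913-79DBE86CBE9FB115")

# Follow the first link whose text matches "Next"
response1 = br.follow_link(text_regex=r"Next", nr=0)
Html = response1.read()

# The rest of your code stays the same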
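
If mechanize is not an option, the same "visit the first page, then follow Next" flow can be approximated with the standard library alone. The sketch below is written for Python 3 (where urllib.request replaces urllib2) and rests on an assumption the thread does not confirm: that the site ties the search id to a session cookie set on the first page. The names LinkCollector, find_link and fetch_next_page are made up for illustration.

```python
import re
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside a link that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html, base_url, text_pattern):
    """Return the absolute URL of the first link whose text matches, or None."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        if re.search(text_pattern, text):
            return urljoin(base_url, href)
    return None


def fetch_next_page(first_page_url, text_pattern=r"Next"):
    """Open the first page, then follow its 'Next' link with the same cookies."""
    # Assumption: a cookie jar shared across both requests is enough state
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    first_html = opener.open(first_page_url).read().decode("utf-8", "replace")
    next_url = find_link(first_html, first_page_url, text_pattern)
    if next_url is None:
        return None
    return opener.open(next_url).read().decode("utf-8", "replace")
```

Calling fetch_next_page with the first-page URL from the answer would return the HTML of the second page, which can then be passed to BeautifulSoup as before. If the site depends on something other than cookies (for example the Referer header), this sketch will not be enough and mechanize remains the simpler route.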

Upvotes: 3
