Reputation: 1339
I am trying to scrape a website:
http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0
I know the link above will show that there are no search results, but when I do the search manually there are results.
When I open this link in my browser, I see the page as expected. When I fetch it and parse it with Beautiful Soup, however, the output says something along the lines of this search not being available.
I am new to this, so I am not sure how it works. Do websites have built-in measures that stop tools like urllib2 and BeautifulSoup from working?
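import urllib2
from bs4 import BeautifulSoup

# Fetch the results page and read the raw HTML
File = urllib2.urlopen("http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0")
Html = File.read()
File.close()
soup = BeautifulSoup(Html)
# Collect the href of every <a> tag on the page
AllLinks = soup.find_all("a")
lawyerlinks = []
for link in AllLinks:
    lawyerlinks.append(link.get('href'))
# Keep only the slice of links that holds the lawyer profiles
lawyerlinks = lawyerlinks[76:100]
print lawyerlinks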
Upvotes: 0
Views: 133
Reputation: 229311
That's fascinating. Going to the first page of results works, and then clicking "Next" works, and all it does is take you to the URL you posted. But if I visit that URL directly, I get no results.
Note that urllib2.urlopen is indeed behaving exactly like a browser here. If you open a browser directly to that page, you get no results - which is exactly what you get with urlopen.
What you want to do is mimic a browser, visit the first page of results, and then mimic clicking 'next' just like a browser would. The best library I know of for this is mechanize.
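import mechanize
br = mechanize.Browser()
# Load the first page of results, as a browser would
br.open("http://www.gabar.org/membersearchresults.cfm?id=ED162783-9C8E-9913-79DBE86CBE9FB115")
# Follow the first link whose text matches "Next", like clicking it
response1 = br.follow_link(text_regex=r"Next", nr=0)
Html = response1.read()
# The rest is the same as in the question: parse Html with BeautifulSoup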
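If mechanize is not an option, the same "visit page one, then follow Next" idea can be roughly sketched with the Python 3 standard library alone. This is only a sketch, not what the site requires: it assumes the server ties the search to a cookie set on the first results page, which is a guess based on the behavior described above. The names LinkCollector, find_next_link and fetch_second_page are made up for this illustration.

```python
# Sketch only (Python 3, standard library, not mechanize).
# Assumption: the site links the search to a session cookie set on page one.
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin
from urllib.request import HTTPCookieProcessor, build_opener


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> tag that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_next_link(html, base_url, label="Next"):
    """Return the absolute URL of the first link whose text contains label, or None."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        if label in text:
            return urljoin(base_url, href)
    return None


def fetch_second_page(first_page_url):
    """Open the first results page, then follow its "Next" link with the same cookies."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    first_html = opener.open(first_page_url).read().decode("utf-8", "replace")
    next_url = find_next_link(first_html, first_page_url)
    if next_url is None:
        return None
    return opener.open(next_url).read().decode("utf-8", "replace")
```

Calling fetch_second_page with the URL of the first results page would return the HTML of the second page, which can then be passed to BeautifulSoup as before. If the site relies on something other than cookies (a Referer check, for example), this sketch would need adjusting.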
Upvotes: 3