import urllib2
import re
import csv
from bs4 import BeautifulSoup


def get_BlahBlah(num1, num2, num3, num4):
    # Assemble the profile URL from its static and numeric parts.
    url1 = "http://BlahBlah.com/person_profile/"
    url2 = "?-id="
    url3 = "."
    url4 = "&source=personalranking="
    urlComplete = url1 + str(num1) + url2 + str(num2) + url3 + str(num3) + url4 + str(num4)
    page = urllib2.urlopen(urlComplete)
    soup_BlahBlah = BeautifulSoup(page, "lxml")
    page.close()
    # Returns None implicitly when the page has no rank tag.
    rank_tag = soup_BlahBlah.find('h1', class_="personal_rank")
    if rank_tag:
        rank_string = rank_tag.span.string
        return rank_string


# Each for loop advances its own counter, so no manual "+= 1" is needed.
for num1_count in range(28343512, 28343513):
    for num2_count in range(9999888888, 9999888889):
        for num3_count in range(7777, 7778):
            for num4_count in range(0, 1):
                record = get_BlahBlah(num1_count, num2_count, num3_count, num4_count)
                saveFile = open('BlahBlah.csv', 'a')
                saveFile.write(str(record)+'\n')
                saveFile.close()
The code above works, but I want to make it more efficient for my needs. I am trying to crawl the site and extract the "rank" information (the h1 tag with class "personal_rank") for each unique individual, and I want to cover every person on the entire site.
The site's URLs are made up of static parts and varying numeric parts, for example:
http://BlahBlah.com/person_profile/XXXXXXXX?-id=XXXXXXXXXX.XXXX&source=personalranking=X (note that this is not the site I actually want to crawl; it is only an example)
where each X can be any digit from 0 to 9. Here are my questions:
Let's say all the numeric portions of the URL are unique to a single person, so I have to cycle through them with multiple loops as in my current code. Is there a more efficient way to do this than four nested loops, which I find very time-consuming?
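To make the first question concrete, here is a rough sketch of one way the four loops could be collapsed into a single one with itertools.product. The names URL_TEMPLATE, build_url and all_combinations are made up for illustration, and the ranges are small placeholders. This only tidies the looping; the number of URLs to fetch stays the same, so it would not by itself reduce the crawl time.

```python
from itertools import product

# URL pattern from the question, with one slot per numeric part.
URL_TEMPLATE = ("http://BlahBlah.com/person_profile/{0}"
                "?-id={1}.{2}&source=personalranking={3}")


def build_url(num1, num2, num3, num4):
    """Fill the four numeric parts into the URL pattern."""
    return URL_TEMPLATE.format(num1, num2, num3, num4)


def all_combinations(range1, range2, range3, range4):
    """Yield (num1, num2, num3, num4) tuples in nested-loop order."""
    return product(range1, range2, range3, range4)


# Placeholder ranges; the real ones would cover the whole ID space.
combos = all_combinations(range(28343512, 28343514),
                          range(9999888888, 9999888889),
                          range(7777, 7778),
                          range(0, 1))
urls = [build_url(*combo) for combo in combos]
# Each combination would then be fetched, e.g. with get_BlahBlah(*combo).
```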
Now, let's say only num1_count is unique to a single person, and the num2_count, num3_count, and num4_count portions can be any combination (as long as each part keeps the same number of digits) and will still refer to the same person (see the examples below). How can I use a regex to replace my current code? And if I use a regex to represent parts of the URLs, how can I combine it with loops?
1) http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1 refers to Peter Pan
2) http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1 also refers to Peter Pan
3) http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1 refers to Robin King
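To illustrate what I mean by using a regex, here is a rough sketch that picks the person-identifying part out of a URL and keeps one URL per person. It assumes the ID is the eight-digit run right after person_profile/, as in the examples above; PROFILE_RE, person_id and first_url_by_person are names made up for this sketch. As far as I can tell, a regex can only recognise URLs like this, not generate them, so the loop over candidate URLs would still be needed.

```python
import re

# Captures the eight digits after "person_profile/"; the remaining
# numeric parts are matched but not captured.
PROFILE_RE = re.compile(
    r"person_profile/(\d{8})\?-id=\d{10}\.\d{4}&source=personalranking=\d$")


def person_id(url):
    """Return the person ID embedded in a profile URL, or None if no match."""
    match = PROFILE_RE.search(url)
    return match.group(1) if match else None


example_urls = [
    "http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1",
    "http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1",
    "http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1",
]

# Keep only the first URL seen for each person ID.
first_url_by_person = {}
for url in example_urls:
    pid = person_id(url)
    if pid is not None and pid not in first_url_by_person:
        first_url_by_person[pid] = url
```

With the three example URLs, the first two share one ID and the third has its own, so only two entries remain.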
Thanks in advance.