Using Regex for URLs with BeautifulSoup?

Question

import urllib2
import re
import csv
from bs4 import BeautifulSoup

def get_BlahBlah(num1, num2, num3, num4):
    url1 = "http://BlahBlah.com/person_profile/"
    url2 = "?-id="
    url3 = "."
    url4 = "&source=personalranking="
    urlComplete = url1 + str(num1) + url2 + str(num2) + url3 + str(num3) + url4 + str(num4) 
    page = urllib2.urlopen(urlComplete)
    soup_BlahBlah = BeautifulSoup(page, "lxml")
    page.close()

    rank_tag = soup_BlahBlah.find('h1', class_="personal_rank") 

    if rank_tag:
        rank_string = rank_tag.span.string
        return rank_string

for num1_count in range(28343512, 28343513):
    for num2_count in range(9999888888, 9999888889):
        for num3_count in range (7777, 7778):
            for num4_count in range(0, 1):

                record = get_BlahBlah(num1_count, num2_count, num3_count, num4_count)

                saveFile = open('BlahBlah.csv', 'a')
                saveFile.write(str(record)+'\n')
                saveFile.close()

                num4_count += 1
            num3_count += 1
        num2_count += 1
    num1_count += 1

The above code works, but I want to make it more efficient for my needs. What I am trying to do is crawl the site and extract the "rank" information (the h1 tag with class "personal_rank") for each unique individual, and I want to cover every person on the entire site.

The site's URL structure is composed of various static and varying (numeric) parts, for example:

http://BlahBlah.com/person_profile/XXXXXXXX?-id=XXXXXXXXXX.XXXX&source=personalranking=X

(Note that this is not the site I want to crawl; it is just used as an example.)

Here X can be any digit from 0 to 9. I have three questions:

1. Let's say all the numeric portions of the URL are unique to a single person, and I can cycle through them with nested loops as in my current code. Is there a more efficient way to do this than four nested loops, which I find very time-consuming?

2. Now let's say only num1_count is unique to a single person, and the num2_count, num3_count, and num4_count portions can be any combination of digits (as long as the number of digits in each portion stays the same) and will still refer to the same person (see the examples below). How can I use regex to replace my current code? And if I use regex to represent parts of the URL, how can I combine it with loops?

   1) http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1 refers to Peter Pan
   2) http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1 also refers to Peter Pan
   3) http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1 refers to Robin King

3. Following up on point 2, let's say the number of digits in num1_count through num3_count matters, but the last numeric portion can be either one or two digits and the URL will still refer to the same person. How can I code that?
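To make the URL shape above concrete, here is a minimal sketch of how it might be written as a regex and combined with loops. It assumes the digit counts shown in the example (8, 10, and 4 digits, then 1 or 2 digits at the end). The names PROFILE_URL, person_key, build_url, candidate_urls, and all_combinations are made up for illustration, as are the fixed default values for the parts assumed not to matter.

```python
import itertools
import re

# Assumed URL shape: 8 digits, 10 digits, 4 digits, then 1 or 2 digits.
PROFILE_URL = re.compile(
    r"^http://BlahBlah\.com/person_profile/(?P<num1>\d{8})"
    r"\?-id=(?P<num2>\d{10})\.(?P<num3>\d{4})"
    r"&source=personalranking=(?P<num4>\d{1,2})$"
)


def person_key(url):
    """Return the num1 part as a string, or None if the URL does not match."""
    match = PROFILE_URL.match(url)
    return match.group("num1") if match else None


def build_url(num1, num2, num3, num4):
    """Assemble a profile URL from its four numeric parts."""
    return ("http://BlahBlah.com/person_profile/{0}?-id={1}.{2}"
            "&source=personalranking={3}").format(num1, num2, num3, num4)


def candidate_urls(num1_values, num2=1111111111, num3=1111, num4=1):
    """Yield one URL per num1 value, holding the other parts fixed.

    This only makes sense under the assumption that num1 alone
    identifies the person.
    """
    for num1 in num1_values:
        yield build_url(num1, num2, num3, num4)


def all_combinations(r1, r2, r3, r4):
    """Yield every combination of the four parts in a single flat loop."""
    for nums in itertools.product(r1, r2, r3, r4):
        yield build_url(*nums)
```

A regex can check a URL or pull parts out of it, but it does not generate URLs, so a loop is still needed to produce them. If only num1 matters, one loop over num1 is enough. itertools.product flattens the four nested loops into one, although it still visits the same number of combinations.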

Thanks in advance.
