KubiK888

Reputation: 4723

Using Regex for URLs with BeautifulSoup?

import urllib2
import re
import csv
from bs4 import BeautifulSoup

def get_BlahBlah(num1, num2, num3, num4):
    url1 = "http://BlahBlah.com/person_profile/"
    url2 = "?-id="
    url3 = "."
    url4 = "&source=personalranking="
    urlComplete = url1 + str(num1) + url2 + str(num2) + url3 + str(num3) + url4 + str(num4) 
    page = urllib2.urlopen(urlComplete)
    soup_BlahBlah = BeautifulSoup(page, "lxml")
    page.close()

    rank_tag = soup_BlahBlah.find('h1', class_="personal_rank") 

    if rank_tag:
        rank_string = rank_tag.span.string
        return rank_string

# range() excludes its end value, so each range below yields exactly one number.
for num1_count in range(28343512, 28343513):
    for num2_count in range(9999888888, 9999888889):
        for num3_count in range(7777, 7778):
            for num4_count in range(0, 1):
                record = get_BlahBlah(num1_count, num2_count, num3_count, num4_count)

                # Append one rank per line; the file is reopened for every record.
                saveFile = open('BlahBlah.csv', 'a')
                saveFile.write(str(record) + '\n')
                saveFile.close()

The code above works, but I want to improve it and make it more efficient for my needs. I am trying to crawl the site and extract the "rank" information (the h1 tag with class "personal_rank") for each unique individual, and I want to cover every person on the entire site.
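One possible efficiency tweak, as a rough sketch only: open the output file once instead of once per record, and let itertools.product stand in for the four nested loops. The names crawl_ranks, fetch_rank and fake_fetch are hypothetical; fetch_rank is a stand-in for get_BlahBlah so the sketch runs without touching the network, and it is written for Python 3.

```python
import csv
import io
import itertools


def crawl_ranks(id_ranges, fetch_rank, out_file):
    """Write one CSV row per ID combination and return the number of rows written.

    id_ranges  -- four iterables, one per numeric URL part
    fetch_rank -- callable taking the four numbers and returning a rank or None
                  (hypothetical stand-in for get_BlahBlah)
    out_file   -- an already-open, writable text file object
    """
    writer = csv.writer(out_file)
    rows = 0
    # itertools.product yields every combination, like four nested for loops.
    for num1, num2, num3, num4 in itertools.product(*id_ranges):
        rank = fetch_rank(num1, num2, num3, num4)
        if rank is None:
            continue  # no rank found for this combination, so skip it
        writer.writerow([num1, num2, num3, num4, rank])
        rows += 1
    return rows


# Usage with a fake fetcher and an in-memory file, so nothing hits the network.
def fake_fetch(num1, num2, num3, num4):
    return "42" if num4 == 0 else None


buffer = io.StringIO()
ranges = [range(28343512, 28343513), range(9999888888, 9999888889),
          range(7777, 7778), range(0, 2)]
written = crawl_ranks(ranges, fake_fetch, buffer)
```

Unlike the original loop, this sketch also writes the four numbers next to each rank and skips missing ranks instead of writing "None"; both are design choices rather than requirements.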

The site's URLs are made up of static parts and varying numeric parts, for example:

http://BlahBlah.com/person_profile/XXXXXXXX?-id=XXXXXXXXXX.XXXX&source=personalranking=X

Here each X can be any digit from 0 to 9. (Note that this is not the site I actually want to crawl; it is only used as an example.) Here are my three different questions:

1) http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1 refers to Peter Pan
2) http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1 also refers to Peter Pan
3) http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1 refers to Robin King
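To make the URL layout concrete, here is a minimal sketch of a regular expression that splits the three example URLs above into their numeric parts. The pattern assumes the digit counts shown in the X template (8, 10, 4 and 1), and the names PROFILE_URL, parse_profile_url and the num1 to num4 group labels are my own, not anything defined by the site.

```python
import re

# Pattern for the example URL layout. The group names are illustrative labels.
PROFILE_URL = re.compile(
    r"^http://BlahBlah\.com/person_profile/(?P<num1>\d{8})"
    r"\?-id=(?P<num2>\d{10})\.(?P<num3>\d{4})"
    r"&source=personalranking=(?P<num4>\d)$"
)


def parse_profile_url(url):
    """Return the four numeric parts as a dict of strings, or None if no match."""
    match = PROFILE_URL.match(url)
    return match.groupdict() if match else None


urls = [
    "http://BlahBlah.com/person_profile/12345678?-id=1111111111.1111&source=personalranking=1",
    "http://BlahBlah.com/person_profile/12345678?-id=2222222222.1111&source=personalranking=1",
    "http://BlahBlah.com/person_profile/12345670?-id=2222222222.1111&source=personalranking=1",
]
parsed = [parse_profile_url(u) for u in urls]

# Compare the first numeric part across the examples.
same_person = parsed[0]["num1"] == parsed[1]["num1"]
different_person = parsed[1]["num1"] != parsed[2]["num1"]
```

In these three examples at least, the two Peter Pan URLs share the same first number while the Robin King URL has a different one, even though it shares its -id value with the second URL. Whether that holds for the whole site is not something the examples can confirm.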

Thanks in advance.

Upvotes: 2

Views: 377

Answers (0)
