Reputation: 5476
I'm trying to write a basic web crawler in Python. The trouble I'm having is parsing the page to extract URLs. I've tried both BeautifulSoup and regex, but I can't achieve an efficient solution.
As an example, I'm trying to extract all the member URLs from Facebook's GitHub page (https://github.com/facebook?tab=members). The code I've written extracts the member URLs:
import urllib2
from BeautifulSoup import BeautifulSoup

def getMembers(url):
    # e.g. url = "https://github.com/facebook?tab=members"
    text = urllib2.urlopen(url).read()
    soup = BeautifulSoup(text)
    memberList = []
    # Retrieve every user from the company
    data = soup.findAll('ul', attrs={'class': 'members-list'})
    for div in data:
        links = div.findAll('li')
        for link in links:
            memberList.append("https://github.com" + str(link.a['href']))
    return memberList
However, this takes quite a while to parse, and I was wondering if I could do it more efficiently, since the overall crawling process is too long.
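One idea I'm considering is to parse only the fragment of the page I actually need, using a SoupStrainer so BeautifulSoup never builds the rest of the tree. A rough sketch of what I mean, assuming BeautifulSoup 4 (bs4) instead of the old module, with a hypothetical helper name:

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

def getMembersFast(url):  # hypothetical variant of getMembers above
    text = urllib2.urlopen(url).read()
    # Only build the members list; the rest of the page is never parsed.
    strainer = SoupStrainer('ul', attrs={'class': 'members-list'})
    soup = BeautifulSoup(text, 'html.parser', parse_only=strainer)
    members = []
    for li in soup.findAll('li'):
        if li.a and li.a.get('href'):
            members.append("https://github.com" + li.a['href'])
    return members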
Upvotes: 2
Views: 2618
Reputation: 1
Check the post Extremely Simple Web Crawler for a simple, easy-to-understand Python script that crawls web pages and collects all the valid hyperlinks given a seed URL and a depth.
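If the link ever goes dead, the core idea is just a depth-limited breadth-first crawl. A minimal sketch (not the exact script from that post), using the same urllib2/BeautifulSoup stack as the question:

import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

def crawl(seed, depth):
    # Breadth-first crawl, following links up to `depth` levels from the seed.
    seen = set([seed])
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            try:
                html = urllib2.urlopen(url).read()
            except Exception:
                continue  # skip pages that fail to load
            for a in BeautifulSoup(html).findAll('a', href=True):
                link = urljoin(url, a['href'])
                if link.startswith('http') and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen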
Upvotes: 0
Reputation: 113
To avoid writing the scraper yourself, you can use an existing one. Maybe try Scrapy: it's written in Python and available on GitHub. http://scrapy.org/
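For instance, a minimal spider for the page in the question might look like this; the CSS selector just mirrors the members-list markup the question relies on, so treat it as an assumption about the page:

import scrapy

class MembersSpider(scrapy.Spider):
    name = "members"
    start_urls = ["https://github.com/facebook?tab=members"]

    def parse(self, response):
        # Assumes the same ul.members-list markup as in the question.
        for href in response.css("ul.members-list li a::attr(href)").extract():
            yield {"url": response.urljoin(href)}

Run it with scrapy runspider members_spider.py -o members.json. Scrapy handles the downloading, concurrency and retries for you, which is usually where most of the crawl time goes.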
Upvotes: 1
Reputation: 1580
I suggest that you use the GitHub API, which lets you do exactly what you want to accomplish. Then it's only a matter of using a JSON parser and you're done.
http://developer.github.com/v3/orgs/members/
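A minimal sketch using only the standard library (note that unauthenticated calls see public members only, are rate-limited, and return 30 members per page, so a real script would follow the pagination Link headers):

import json
import urllib2

def getMembers(org):
    # GitHub API v3: list the (public) members of an organization.
    url = "https://api.github.com/orgs/%s/members" % org
    data = json.load(urllib2.urlopen(url))
    return [member["html_url"] for member in data]

print getMembers("facebook")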
Upvotes: 1