Reputation: 4689
I'm trying to parse a wiki page here, but I only want certain parts: the links in the main article. I'd like to parse them all. Is there an article or tutorial on how to do it? I'm assuming I'd be using BS4. Can anyone help?
Specifically, I want the links that are under all the main headers on the page.
Upvotes: 0
Views: 140
Reputation: 318
Well, it really depends on what you mean by "parse", but here is a full working example of how to extract all links from the main section with BeautifulSoup:
from bs4 import BeautifulSoup
import urllib.request

def main():
    url = 'http://yugioh.wikia.com/wiki/Card_Tips%3aBlue-Eyes_White_Dragon'
    page = urllib.request.urlopen(url)
    # Pass an explicit parser to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(page.read(), 'html.parser')
    content = soup.find('div', id='mw-content-text')
    links = content.find_all('a')
    for link in links:
        print(link.get_text())

if __name__ == "__main__":
    main()
This code should be self-explanatory, but just in case: we fetch the page with urllib.request.urlopen and pass its contents to BeautifulSoup. We then locate the main section by its id, mw-content-text (it can be found in the page's source), collect all the anchor tags inside it, and in a for loop print each link's text.
Additional methods you might need for parsing the links:
link.get('href') extracts the destination url
link.get('title') extracts the alternative title of the link
And since you asked for resources: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ is the first place you should start.
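To show get('href') and get('title') in action without hitting the network, here is a small sketch that parses a hypothetical HTML snippet standing in for the wiki page's mw-content-text div (the markup and link targets below are made up for illustration):

from bs4 import BeautifulSoup

# Sample markup standing in for the page's main content div
html = '''
<div id="mw-content-text">
  <a href="/wiki/Blue-Eyes_White_Dragon" title="Blue-Eyes White Dragon">Blue-Eyes White Dragon</a>
  <a href="/wiki/Summoned_Skull" title="Summoned Skull">Summoned Skull</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
content = soup.find('div', id='mw-content-text')
for link in content.find_all('a'):
    # get('href') returns the destination url, get('title') the link's title attribute
    print(link.get('href'), '-', link.get('title'))

Swap in the urlopen call from the answer above to run the same loop against the live page.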
Upvotes: 1