Strip away HTML tags from extracted links

Question

I have the following code to extract certain links from a webpage:

from bs4 import BeautifulSoup 
import urllib2, sys 
import re 

def tonaton(): 
    site = "http://tonaton.com/en/job-vacancies-in-ghana" 
    hdr = {'User-Agent' : 'Mozilla/5.0'} 
    req = urllib2.Request(site, headers=hdr) 
    jobpass = urllib2.urlopen(req) 
    invalid_tag = ('h2') 
    soup = BeautifulSoup(jobpass) 
    print soup.find_all('h2')

The links are contained in the 'h2' tags so I get the links as follows:

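<h2><a href="...">cashiers</a></h2>
<h2><a href="...">Cake baker</a></h2>
<h2><a href="...">Automobile Technician</a></h2>
<h2><a href="...">Marketing Officer</a></h2>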

But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:

<a href="...">cashiers</a>
<a href="...">Cake baker</a>
<a href="...">Automobile Technician</a>
<a href="...">Marketing Officer</a>

I therefore updated my code to look like this:

def tonaton(): 
    site = "http://tonaton.com/en/job-vacancies-in-ghana" 
    hdr = {'User-Agent' : 'Mozilla/5.0'} 
    req = urllib2.Request(site, headers=hdr) 
    jobpass = urllib2.urlopen(req) 
    invalid_tag = ('h2') 
    soup = BeautifulSoup(jobpass) 
    jobs = soup.find_all('h2') 
    for tag in invalid_tag: 
        for match in jobs(tag): 
            match.replaceWithChildren() 
    print jobs

But I couldn't get it to work, even though I thought that was the best logic I could come up with. I'm a newbie, so I know there is probably something better that could be done.

Any help will be gratefully appreciated.

Thanks

Birei · Accepted Answer

You could navigate to the next element of each <h2> tag:

for h2 in soup.find_all('h2'):
    # The first node inside each <h2> is expected to be the <a> link.
    n = h2.next_element
    if n.name == 'a':
        print n

It yields:

<a href="...">Financial Administrator</a>
<a href="...">House help</a>
<a href="...">Office Manager</a>
...
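
As for why the attempt in the question fails, two things seem to be going on: ('h2') without a trailing comma is just the string 'h2', so the loop iterates over the characters 'h' and '2', and jobs is the result list returned by find_all, which cannot be called like a soup or a tag. Below is a minimal sketch of two alternatives, written for Python 3 with the built-in "html.parser" backend. The SAMPLE_HTML markup, its href values, and the helper names extract_links and unwrap_headings are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the job listing page.
SAMPLE_HTML = """
<div>
  <h2><a href="/job/1">Cashiers</a></h2>
  <h2><a href="/job/2">Cake baker</a></h2>
  <h2>Heading without a link</h2>
</div>
"""


def extract_links(html):
    """Return the first <a> tag found inside each <h2> element."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for h2 in soup.find_all("h2"):
        a = h2.find("a")  # None when the heading holds no link
        if a is not None:
            links.append(a)
    return links


def unwrap_headings(html):
    """Remove the <h2> wrappers in place, keeping their children."""
    soup = BeautifulSoup(html, "html.parser")
    for h2 in soup.find_all("h2"):
        h2.unwrap()  # current name for replaceWithChildren()
    return soup


links = extract_links(SAMPLE_HTML)
for a in links:
    print(a)                                   # the <a> tag itself
    print(a.get_text(strip=True), a["href"])   # link text and target
```

extract_links is probably the closer fit if only the links are wanted, since it also skips headings that contain no link. unwrap_headings is nearer to the original idea: it keeps the whole document and only drops the h2 wrappers.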
