Danny Gonzalez

Reputation: 21

Extracting the text from some HTML tags

I am using BeautifulSoup to scrape job listings from a career page. I am having trouble printing only the information I need.

This is what the HTML looks like:

<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
                                        Customer Experience Specialist                                    </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>

What I want it to print out is

Customer Experience Specialist
Chicago, IL
Operations
--------------

The code I tried is this:

section = soup.find_all('div', class_='col col-xs-7 jobs-list')
for elem in section:
    wrappers = elem.find('ul').get_text()
    print(wrappers)

But that prints the text with too many newlines and spaces, like so:

                                        Customer Experience Specialist                                    


Chicago, IL
Operations

Keep in mind that there are also about 4 empty lines above the job title and another newline after 'Operations'.

Upvotes: 1

Views: 41

Answers (2)

CCebrian

Reputation: 75

Try this:

sections = soup.find_all('div', class_='col col-xs-7 jobs-list')
for section in sections:
    # find_all() returns a list of tags, so take each tag's text before splitting it
    lines = [line.strip() for line in section.get_text().split("\n") if line.strip()]
    print("\n".join(lines))
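
As a possible alternative, here is a minimal sketch that uses BeautifulSoup's stripped_strings generator instead of splitting the text by hand. The sample markup is copied from the question, with the closing </li> and </ul> tags added as an assumption so that it parses as a complete fragment. The helper name job_blocks is hypothetical, and selecting on the list-group-item class is a guess about how the real page separates listings.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question. The final </li> and </ul> are
# assumed; the question's excerpt stops before them.
html = """
<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
                                        Customer Experience Specialist                                    </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def job_blocks(soup):
    """Return one newline-joined block of text per job listing (hypothetical helper)."""
    blocks = []
    for item in soup.find_all("li", class_="list-group-item"):
        # stripped_strings yields each text node with surrounding whitespace
        # removed and skips nodes that contain only whitespace.
        blocks.append("\n".join(item.stripped_strings))
    return blocks


for block in job_blocks(soup):
    print(block)
    print("--------------")
```

Because each listing is handled separately, the separator line from the question can be printed after every block.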

Regards!

Upvotes: 1

Manish Kumar Singh

Reputation: 411

Call rstrip() on the result of get_text() to remove the trailing newlines. This removes all trailing whitespace, not just a single newline. Note that rstrip() leaves leading whitespace untouched; strip() trims both ends.

Otherwise, if there is only one line in the string S, use S.splitlines()[0].
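
To illustrate the difference, here is a small sketch on a plain string. The value of raw is a hypothetical stand-in for the get_text() output described in the question, not the actual page text.

```python
# Hypothetical string mimicking the get_text() output from the question:
# blank lines above the title, padding around it, and a trailing newline.
raw = "\n\n\n\n    Customer Experience Specialist    \n\n\nChicago, IL\nOperations\n\n"

# rstrip() trims only the end of the string; the leading blank lines survive.
print(repr(raw.rstrip()))

# strip() trims both ends but keeps the blank lines and padding in the middle.
print(repr(raw.strip()))

# Stripping every line and dropping the empty ones cleans the whole block.
cleaned = "\n".join(line.strip() for line in raw.splitlines() if line.strip())
print(cleaned)
```

On its own, rstrip() would not remove the empty lines above the job title, so filtering line by line is probably closer to what the question asks for.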

Upvotes: 0
