darkhorse
darkhorse

Reputation: 8722

How to get only the useful part of a description from an RSS feed item?

I used python feedparser to get this item["description"] from a mashable feed:

<img alt="9f4397d9c05e474fa54291507ad9c03a" src="http://rack.2.mshcdn.com/media/ZgkyMDE2LzA0LzI2LzM0LzlmNDM5N2Q5YzA1LjMzODI0LmpwZwpwCXRodW1iCTU3NXgzMjMjCmUJanBn/393b8db2/53c/9f4397d9c05e474fa54291507ad9c03a.jpg" />
<div style="float: right; width: 50px;"><a href="http://twitter.com/share?via=Mashable&amp;text=Nail+polish+stockings+are+exactly+what+you+need+for+a+lazy+summer+pedicure&amp;src=http%3A%2F%2Fmashable.com%2F2016%2F04%2F26%2Ftoe-nail-polish-stockings%2F" style="margin: 10px;"><img alt="Feed-tw" border="0" src="http://rack.1.mshcdn.com/assets/feed-tw-f7c0a094d16b7ee7c91a1e50839a8e00.jpg" /></a><a href="http://www.facebook.com/sharer.php?u=http%3A%2F%2Fmashable.com%2F2016%2F04%2F26%2Ftoe-nail-polish-stockings%2F&amp;src=sp" style="margin: 10px;"><img alt="Feed-fb" border="0" src="http://rack.1.mshcdn.com/assets/feed-fb-c0a21e8841794479b8086c32c6f24ba1.jpg" /></a></div>
<div>
    <p>Say goodbye messy pedicures and hello to finally feeling the sweet freedom of open toed shoes in summer.</p>
    <p>Japanese fashion company <a href="http://www.bellemaison.jp/cpg/fashion/fakenail/fakenail_index.html">Belle Maison</a> has a time saving solution for those of us out there who have little time and little hand coordination for painting our toenails &#8212; thin stockings with pre-painted toenails.</p>
    <div>
        <p>SEE ALSO: <a href="http://mashable.com/2016/02/23/weiner-dog-ear-plugs/">Weiner dog ear plugs will help you sleep deeper than a newborn pup</a></p>
    </div>
    <figure>
        <p><img class="" src="http://rack.1.mshcdn.com/media/ZgkyMDE2LzA0LzI2L2M1L3RvZW5haWxhcnRwLjI4NjBiLmpwZwpwCXRodW1iCTU3NXg0MDk2Pg/4f07495a/b32/toe-nail-art-polish-stockings-japan-10.jpg" /></p>
        <div>
            <p>Image:  belle maison</p>
        </div>
    </figure>
    <p>If you're worried about looking a little out-of-date with the classic stockings and open-toed heels that your grandma used to wear, don't fret. The stockings are designed to fit individual toes, giving your pedicure a better fit as well. <a href="http://mashable.com/2016/04/26/toe-nail-polish-stockings/">Read more...</a></p>
</div>
More about <a href="http://mashable.com/conversations/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Conversations</a>, <a href="http://mashable.com/pics/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Pics</a>, <a href="http://mashable.com/category/products/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Products</a>, <a href="http://mashable.com/lifestyle/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Lifestyle</a>, and <a href="http://mashable.com/category/weird-products/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Weird Products</a>

This is an awful lot of information. The part I only really need for readers is this:

<p>Say goodbye messy pedicures and hello to finally feeling the sweet freedom of open toed shoes in summer.</p>
<p>Japanese fashion company <a href="http://www.bellemaison.jp/cpg/fashion/fakenail/fakenail_index.html">Belle Maison</a> has a time saving solution for those of us out there who have little time and little hand coordination for painting our toenails &#8212; thin stockings with pre-painted toenails.</p>

How do I get only this part? Should I just go for python regex? I am not too sure since almost all descriptions are different, so writing an expression for that will be difficult. Is there another RSS item element that only provides the info I want? Thanks!

Upvotes: 0

Views: 175

Answers (2)

C Panda
C Panda

Reputation: 3405

If you want to go the re way, you can do the following

pat = re.compile(r"<div>(.*?)</div>")
s = pat.search(html).group(1)
result = [line.strip() for line in s.strip().splitlines()[:2]]
# result
['<p>Say goodbye messy pedicures and hello to finally feeling the sweet freedom of open toed shoes in summer.</p>',
 '<p>Japanese fashion company <a href="http://www.bellemaison.jp/cpg/fashion/fakenail/fakenail_index.html">Belle Maison</a> has a time saving solution for those of us out there who have little time and little hand coordination for painting our toenails &#8212; thin stockings with pre-painted toenails.</p>']

But as you can see, it's dirty and likely to break. So one solution is to write a grammar and a tiny parser. But the robust and convenient way is to use a parser like Beautifulsoup or lxml.

Upvotes: 1

Georg Grab
Georg Grab

Reputation: 2301

As you guessed correctly, Regex will not be able to accomplish this task (obligatory link to this question). So your best bet is to feed your HTML to a parser like Beautifulsoup, and write your logic for the parsed DOM Object.

from bs4 import BeautifulSoup 
soup = BeautifulSoup(my_input_html_string)
my_elements = soup.find_all('p')[0:2]

Obviously, this code assumes that you're always looking for the first two <p>'s in any given DOM you feed to it. You'll have to adjust your logic based on a consistency found by looking at different descriptions that your input provides.

Upvotes: 1

Related Questions