MrJacket

Reputation: 391

How to get text between two strings with special characters in Ruby?

I have a string (@description) that contains HTML code, and I want to extract the content between two elements. It looks something like this:

<b>Content title</b><br/>
*All the content I want to extract*
<a href="javascript:print()">

I've managed to do something like this:

@want = @description.match(/Content title(.*?)javascript:print()/m)[1].strip

But obviously this solution is far from perfect as I get some unwanted characters in my @want string.

Thanks for your help

Edit:

As requested in the comments, here is the full code:

I'm already parsing an HTML document, and the following code:

@description = @doc.at_css(".entry-content").to_s
puts @description

returns:

<div class="post-body entry-content">
<a href="http://www.photourl"><img alt="Photo title" height="333"     src="http://photourl.com" width="500"></a><br><br><div style="text-align: justify;">
Some text</div>
<b>More text</b><br><b>More text</b><br><br><ul>
<li>Numered item</li>
<li>Numered item</li>
<li>Numered item</li>
</ul>
<br><b>Content Title</b><br>
Some text<br><br>
Some text(with links and images)<br>
Some text(with links and images)<br>
Some text(with links and images)<br>
<br><br><a href="javascript:print()"><img src="http://url.com/photo.jpg"></a>
<div style="clear: both;"></div>
</div>

The text can include more paragraphs, links, images, etc. but it always starts with the "Content Title" part and ends with the javascript reference.

Upvotes: 1

Views: 876

Answers (2)

Gilles Quénot

Reputation: 185025

To test your HTML, I added tags around your code and pasted it into a file:

xmllint --html --xpath '/html/body/div/text()' /tmp/l.html

Output:

Some text
Some text
Some text
Some text

Now you can use an XPath module in Ruby and reuse the XPath expression.

You will find many examples by searching Stack Overflow.

Upvotes: 0

Dimitre Novatchev

Reputation: 243469

This XPath expression selects all (sibling) nodes between the nodes $vStart and $vEnd:

  $vStart/following-sibling::node()
           [count(.|$vEnd/preceding-sibling::node())
           =
            count($vEnd/preceding-sibling::node())
           ]

To obtain the full XPath expression to use in your specific case, simply substitute $vStart with:

/*/b[. = 'Content Title']

and substitute $vEnd with:

/*/a[@href = 'javascript:print()']

The final XPath expression after the substitutions is:

/*/b[. = 'Content Title']/following-sibling::node()
         [count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
         =
          count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
         ]

Explanation:

This is a simple corollary of the Kayessian formula for the intersection of two node-sets $ns1 and $ns2:

$ns1[count(.|$ns2) = count($ns2)]

In our case, the set of all nodes between the nodes $vStart and $vEnd is the intersection of two node-sets: all following siblings of $vStart and all preceding siblings of $vEnd.

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vStart" select="/*/b[. = 'Content Title']"/>
 <xsl:variable name="vEnd" select="/*/a[@href = 'javascript:print()']"/>

 <xsl:template match="/">
     <xsl:copy-of select=
     "$vStart/following-sibling::node()
               [count(.|$vEnd/preceding-sibling::node())
               =
                count($vEnd/preceding-sibling::node())
               ]
     "/>
==============

     <xsl:copy-of select=
     "/*/b[. = 'Content Title']/following-sibling::node()
               [count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
               =
                count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
               ]
     "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided HTML (converted to a well-formed XML document):

<div class="post-body entry-content">
    <a href="http://www.photourl">
        <img alt="Photo title" height="333"     src="http://photourl.com" width="500"/>
    </a>
    <br />
    <br />
    <div style="text-align: justify;">
    Some text</div>
    <b>More text</b>
    <br />
    <b>More text</b>
    <br />
    <br />
    <ul>
        <li>Numered item</li>
        <li>Numered item</li>
        <li>Numered item</li>
    </ul>
    <br />
    <b>Content Title</b>
    <br />
    Some text
    <br />
    <br />
    Some text(with links and images)
    <br />
    Some text(with links and images)
    <br />
    Some text(with links and images)
    <br />
    <br />
    <br />
    <a href="javascript:print()">
        <img src="http://url.com/photo.jpg"/>
    </a>
    <div style="clear: both;"></div>
</div>

the two XPath expressions (with and without variable references) are evaluated and the nodes selected in each case, conveniently delimited, are copied to the output:

<br/>
    Some text
    <br/>
<br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
<br/>
<br/>
==============

     <br/>
    Some text
    <br/>
<br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
<br/>
<br/>

Upvotes: 1
