asanas

Reputation: 4280

NodeJS: Extract a sentence from html text based on a phrase

I have some text stored in a database, which looks something like below:

let text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>"

The text can have many paragraphs and HTML tags.

Now, I also have a phrase:

let phrase = 'lose touch'

What I want to do is search for the phrase in the text and return the complete sentence in which the phrase appears inside a strong tag.

In the above example, the first paragraph also contains the phrase 'lose touch', but the second sentence should be returned because that is where the phrase is inside a strong tag. The result will be:

They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.

On the client side, I could build a DOM tree from this HTML text, convert it into an array, and search through each item in the array. In NodeJS, however, document is not available, so this is basically just plain text with HTML tags. How do I go about finding the right sentence in this blob of text?

Upvotes: 5

Views: 562

Answers (3)

planet_hunter

Reputation: 3966

I think this might help you.

If I understood the problem correctly, there is no need to involve the DOM here.

This solution would work even if the p or strong tags have attributes in them.

And if you want to search for tags other than p, simply update the regex for it and it should work.

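const search_phrase = "lose touch";
// Backslashes are doubled: inside a string or template literal, "\s" collapses to a plain "s".
// No "g" flag on strong_regex: a global regex keeps lastIndex between test() calls and can skip matches in filter().
const strong_regex = new RegExp(`<\\s*strong[^>]*>${search_phrase}<\\s*/\\s*strong>`);
const paragraph_regex = new RegExp("<\\s*p[^>]*>(.*?)<\\s*/\\s*p>", "g");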
const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";

const paragraphs = text.match(paragraph_regex);

if (paragraphs && paragraphs.length) {
    const paragraphs_with_strong_text =  paragraphs.filter(paragraph => {
        return strong_regex.test(paragraph);
    });
    console.log(paragraphs_with_strong_text);
    // prints [ '<p>They don\'t just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>' ]
}

P.S. The code is not optimised; you can change it to suit the requirements of your application.
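The question asks for the sentence rather than the whole paragraph, so here is a minimal sketch of one way to narrow the result down. It rests on a few assumptions: sentences end with ".", "!" or "?" followed by whitespace, p tags are not nested, and the phrase should be matched literally. The helper names escapeRegExp and sentencesWithStrongPhrase are made up for illustration and do not come from any library.

```javascript
// Escape regex metacharacters so the phrase is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return every sentence in which `phrase` is wrapped in a <strong> tag.
// Assumption: sentences end with ".", "!" or "?" followed by whitespace.
function sentencesWithStrongPhrase(html, phrase) {
  // No "g" flag, so test() is stateless across calls.
  const strongRe = new RegExp(
    `<\\s*strong\\b[^>]*>\\s*${escapeRegExp(phrase)}\\s*<\\s*/\\s*strong\\s*>`,
    "i"
  );
  // matchAll() requires the "g" flag; [\s\S] lets a paragraph span line breaks.
  const paragraphRe = /<\s*p\b[^>]*>([\s\S]*?)<\s*\/\s*p\s*>/gi;
  const results = [];
  for (const [, inner] of html.matchAll(paragraphRe)) {
    // Split after sentence-ending punctuation that is followed by whitespace.
    for (const sentence of inner.split(/(?<=[.!?])\s+/)) {
      if (strongRe.test(sentence)) results.push(sentence.trim());
    }
  }
  return results;
}

const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";
console.log(sentencesWithStrongPhrase(text, "lose touch"));
```

With the text from the question, this returns an array holding only the second sentence. Abbreviations such as "e.g." will confuse the sentence split, so for messy input a real HTML parser is likely the safer choice.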

Upvotes: 2

MeowBlock

Reputation: 11

First, you could run var arr = text.split("<p>") so that you can work with each paragraph individually.

Then you could loop through the array and look for your phrase inside strong tags:

for (var i = 0; i < arr.length; i++) {
    // indexOf matches the string literally; search() would treat it as a regular expression
    if (arr[i].indexOf("<strong>" + phrase + "</strong>") !== -1) {
        // arr[i] is the entire paragraph containing the phrase inside strong tags, minus the leading "<p>"
        console.log("<p>" + arr[i]);
    }
}

Upvotes: 0

Alim Giray Aytar

Reputation: 1474

There is cheerio, which works much like jQuery on the server. You can load your HTML text into it, build a DOM, and search inside it.

Upvotes: 0
