Reputation: 22820
I'm using var tmp_title = $('title').text(); with cheerio.js to get the title from a page.
Question: is there anything that could normalize a string or remove whitespace escape sequences like \n\t or \n, etc.?
Example
\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n
Into
Thousands lay wreaths at arlington cemetery gravesites
Or is there another way to get the title from a page? How can Google know that the title is in an <h3> tag, or does the Google crawler get the title from the <title> tag and then clean and normalize it to get a readable title string?
Upvotes: 0
Views: 596
Reputation: 452
I would make some analysis between:
Then the "analysis" could be as basic as
Or, if you don't mind relying on a SaaS web service, you could have a look at http://www.diffbot.com/.
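For the whitespace part of the question, a minimal sketch in plain JavaScript might look like the following. It assumes the raw string comes from $('title').text() as in the question. The helper names normalizeWhitespace and stripSitePrefix are hypothetical, and cutting everything up to the first ": " is only a guess based on the single example given, not something cheerio or Google is known to do.

```javascript
// Collapse every run of whitespace (spaces, tabs, newlines) into a single
// space and trim both ends. Uses only built-in String methods.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical heuristic (an assumption, not a general rule): treat everything
// up to the first ": " as a site prefix, drop it, and capitalize the first letter.
function stripSitePrefix(title) {
  const idx = title.indexOf(': ');
  const rest = idx === -1 ? title : title.slice(idx + 2);
  return rest.charAt(0).toUpperCase() + rest.slice(1);
}

// With cheerio this string would come from $('title').text().
const raw = '\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n';
const clean = stripSitePrefix(normalizeWhitespace(raw));
console.log(clean); // prints "Thousands lay wreaths at arlington cemetery gravesites"
```

The whitespace step should be safe for any title. The prefix step is the fragile part, since many legitimate titles contain ": " themselves, so it is probably worth checking it against real pages before relying on it.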
Upvotes: 1