Reputation: 8548
Given a webpage, I would like to extract the text for a reader view. I am aware that SFSafariViewController offers a reader mode, but for my application I need the actual text string. I am also aware of the Mercury parser, but I prefer a solution that runs locally.
I have tried many options:
DZReadability (it works but the output is oftentimes not very good, much worse than the reader of Safari)
Mozilla Readability (I could not make it run under iOS)
luin/Readability looks very interesting. It seems to be a very active GitHub project. However, I could not make it work under iOS. What I tried:
I installed and used browserify to get a stand-alone JavaScript file. However, I got the error message "Error: Mismatched anonymous define() module". I read that this problem may be solved by using derequire. I tried it but did not succeed.
Can anyone give me some advice on how to make luin/Readability work on iOS, possibly by using browserify or in any other way?
Upvotes: 0
Views: 470
Reputation: 333
I had a similar problem in my project, which needed to render the HTML produced by Readability in a TextView. My initial approach was to load the page in a WKWebView and inject a slightly modified Mozilla Readability using WKWebView's evaluateJavaScript.
The Mozilla Readability code was stored as a local file and modified by appending the following code:
// Execute Readability on the currently loaded DOM
var uri = {
  spec: location.href,
  host: location.host,
  prePath: location.protocol + "//" + location.host,
  scheme: location.protocol.substr(0, location.protocol.indexOf(":")),
  pathBase: location.protocol + "//" + location.host + location.pathname.substr(0, location.pathname.lastIndexOf("/") + 1)
};
// Parse a clone so the live page is left untouched
var documentClone = document.cloneNode(true);
var article = new Readability(uri, documentClone).parse();
// The last expression is the value handed back to evaluateJavaScript
article;
The resulting content is then rendered using DTCoreText. WKWebView will load all resources of the webpage, including images and ads. This makes the approach very memory intensive; I tried to work around this by parsing the HTML and removing images before passing it to WKWebView. Overall this works, but depending on your use case it might not be very elegant or fast.
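As a rough illustration of the image-removal step mentioned above, here is a minimal JavaScript sketch. The function name stripImages is hypothetical, and the regex-based approach is an assumption made for brevity, not the code from my project. A regex pass is not a real HTML parser and will miss edge cases (for example, a ">" inside an attribute value), so treat it as a starting point only.

```javascript
// Hypothetical helper: strip image-related markup from an HTML string.
// This is a crude regex pass, not a real HTML parser; it only targets
// the most common image tags before the string reaches the web view.
function stripImages(html) {
  return html
    // Remove <picture>...</picture> blocks, including nested <source>/<img>
    .replace(/<picture\b[\s\S]*?<\/picture>/gi, "")
    // Remove stand-alone <img ...> tags
    .replace(/<img\b[^>]*>/gi, "");
}
```

The cleaned string could then be handed to WKWebView's loadHTMLString(_:baseURL:) instead of loading the original URL. Images referenced from CSS, and other resources such as scripts and ads, would still be fetched.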
Currently I'm using a different approach: running luin/Readability on a server using PhantomJS. This gives better results in terms of content extraction and is much less memory intensive on the client.
Upvotes: 3