Andrew Ebling

Reputation: 10283

Convert HTML to plain text in Swift (without using NSAttributedString)

I need to convert some HTML into plain text and have tried the approaches outlined here:

Convert HTML to Plain Text in Swift

The problem is that on iOS 8.2, NSAttributedString has a bug which can result in an EXC_BAD_ACCESS crash (deep inside WebKit) when HTML is rendered to plain text on a background thread. The conversion needs to be done on a background thread because it can (and usually does) take a while.

So I need a more primitive solution in Swift, ideally an idiomatic one.

It also strikes me that this is probably one of those problems which has an elegant and neat functional solution - it's essentially a filter() operation on a String surely?

Upvotes: 4

Views: 4015

Answers (2)

static0886

Reputation: 784

Somewhat late to the party, but I thought this would benefit other visitors.

Basically, I have taken the solution from here and converted it to Swift 3 syntax.

The solution uses Scanner (previously NSScanner) to find each occurrence of "<", then scans up to the next ">", saving everything in between (the tag without its closing ">") into an NSString variable. It then calls replacingOccurrences(of:with:) with that variable plus ">" to remove the tag from the string.

Here's what the final function looks like:

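private func stripHTML(fromString rawString: String) -> String {
    let scanner = Scanner(string: rawString)
    var convertedString = rawString
    while !scanner.isAtEnd {
        var tag: NSString?
        // Skip ahead to the next "<" (or to the end if there is none).
        scanner.scanUpTo("<", into: nil)
        // Capture the tag up to, but not including, ">". Only strip it when a
        // closing ">" really follows; otherwise an input without tags would
        // lose every literal ">" it contains.
        if scanner.scanUpTo(">", into: &tag), let tag = tag, scanner.scanString(">", into: nil) {
            convertedString = convertedString.replacingOccurrences(of: "\(tag)>", with: "")
        }
    }

    return convertedString
}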
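For newer Swift versions (4 or later, where a String can be iterated by Character), the same idea can be written as a single pass that never searches the string again. This is only a sketch, not part of the original solution: stripTagsSinglePass is a made-up name, and like the Scanner version it treats any "<" up to the next ">" as a tag.

```swift
import Foundation

/// Hypothetical helper: drops everything between "<" and the next ">".
func stripTagsSinglePass(_ html: String) -> String {
    var result = ""
    var pending = ""        // text collected since an unmatched "<"
    var insideTag = false

    for ch in html {
        if insideTag {
            if ch == ">" {
                // Tag closed: discard what was collected.
                insideTag = false
                pending = ""
            } else {
                pending.append(ch)
            }
        } else if ch == "<" {
            insideTag = true
            pending = "<"
        } else {
            result.append(ch)
        }
    }

    // A "<" that was never closed is kept as literal text.
    if insideTag { result += pending }
    return result
}
```

A stray ">" outside a tag is left alone, and an unterminated "<" at the end of the input is kept rather than dropped.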

Upvotes: 2

Andrew Ebling

Reputation: 10283

The best solution I have come up with is a regex within a String extension, which is adequate for the HTML fragments I need to handle:

Swift 1.2

extension String {
    func plainTextFromHTML() -> String? {

        let regexPattern = "<.*?>"
        var err: NSError?

        if let stripHTMLRegex = NSRegularExpression(pattern: regexPattern, options: NSRegularExpressionOptions.CaseInsensitive, error: &err) {

            // NSRange counts UTF-16 code units, so use the NSString length rather than the Character count.
            let plainText = stripHTMLRegex.stringByReplacingMatchesInString(self, options: NSMatchingOptions.ReportProgress, range: NSMakeRange(0, (self as NSString).length), withTemplate: "")

            return err == nil ? plainText : nil
        } else {
            println("Warning: failed to create regular expression from pattern: \(regexPattern)")
            return nil
        }
    }
}

Swift 2.2

extension String {
    func plainTextFromHTML() -> String? {
        let regexPattern = "<.*?>"
        do {
            let stripHTMLRegex = try NSRegularExpression(pattern: regexPattern, options: NSRegularExpressionOptions.CaseInsensitive)
            // NSRange counts UTF-16 code units, so use utf16.count rather than characters.count.
            let plainText = stripHTMLRegex.stringByReplacingMatchesInString(self, options: NSMatchingOptions.ReportProgress, range: NSMakeRange(0, self.utf16.count), withTemplate: "")
            return plainText
        } catch {
            print("Warning: failed to create regular expression from pattern: \(regexPattern)")
            return nil
        }
    }
}

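A more advanced solution will be needed for full HTML to plain text conversion, however: this only removes tags, so character entities such as &amp; are left undecoded and the contents of script and style elements stay in the output.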
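For what it's worth, in current Swift the same regex idea fits in a few lines using replacingOccurrences(of:with:options:) with the .regularExpression option, and a handful of common entities can be decoded afterwards. This is a rough sketch only: plainTextFromHTMLFragment is a made-up name, the entity list is a small assumed subset, and it is still not a real HTML parser.

```swift
import Foundation

extension String {
    /// Hypothetical name; strips tags, then decodes a few named entities.
    func plainTextFromHTMLFragment() -> String {
        // [^>]+ also matches newlines, so tags that span lines are removed.
        var text = replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)

        // Decode "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
        let entities: [(String, String)] = [
            ("&nbsp;", " "), ("&lt;", "<"), ("&gt;", ">"),
            ("&quot;", "\""), ("&#39;", "'"), ("&amp;", "&"),
        ]
        for (entity, replacement) in entities {
            text = text.replacingOccurrences(of: entity, with: replacement)
        }
        return text
    }
}
```

Because the String API is used directly, there is no NSRange to build and so no Character versus UTF-16 length mismatch to worry about.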

Upvotes: 1
