Aaron

Reputation: 458

Is there a better way than using Lynx to convert HTML to plaintext reliably in PHP?

I want to convert an HTML file with a table-based layout to plaintext in order to send a multipart email via PHP.

I have tried a few different prebuilt classes and functions that I found on SO, but none of them produce decent results, which I believe is down to the table-based layout.

I don't want to roll my own class for stripping HTML and formatting the results, as I'm sure there are edge cases I won't account for or be able to test until I hit them in production.

The best solution I've come up with so far is:

  1. Create a temporary HTML file
  2. Use something like shell_exec("/path/to/lynx -dump temporary.html"); to create a plaintext version of the email
  3. Use some regex to get rid of any remaining unwanted tags
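Steps 1 and 2 above can be sketched roughly as follows. This is only an illustration, not a tested production helper: the function name `htmlToTextViaLynx` and the default path `/usr/bin/lynx` are assumptions, and `-nolist` and `-force_html` are extra Lynx options added here (to drop the trailing link list and to make Lynx treat the extensionless temp file as HTML).

```php
<?php
// Hypothetical helper: renders an HTML string to plaintext with Lynx.
// Returns null if Lynx is not available or produces no output.
function htmlToTextViaLynx(string $html, string $lynxPath = '/usr/bin/lynx'): ?string
{
    if (!is_executable($lynxPath)) {
        return null; // Lynx is not installed at the assumed path
    }

    // Step 1: write the HTML to a temporary file.
    $tmp = tempnam(sys_get_temp_dir(), 'mail');
    if ($tmp === false) {
        return null;
    }
    file_put_contents($tmp, $html);

    try {
        // Step 2: -dump prints the rendered page to stdout.
        // escapeshellarg() keeps unusual paths from breaking the command.
        $cmd = escapeshellarg($lynxPath)
             . ' -dump -nolist -force_html '
             . escapeshellarg($tmp);
        $text = shell_exec($cmd);
    } finally {
        unlink($tmp); // always clean up the temporary file
    }

    return is_string($text) ? $text : null;
}
```

Since Lynx renders the markup rather than echoing it, the regex pass in step 3 should only be needed for markup that was escaped or malformed in the source.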

This works fine, but I'm a little worried that it's not the optimal way of producing a decent multipart email. Is anyone aware of a better way?

To clarify, I have already tried the following without success:

Upvotes: 2

Views: 2213

Answers (2)

DhruvPathak

Reputation: 43265

PHP's DOMDocument should help you with this. You can traverse the DOM tree and pull out the content you want.

http://php.net/manual/en/class.domdocument.php

Related question on SO:

Parse HTML with PHP's HTML DOMDocument
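As a rough illustration of the traversal idea, here is a minimal sketch that walks the tree and separates table cells with tabs and rows with newlines. The function names `htmlToPlainText` and `renderNode` and the list of block-level tags are assumptions made for this example; it ignores links, lists, nested-table indentation and character sets, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: recursively turns a DOM node into plaintext.
function renderNode(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse whitespace runs, roughly as a browser would.
        return (string) preg_replace('/\s+/', ' ', $node->nodeValue);
    }

    $name = strtolower($node->nodeName);
    if (in_array($name, ['script', 'style', 'head'], true)) {
        return ''; // nothing in these elements is visible text
    }
    if ($name === 'br') {
        return "\n";
    }

    $out = '';
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        $out .= renderNode($child);
    }

    if ($name === 'td' || $name === 'th') {
        return trim($out) . "\t"; // cells in a row are tab-separated
    }
    if (in_array($name, ['tr', 'table', 'p', 'div', 'li', 'h1', 'h2', 'h3'], true)) {
        return rtrim($out) . "\n"; // block-level elements end the line
    }
    return $out;
}

// Hypothetical entry point. Assumes ASCII input or HTML that declares
// its own charset; loadHTML() otherwise guesses ISO-8859-1.
function htmlToPlainText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $text = renderNode($doc);
    $text = preg_replace('/[ \t]+\n/', "\n", $text); // trailing whitespace
    $text = preg_replace('/\n{3,}/', "\n\n", $text); // excess blank lines
    return trim($text);
}
```

For example, `htmlToPlainText('<table><tr><td>Name</td><td>Qty</td></tr></table>')` yields the two cell values separated by a tab.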

Upvotes: 1

Mr. BeatMasta

Reputation: 1312

I truly believe Lynx is not the best solution :) I've used html2text myself; it works fine and gives better results than Lynx. If you prefer the regex route, I'd expect it to be much heavier than calling out to the system shell (shell_exec, system, exec, popen), since you would need to preg_replace every unwanted tag, and in my experience regex in PHP is very slow. So if this runs on a Linux machine, I'd pass the HTML to html2text.
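A minimal sketch of handing the HTML to an external converter without a temporary file, using `proc_open`. The helper name `pipeThroughCommand` is hypothetical, and the usage line assumes an `html2text` binary on the PATH that reads HTML from stdin when no file is given; several different programs ship under that name, so check the one you have installed.

```php
<?php
// Hypothetical helper: pipes $input to a command's stdin and returns
// its stdout, or null if the process fails to start or exits non-zero.
// Writes all input before reading, so it is only suitable for
// email-sized documents that fit in the OS pipe buffers.
function pipeThroughCommand(string $input, string $command): ?string
{
    $descriptors = [
        0 => ['pipe', 'r'],              // child's stdin
        1 => ['pipe', 'w'],              // child's stdout
        2 => ['file', '/dev/null', 'w'], // discard stderr (Linux/Unix only)
    ];

    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        return null;
    }

    fwrite($pipes[0], $input);
    fclose($pipes[0]); // signal end of input to the child

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $exitCode = proc_close($process);
    return ($exitCode === 0 && $output !== false) ? $output : null;
}

// Usage (assumes html2text is installed and reads from stdin):
// $plain = pipeThroughCommand($html, 'html2text');
```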

Upvotes: 2
