Reputation: 5197
I'm trying to figure out why html2text is breaking my HTML:
<div><table> <tbody> <tr> <td> <span><strong><a href="/pages/about_paul_221673.cfm"><span>About</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/contact_us_222511.cfm"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/faqs_222510.cfm"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>
Processing it with:
cat "/home/spider/original-file.txt" | html2text -utf8 -nobs -style pretty
When I run that, I get:
Input recoding failed due to invalid input sequence. Unconverted part of text follows. ▒Contact ▒Maths Games Order ▒FAQ
s Broadbent Maths Ltd 3 High Street, Welbourn, Lincoln, LN5 0NH
When I run Devel::Peek::Dump() (Perl) on it, I see the string as:
SV = PV(0x564c0a72c860) at 0x564c09967c80
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x564c0a58bc60 "\n<div><table> <tbody> <tr> <td> <span><strong><a href=\"/pages/about_paul_221673.cfm\"><span>About</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href=\"/pages/contact_us_222511.cfm\"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href=\"/pages/faqs_222510.cfm\"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"\0 [UTF8 "\n<div><table> <tbody> <tr> <td> <span><strong><a href="/pages/about_paul_221673.cfm"><span>About</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/contact_us_222511.cfm"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/faqs_222510.cfm"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"]
CUR = 725
LEN = 736
COW_REFCNT = 1
If I remove the first bit:
<div><table>
It works fine! I don't get why it's breaking there, though. It all looks OK to me.
Upvotes: 1
Views: 339
Reputation: 5197
OK, I think I've worked it out. In this case, for some reason the • character was breaking it. I replaced it with "-", and it works now:
html2text -utf8 -nobs -o test-out.txt test.co.uk.txt
It's a bit weird that html2text breaks with HTML entities though?
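One way to narrow down this kind of failure is to check whether the saved bytes are valid UTF-8 at all, since "recoding failed due to invalid input sequence" suggests the input does not match the encoding the tool was told to expect. Below is a minimal Python sketch. The helper name `first_invalid_utf8` is hypothetical, and the windows-1252 case is an assumption about how the bullet might have been stored, not something confirmed from the file itself.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# The same bullet stored two ways (illustrative snippets, not the real file):
as_utf8 = "<span>•</span>".encode("utf-8")      # bullet becomes bytes E2 80 A2
as_cp1252 = "<span>•</span>".encode("cp1252")   # bullet becomes the single byte 95

print(first_invalid_utf8(as_utf8))     # None: valid UTF-8
print(first_invalid_utf8(as_cp1252))   # (6, 149): byte 0x95 at offset 6 is not valid UTF-8
```

To run it against the saved page, read the file in binary mode, for example `first_invalid_utf8(open("test.co.uk.txt", "rb").read())`. A non-None result points at the first byte that a UTF-8 reader would reject.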
UPDATE: The problem turned out to be that while the page declared utf-8 in its meta tag, the server was actually sending it as iso-8859-1. So I now parse the charset out of the server's header and check it before saving. If it is windows-1252, I use this command instead:
html2text -ansi -nobs -o test-out.txt test.co.uk.txt
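The header check described above could look roughly like the following Python sketch. The function names are hypothetical. The html2text flags are copied from the commands in this post rather than checked against the html2text manual. Treating iso-8859-1 the same as windows-1252, and falling back to utf-8 when the header names no charset, are both assumptions.

```python
from email.message import Message

# Assumption: both of these single-byte charsets get the -ansi treatment.
LEGACY_CHARSETS = {"iso-8859-1", "windows-1252"}


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return (msg.get_content_charset() or default).lower()


def html2text_args(header_value: str, infile: str, outfile: str) -> list:
    """Build the html2text command line based on the server's declared charset."""
    charset = charset_from_content_type(header_value)
    mode = "-ansi" if charset in LEGACY_CHARSETS else "-utf8"
    return ["html2text", mode, "-nobs", "-o", outfile, infile]


print(html2text_args("text/html; charset=windows-1252",
                     "test.co.uk.txt", "test-out.txt"))
```

The returned list can be handed to `subprocess.run(args, check=True)`. The point of the sketch is that the server's Content-Type header, not the page's meta tag, decides which mode is used.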
Upvotes: 1