Andrew Newby

Reputation: 5197

html2text command line breaking html

I'm trying to figure out why html2text is breaking my HTML:

<div><table> <tbody> <tr> <td> <span><strong><a href="/pages/about_paul_221673.cfm"><span>About</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a href="/pages/contact_us_222511.cfm"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a href="/pages/faqs_222510.cfm"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>

Processing it with:

cat "/home/spider/original-file.txt" | html2text -utf8 -nobs -style pretty

When I run that, I get:

Input recoding failed due to invalid input sequence. Unconverted part of text follows. ▒Contact ▒Maths Games Order ▒FAQ

s Broadbent Maths Ltd 3 High Street, Welbourn, Lincoln, LN5 0NH

When I run Devel::Peek::Dump() (Perl), I see the string as:

SV = PV(0x564c0a72c860) at 0x564c09967c80
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x564c0a58bc60 "\n<div><table> <tbody> <tr> <td> <span><strong><a href=\"/pages/about_paul_221673.cfm\"><span>About</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a href=\"/pages/contact_us_222511.cfm\"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a href=\"/pages/faqs_222510.cfm\"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"\0 [UTF8 "\n<div><table> <tbody> <tr> <td> <span><strong><a href="/pages/about_paul_221673.cfm"><span>About</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a href="/pages/contact_us_222511.cfm"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>&bull;</span></strong></span></td> <td> <span><strong><a href="/pages/faqs_222510.cfm"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"]
  CUR = 725
  LEN = 736
  COW_REFCNT = 1

If I remove the first bit:

<div><table>

It works fine! I don't get why it's breaking there though. It all seems OK to me?

Upvotes: 1

Views: 339

Answers (1)

Andrew Newby

Reputation: 5197

OK, I think I've worked it out. In this case, for some reason `&bull;` was breaking it. I replaced that with "-", and it works now:

html2text -utf8 -nobs -o test-out.txt test.co.uk.txt

It's a bit weird that html2text breaks with HTML entities though?

UPDATE: The problem turned out to be that while the page declared utf-8 in its meta tag, the server was actually sending it as iso-8859-1. So I now parse the charset out of the server's response header and check it before saving. If it is windows-1252, I use this command instead:

html2text -ansi -nobs -o test-out.txt test.co.uk.txt
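
For anyone wanting to do the same, the header check could look roughly like the sketch below. It is only an illustration: `pick_html2text_flag` is a made-up helper name, and it assumes you already have the raw Content-Type header value in a variable. It maps utf-8 to `-utf8` and windows-1252 to `-ansi` (the two flags used above) and prints nothing for any other charset, so those cases still need their own handling.

```shell
# Hypothetical helper: choose an html2text encoding flag from a
# Content-Type header value such as "text/html; charset=UTF-8".
pick_html2text_flag() {
    # Lowercase the header, drop any quotes, and extract the charset parameter.
    charset=$(printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -d "\"'" \
        | sed -n 's/.*charset=\([a-z0-9._-]*\).*/\1/p')

    case "$charset" in
        utf-8|utf8)          echo "-utf8" ;;  # flag used in the first command
        windows-1252|cp1252) echo "-ansi" ;;  # flag used in the second command
        *)                   : ;;             # unknown or missing: print nothing
    esac
}

# Example use (the content_type variable is an assumption, not from the post):
#   flag=$(pick_html2text_flag "$content_type")
#   html2text $flag -nobs -o test-out.txt test.co.uk.txt
```

Trusting the server header over the meta tag is a judgment call that happened to match this site. Other sites may get it the other way round.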

Upvotes: 1
