Reputation: 4902
This small program:
from lxml.html import tostring, fromstring
e = fromstring('''
<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>''')
print(tostring(e, encoding=str))  # use encoding=unicode on Python 2
will print:
<html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link href="/index.css" rel="stylesheet" type="text/css"></head><body>
<span></span>
<span></span>
</body></html>
The spaces and line breaks inside <head> are removed. This happens even if we place the two <link> elements in <body>. It seems that blank text nodes (\s*) between head elements are dropped.
How can I preserve the spaces and line breaks between the <link> elements? (I expect the output to be exactly the same as the input.)
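One way to see where the whitespace goes is to look at the .text and .tail attributes, which is where lxml stores text around elements. The sketch below is only an illustration: the helper name whitespace_report is made up here, and the exact values printed may depend on the installed lxml and libxml2 versions.

```python
from lxml.html import fromstring

DOC = '''
<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>'''


def whitespace_report(root):
    """Return (tag, text, tail) for every element under root (hypothetical helper)."""
    rows = []
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            rows.append((el.tag, el.text, el.tail))
    return rows


report = whitespace_report(fromstring(DOC))
for tag, text, tail in report:
    # repr() makes whitespace-only strings and None easy to tell apart.
    print(tag, repr(text), repr(tail))
```

If the whitespace after a <link> was dropped at parse time, that element's tail shows up as None rather than '\n', which would mean the serializer has nothing left to write back.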
Upvotes: 6
Views: 4930
Reputation: 4902
In the end, I used html5lib to parse the HTML and build an lxml tree from it:
import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
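A fuller sketch of that approach, assuming html5lib and lxml are both installed. The variable names and the final serialization step with etree.tostring(..., method="html") are additions for illustration and not part of the original answer. html5lib follows the HTML5 parsing rules, so whitespace outside <head> and <body> may still be moved or dropped.

```python
import html5lib
from lxml import etree

DOC = '''<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>'''

# Build an lxml tree with html5lib's parser instead of libxml2's HTML parser.
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("lxml"),
    namespaceHTMLElements=False,  # plain tag names, no XHTML namespace prefix
)
tree = parser.parse(DOC)

# Serialize as HTML text; whitespace-only text nodes in <head> should survive.
html_text = etree.tostring(tree, method="html", encoding="unicode")
print(html_text)
```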
Upvotes: 1
Reputation: 19645
For me,
print (tostring(e, encoding=str))
raises an exception instead:
>>> print (tostring(e, encoding=str))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1493, in tostring
encoding=encoding)
File "lxml.etree.pyx", line 2836, in lxml.etree.tostring (src/lxml/lxml.etree.c:53416)
TypeError: descriptor 'upper' of 'str' object needs an argument
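One possible explanation for this traceback: it is consistent with running the question's code unchanged on Python 2, where str is the byte-string type, while the comment in the question says to pass unicode there. A minimal sketch that should behave the same on Python 2 and 3, assuming a reasonably recent lxml, passes the string "unicode" as the encoding. The short document used here is just a stand-in.

```python
from lxml.html import fromstring, tostring

e = fromstring('<html><head><title>t</title></head><body><span></span></body></html>')

# The string "unicode" asks lxml for a text (unicode) result instead of bytes.
text = tostring(e, encoding="unicode")
print(type(text).__name__)
print(text)
```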
Whatever the cause of the discrepancy, I suggest setting the pretty_print argument to True:
>>> etree.tostring(e, pretty_print=True)
'<html>\n <head>\n <link href="/comments.css" rel="stylesheet" type="text/css"/>\n <link href="/index.css" rel="stylesheet" type="text/css"/>\n </head>\n <body>\n <span/>\n <span/>\n </body>\n</html>\n'
You will need to import etree first: from lxml import etree
When the result is written to an output file, the line breaks and indentation appear as real whitespace. The same goes for print:
>>> print(etree.tostring(e, pretty_print=True))
<html>
<head>
<link href="/comments.css" rel="stylesheet" type="text/css"/>
<link href="/index.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<span/>
<span/>
</body>
</html>
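Note that etree.tostring() serializes as XML by default, which is why the output above contains self-closing tags such as <span/>. A minimal sketch, assuming HTML-style output is wanted, passes method="html" (lxml.html.tostring uses that method by default). The short document here is a stand-in for the one in the question.

```python
from lxml import etree
from lxml.html import fromstring

e = fromstring('<html><head><link href="/index.css" rel="stylesheet"></head>'
               '<body><span></span></body></html>')

# Default method="xml": empty elements are written as self-closing tags.
as_xml = etree.tostring(e, pretty_print=True, encoding="unicode")

# method="html": <span> is written as <span></span>, and void elements
# such as <link> are written without a trailing slash.
as_html = etree.tostring(e, pretty_print=True, method="html", encoding="unicode")

print(as_xml)
print(as_html)
```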
I am sure you have checked out the API, but in case you haven't, here is information on tostring(). It is also safe to assume you have seen the tutorial on the lxml website. I would love to see some more 'good' resources; I am new to lxml myself, and anything new and good to read would be welcome.
Updated
You said you would consider sed if you could not find a good Python solution.
This should accomplish it with sed:
sed -i '1,2d;' input.html; sed -i '1 i\<html><head>' input.html
This runs two sed commands. The first deletes the first two lines of the file; the second inserts <html><head> as the new first line.
UPDATE #2
I should have thought about this more. You can do this with Python:
>>> import re
>>> newString = re.sub('\n ', '', etree.tostring(e,encoding=unicode,pretty_print=True), count=1)
>>> print(newString)
<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css"/>
<link href="/index.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<span/>
<span/>
</body>
</html>
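The same post-processing step can be written so that it runs on Python 3 and does not depend on the exact indentation width. This is only a sketch: the helper name join_html_head is hypothetical, and it works on an already serialized string, for example the result of etree.tostring(e, encoding="unicode", pretty_print=True).

```python
import re


def join_html_head(serialized):
    """Remove the whitespace between the first <html> and <head> tags (hypothetical helper)."""
    # \s+ matches the newline plus any indentation added by pretty_print.
    return re.sub(r'<html>\s+<head>', '<html><head>', serialized, count=1)


# A hand-written stand-in for pretty-printed serializer output.
pretty = (
    '<html>\n'
    '  <head>\n'
    '    <link href="/comments.css" rel="stylesheet" type="text/css"/>\n'
    '  </head>\n'
    '</html>\n'
)
result = join_html_head(pretty)
print(result)
```

Like the re.sub call above, this only repairs the first line; it does not restore whitespace that the parser has already discarded elsewhere.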
Upvotes: 2