Reputation: 9348
Part of the code below comes from another example. I modified it a bit and use it to read an HTML file and write the contents into a spreadsheet.
As it's just a local file, using Selenium is probably overkill, but I want to learn through this example.
from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok = True)
driver = webdriver.PhantomJS()
ignore_tags=('script','noscript','style')
results = []
driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)
for elt in doc.iterdescendants():
    if elt.tag in ignore_tags: continue
    text = elt.text or '' #question 1
    tail = elt.tail or '' #question 1
    words = ''.join((text,tail)).strip()
    if words: # extra question
        words = words.encode('utf-8') #question 2
        results.append(words) #question 3
        results.append('; ') #question 3
sheet.write (0, 0, results)
book.save("C:\\ source_output.xls")
1. text = elt.text or '' and tail = elt.tail or '': why do both .text and .tail hold text? And why is the or '' part important here?
2. The text contains characters such as ° (temperature degrees), and .encode('utf-8') doesn't give a clean result in either IDLE or the Excel spreadsheet. What's the alternative?
3. Can the output be joined into one string? Right now I .append twice to get the text and the ; added.
Upvotes: 1
Views: 164
Reputation: 2691
A simple example for Q1
from lxml import etree
test = etree.XML("<main>placeholder</main>")
print test.text #prints placeholder
print test.tail #prints None
print test.tail or '' #prints empty string
test.text = "texter"
print etree.tostring(test) #prints <main>texter</main>
test.tail = "tailer"
print etree.tostring(test) #prints <main>texter</main>tailer
Upvotes: 1
Reputation: 1824
elt is an HTML node. It contains certain attributes and a text section. lxml provides a way to extract all the attributes and text, using .text or .tail depending on where the text is.
<a attribute1='abc'>
    some text ----> .text of <a> gets this
    <p attributeP='def'> </p>
    some tail ----> .tail of <p> gets this
</a>
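To make the diagram concrete, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree, which uses the same .text/.tail model as lxml, so it runs without extra installs. The fragment and the helper name text_and_tail are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: text directly inside <a>, then a child <p>, then text after </p>.
root = ET.fromstring("<a>some text<p>inner</p>some tail</a>")


def text_and_tail(element):
    """Return (tag, text, tail) for the element and each descendant, with None replaced by ''."""
    return [(node.tag, node.text or "", node.tail or "") for node in element.iter()]


pairs = text_and_tail(root)
for tag, text, tail in pairs:
    print(tag, repr(text), repr(tail))
```

Note that the text following </p> is stored as the tail of the p element, not of a, which is why the loop in the question has to read both attributes on every node.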
The idea behind the or '' is that if no text/tail is found in the current HTML node, the attribute is None. Later, when we try to concatenate or join that None with a string, Python complains. So to avoid that error, an empty string '' is used whenever the text/tail is None.
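A small sketch of that failure and the fix, in plain Python with no lxml involved. The helper name safe_join is hypothetical and only mirrors the two lines from the question.

```python
def safe_join(text, tail):
    """Join text and tail, treating None as an empty string."""
    return "".join((text or "", tail or "")).strip()


# Without the "or ''" guard, joining None with a string raises TypeError.
try:
    "".join(("abc", None))
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(error_name)
print(repr(safe_join(None, " some tail ")))
```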
The degree character is a one-character unicode string, but .encode('utf-8') turns it into a 2-byte UTF-8 byte string, \xc2\xb0. Read back as Latin-1, those two bytes show up as Â°, and encoding that a second time gives \xc3\x82\xc2\xb0. So you do not have to encode the ° character yourself; the Python interpreter handles it correctly. If it does not, put the correct encoding declaration at the top of your Python script (see PEP 263):
# -*- coding: UTF-8 -*-
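The byte-level behaviour can be checked with a short sketch. It is written in Python 3 syntax (the code in this thread is Python 2), and the variable names are only illustrative.

```python
degree = "\u00b0"  # the degree sign as a one-character text string

encoded = degree.encode("utf-8")          # two bytes: C2 B0
misread = encoded.decode("latin-1")       # what a Latin-1 viewer would display
double_encoded = misread.encode("utf-8")  # encoding the misread text again: four bytes
round_trip = encoded.decode("utf-8")      # decoding with the right codec restores the character

print(len(degree), encoded, double_encoded)
```

Seeing Â° in the output usually means UTF-8 bytes were shown or decoded with a one-byte codec somewhere along the way, so keeping the value as text until it is written is the simpler route.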
Yes, you can also join the output into a string. Just use +, as there is no append for string types, e.g.
results = ''
results = results + 'whatever you want to join'
You can keep the list and combine your 2 lines:
results.append(words + '; ')
Note: I just checked the xlwt documentation, and sheet.write() expects a single value such as a string for the cell. So you cannot pass results while it is still a list.
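One way to get a single string for the cell is str.join, sketched below. The sample values in words_found are invented for illustration, and the name cell_text is hypothetical.

```python
# Invented sample values standing in for the words collected in the loop.
words_found = ["Temp 20\u00b0", "Depth 5 m", "Station A"]

# Join once at the end instead of appending '; ' after every item.
cell_text = "; ".join(words_found)

print(cell_text)
# cell_text is one string, so it could then be passed as the value,
# e.g. sheet.write(0, 0, cell_text) with the sheet object from the question.
```

Unlike appending '; ' after each item, join only places the separator between items, so there is no trailing '; ' at the end.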
Upvotes: 1