Mark K

Reputation: 9348

Python - output the contents of an HTML file to a spreadsheet

Part of the code below comes from another example. I modified it a bit and use it to read an HTML file and output the contents to a spreadsheet.

As it’s just a local file, using Selenium may be overkill, but I want to learn through this example.

from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt

book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok = True)

driver = webdriver.PhantomJS()
ignore_tags=('script','noscript','style')

results = []

driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)

for elt in doc.iterdescendants():
    if elt.tag in ignore_tags: continue
    text = elt.text or ''                                 #question 1
    tail = elt.tail or ''                                 #question 1
    words = ''.join((text,tail)).strip()
    if words:                                   # extra question
        words = words.encode('utf-8')                     #question 2
        results.append(words)                             #question 3
        results.append('; ')                              #question 3

sheet.write (0, 0, results)

book.save("C:\\ source_output.xls")
  1. The lines text = elt.text or '' and tail = elt.tail or '': why can both .text and .tail hold text? And why is the or '' part important here?
  2. The text in the HTML file contains special characters like ° (temperature degrees). With .encode('utf-8') the output is not right, either in IDLE or in the Excel spreadsheet. What’s the alternative?
  3. Is it possible to join the output into a string instead of a list? At the moment I have to call .append twice to add the text and then the ; separator.

Upvotes: 1

Views: 164

Answers (2)

kums

Reputation: 2691

A simple example for Q1

from lxml import etree
test = etree.XML("<main>placeholder</main>")
print test.text #prints placeholder
print test.tail #prints None
print test.tail or ''  #prints empty string

test.text = "texter"
print etree.tostring(test) #prints <main>texter</main>

test.tail = "tailer"
print etree.tostring(test) #prints <main>texter</main>tailer

Upvotes: 1

sk11

Reputation: 1824

elt is an HTML node. It has attributes and may contain text. lxml exposes that text through .text or .tail, depending on where the text sits: .text is the text directly after the element's opening tag, and .tail is the text directly after its closing tag, before the next sibling.

<a attribute1='abc'> 
    some text     ----> .text of <a> gets this
    <p attributeP='def'> </p>
    some tail     ----> .tail of <p> gets this
</a>

The idea behind the or '' is that when the current HTML node has no text or tail, the attribute is None. Concatenating or joining None with a string later raises a TypeError. So to avoid that error, or '' substitutes an empty string whenever the text or tail is None.
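
The same behaviour can be tried out in a small sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made only to keep it dependency-free; both share the .text/.tail model), and node_words is a hypothetical helper name, not something from the question.

```python
import xml.etree.ElementTree as ET

# Stand-in markup mirroring the <a>/<p> sketch above.
root = ET.fromstring(
    "<a attribute1='abc'>some text<p attributeP='def'> </p>some tail</a>"
)
p = root.find("p")

print(repr(root.text))  # text right after the opening <a ...> tag
print(repr(p.tail))     # text right after the closing </p> tag
print(repr(root.tail))  # None, because nothing follows </a>


def node_words(elt):
    """Join an element's text and tail, treating None as ''."""
    return "".join((elt.text or "", elt.tail or "")).strip()


# root.iter() also visits the root itself, unlike lxml's iterdescendants().
pieces = [w for w in (node_words(e) for e in root.iter()) if w]
print(pieces)
```

Without the or '' guard, the join inside node_words would fail on the root element, whose tail is None.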


The degree character is a one-character Unicode string, but .encode('utf-8') turns it into a two-byte UTF-8 byte string, \xc2\xb0, which shows up as Â° when those bytes are displayed as Latin-1. So you do not need to encode the ° character yourself; the Python interpreter handles it correctly as Unicode. If it does not, put the correct encoding declaration at the top of your Python script (see PEP 263):

# -*- coding: UTF-8 -*-
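
A minimal sketch of what happens to the degree sign, written for Python 3, where str is already Unicode (the question's code looks like Python 2, so treat this as an illustration rather than a drop-in fix). The variable names are made up for the example.

```python
degree = "\u00b0"  # the degree sign as a one-character text string

# Encoding to UTF-8 produces two bytes for this single character.
encoded = degree.encode("utf-8")
print(len(degree), len(encoded))

# Decoding those bytes with the wrong codec gives two-character mojibake.
mojibake = encoded.decode("latin-1")

# Decoding with the matching codec restores the original character.
round_trip = encoded.decode("utf-8")
print(round_trip == degree)
```

Keeping the text as Unicode until the moment it is written out avoids the mismatch altogether.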

Yes, you can also build the output as a string. Strings have no append method, so use + instead, e.g.:

results = ''
results = results + 'whatever you want to join'

You can keep the list and combine your 2 lines:

results.append(words + '; ')

Note: I just checked the xlwt documentation, and sheet.write() expects a single cell value such as a string. So you cannot pass results, which is a list, as it stands; turn it into one string first.
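
One way to get a single string for the cell is to keep collecting into a list and join it once at the end. This is only a sketch: words_found holds made-up sample values standing in for the text extracted in the loop, and cell_text is a hypothetical name.

```python
# Hypothetical sample values standing in for the text found in the HTML.
words_found = ["Temp 21\u00b0", "Humidity 40%", "Wind 5 kt"]

results = []
for words in words_found:
    results.append(words)  # one append per piece, no separator needed

# str.join puts the separator between the pieces only, not after the last one.
cell_text = "; ".join(results)
print(cell_text)

# cell_text is a single string, so it could then be handed to
# sheet.write(0, 0, cell_text) from the question (not run here).
```

Compared with appending '; ' after every piece, joining at the end leaves no trailing separator.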

Upvotes: 1
