Using lxml.html to parse large html document

Question

I am trying to use lxml to get the two fields: iPhone and 29,906.

This is part of a much larger HTML file.

I have found how to extract the font in each td, but I need to be able to match the iPhone field and the 29,906 field.

One way I can think of is to put everything into one long array, search for "iPhone", and return the value two positions after it, but this seems long-winded and inefficient.

Can anyone please guide me in the right direction?

This is what I have so far:

from bs4 import BeautifulSoup
import requests
from lxml import html, cssselect

link = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/a10-qq320186302018.htm"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
str_soup = str(soup)
doc = html.document_fromstring(str_soup)
for col in doc.cssselect('font'):
    # .get() returns None when the tag has no style attribute, so no try/except is needed.
    style = col.get('style')
    if style == "font-family:Helvetica,sans-serif;font-size:9pt;" and col.text is not None:
        print(col.text.strip())

This returns all the text but not how I need it.

hdizzle · Accepted Answer

I didn't get exactly what I wanted, but this is what I have come up with so far to build on:

from bs4 import BeautifulSoup
import requests
from lxml import html, cssselect
import csv


link = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/a10-qq320186302018.htm"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
str_soup = str(soup)
doc = html.document_fromstring(str_soup)


# newline='' stops the csv module from writing blank lines between rows on Windows.
with open('AAPL_financials.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for tr in doc.cssselect('tr'):
        row = []
        for font in tr.cssselect('font'):
            if font.text is None:
                continue
            value = font.text.strip()
            # Skip empty cells and stand-alone "$", "%", and ")" cells.
            if value in ("", "$", "%", ")"):
                continue
            # A leading "(" marks a negative number, e.g. "(1,234" becomes "-1,234".
            if value[0] == "(":
                value = value.replace("(", "-")
            row.append(value)
        writer.writerow(row)
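
To pair a label such as iPhone with the number next to it, one possible follow-up is to keep the row lists in memory and look a row up by its first cell. This is only a minimal sketch and has not been run against the filing itself: find_row and to_number are hypothetical helper names, and the sample rows are made up apart from the iPhone / 29,906 pair from the question. It assumes the label is the first cell of its row, which is how the loop above builds each row.

```python
def find_row(rows, label):
    """Return the first row whose first cell equals `label`, or None."""
    for row in rows:
        if row and row[0] == label:
            return row
    return None


def to_number(value):
    """Convert a cleaned cell such as '29,906' or '-1,234' to an int."""
    return int(value.replace(",", ""))


# Made-up sample rows in the shape the loop above produces; only the
# "iPhone" / "29,906" pair comes from the question.
sample_rows = [
    [],
    ["Example header"],
    ["iPhone", "29,906"],
    ["Some other line", "-1,234"],
]

row = find_row(sample_rows, "iPhone")
if row is not None:
    label, first_value = row[0], row[1]
    print(label, to_number(first_value))
```

In the real script, each row could be appended to a list just before writer.writerow(row) and that list passed to find_row, which avoids the "search one long array and take index + 2" approach because every value stays grouped with its own table row.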
