Reputation: 107
<td style="vertical-align:bottom;background-color:#efefef;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-indent:26px;font-size:9pt;">
<font style="font-family:Helvetica,sans-serif;font-size:9pt;">
iPhone
</font>
<font style="font-family:Helvetica,sans-serif;font-size:9pt;">
<sup style="vertical-align:top;line-height:120%;font-size:pt">
(1)
</sup>
</font>
</div>
</td>
<td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;background-color:#efefef;">
<div style="text-align:left;font-size:9pt;">
<font style="font-family:Helvetica,sans-serif;font-size:9pt;">
$
</font>
</div>
</td>
<td style="vertical-align:bottom;background-color:#efefef;padding-top:2px;padding-bottom:2px;">
<div style="text-align:right;font-size:9pt;">
<font style="font-family:Helvetica,sans-serif;font-size:9pt;">
29,906
</font>
</div>
</td>
<td style="vertical-align:bottom;background-color:#efefef;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;">
<br/>
</font>
</div>
</td>
I am trying to use lxml to extract two fields from this markup: the label iPhone and the value 29,906.
The snippet is part of a much larger HTML file.
I have worked out how to extract the font text from each td, but I need to pair the iPhone label with its 29,906 value.
One approach I can think of is to put every text value into one long list, search for "iPhone", and return the element two positions after it, but that seems long-winded and inefficient.
Can anyone point me in the right direction?
This is what I have so far:
from bs4 import BeautifulSoup
import requests
from lxml import html, cssselect

link = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/a10-qq320186302018.htm"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
str_soup = str(soup)
doc = html.document_fromstring(str_soup)
for col in doc.cssselect('font'):
    try:
        style = col.attrib['style']
        if style == "font-family:Helvetica,sans-serif;font-size:9pt;":
            print(col.text.strip())
    except (KeyError, AttributeError):
        # Skip <font> tags with no style attribute or no direct text.
        pass
This prints all of the text, but not paired up the way I need it.
Upvotes: 0
Views: 242
Reputation: 107
I didn't get exactly what I wanted, but this is what I have come up with so far to build on:
from bs4 import BeautifulSoup
import requests
from lxml import html, cssselect
import csv

link = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/a10-qq320186302018.htm"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
str_soup = str(soup)
doc = html.document_fromstring(str_soup)
# newline='' stops the csv module from writing blank lines on Windows.
with open('AAPL_financials.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for col in doc.cssselect('tr'):
        row = []
        for text in col.cssselect('font'):
            if text.text is None:
                continue
            value = text.text.strip()
            # Skip empty cells and stand-alone symbols.
            if value in ("", "$", "%", ")"):
                continue
            # A leading "(" marks a negative number, e.g. "(1,234" becomes "-1,234".
            if value[0] == "(":
                value = value.replace("(", "-")
            row.append(value)
        writer.writerow(row)
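If the values are needed as numbers rather than strings, the same cleanup rules could be pushed into a small helper. This is only a sketch: `parse_amount` is a name made up for illustration, and it assumes amounts look like "29,906", with negatives written as "(1,234" or "(1,234)".

```python
def parse_amount(text):
    """Turn a table cell such as '29,906' or '(1,234' into a number.

    Returns None when the cell is not numeric (labels, '$', '%', blanks).
    The accepted formats are assumptions about how the filing prints amounts.
    """
    cleaned = text.strip().replace(",", "").replace("$", "")
    negative = cleaned.startswith("(")  # accounting-style negative
    cleaned = cleaned.strip("()")
    try:
        number = float(cleaned) if "." in cleaned else int(cleaned)
    except ValueError:
        return None
    return -number if negative else number


print(parse_amount("29,906"))   # 29906
print(parse_amount("(1,234"))   # -1234
print(parse_amount("iPhone"))   # None
```

Returning None for non-numeric cells means the caller can keep the label text and convert only the cells that parse.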
Upvotes: 0
Reputation: 6639
How about this?
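from bs4 import BeautifulSoup
import re

# html is assumed to hold the snippet from the question as a string.
soup = BeautifulSoup(html, 'html.parser')
x = soup.find_all('font')
# x[0] is the label and x[3] the amount; x[1] is the footnote marker and x[2] the "$".
name = re.sub(r"\s+", "", x[0].get_text())
value = re.sub(r"\s+", "", x[3].get_text())
print(name, 'costs', value)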
Output:
iPhone costs 29,906
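The fixed indices above only work on the isolated snippet. For the full filing, one option might be to look the row up by its label with lxml and read the remaining cells of that same row. The sketch below is untested against the real document: `SAMPLE` is a trimmed stand-in that assumes the cells sit inside a `<tr>`, the second row is invented purely for illustration, and `row_values` and `SKIP` are hypothetical names.

```python
from lxml import html

# Trimmed stand-in for the filing; the <table>/<tr> wrapper is an assumption
# and the second row is made-up data for illustration only.
SAMPLE = """
<table>
  <tr>
    <td><div><font>iPhone</font><font><sup>(1)</sup></font></div></td>
    <td><div><font>$</font></div></td>
    <td><div><font>29,906</font></div></td>
    <td><div><font><br/></font></div></td>
  </tr>
  <tr>
    <td><div><font>Example product</font></div></td>
    <td><div><font>$</font></div></td>
    <td><div><font>1,234</font></div></td>
  </tr>
</table>
"""

SKIP = {"$", "%", ")"}  # stand-alone symbols that occupy their own cells


def row_values(doc, label):
    """Return the non-empty cells that follow `label` in its table row.

    Matches the first row whose first non-empty cell starts with `label`;
    returns None if no row matches.
    """
    for tr in doc.iter("tr"):
        # Collapse whitespace inside each cell, then drop blanks and symbols.
        cells = [" ".join(td.text_content().split()) for td in tr.iter("td")]
        cells = [c for c in cells if c and c not in SKIP]
        if cells and cells[0].startswith(label):
            return cells[1:]
    return None


doc = html.fromstring(SAMPLE)
print(row_values(doc, "iPhone"))  # ['29,906']
```

Matching with `startswith` tolerates the footnote marker after the label, but it would also match any other label beginning with the same text, so a stricter comparison may be needed on the real page.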
Upvotes: 1