Reputation: 33
I'm trying to extract the text inside every list item ('li') in the following HTML:
<ul id="tco_detail_data">
<li>
<ul class="list-title">
<li class="first"> </li>
<li>Year 1</li>
<li>Year 2</li>
<li>Year 3</li>
<li>Year 4</li>
<li>Year 5</li>
<li class="last">5 Yr Total</li>
</ul>
</li>
<hr class="loose-dotted" />
<li class="first">
<ul class="first">
<li class="first">Depreciation</li>
<li>$5,390</li>
<li>$1,658</li>
<li>$1,459</li>
<li>$1,293</li>
<li>$1,161</li>
<li class="last">$10,961</li>
</ul>
</li>
<hr class="loose-dotted" />
<li>
<ul>
<li class="first">Taxes & Fees</li>
<li>$1,424</li>
<li>$61</li>
<li>$61</li>
<li>$61</li>
<li>$61</li>
<li class="last">$1,668</li>
</ul>
</li>
<hr class="loose-dotted" />
<li>
<ul>
<li class="first">Financing</li>
<li>$1,022</li>
<li>$817</li>
<li>$603</li>
<li>$375</li>
<li>$135</li>
<li class="last">$2,952</li>
</ul>
To get to this point I used the following:
import requests
from bs4 import BeautifulSoup
import csv
page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/')
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find_all("ul", {"id": "tco_detail_data"})
Now, to extract all the lines under class="first", I used:
details = soup.find_all("li", {"class":"first"})
However, it only gets the first parent li tag and the child li tags under it. How can I repeat the process for each li class="first" section and write the results to a CSV file? I would appreciate any guidance.
Upvotes: 3
Views: 5107
Reputation: 5215
Here's a similar approach to the previous answer, which will give you the table from the web page in nested list form (i.e. [[table row], [table row], ...]):
data = soup.find_all("ul", {"id": "tco_detail_data"})
# get all list elements
lis = data[0].find_all('li')
# add a helper lambda, just for readability
find_ul = lambda x: x.find_all('ul')
uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]
# use a nested list comprehension to iterate over the <ul> tags
# and extract text from each <li> into sublists
text = [[li.text.encode('utf-8') for li in ul[0].find_all('li')] for ul in uls]
# [
# ['\xc2\xa0', 'Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5', '5 Yr Total'],
# ['Depreciation', '$4,853', '$1,658', '$1,459', '$1,293', '$1,161', '$10,424'],
# ['Taxes & Fees', '$2,057', '$21', '$66', '$21', '$66', '$2,231'],
# ['Financing', '$1,026', '$821', '$605', '$376', '$136', '$2,964'],
# ['Fuel', '$1,606', '$1,654', '$1,704', '$1,755', '$1,808', '$8,527'],
# ['Insurance', '$764', '$791', '$818', '$847', '$877', '$4,097'],
# ['Maintenance', '$230', '$601', '$385', '$1,653', '$1,504', '$4,373'],
# ['Repairs', '$0', '$0', '$109', '$257', '$374', '$740'],
# ['Tax Credit', '$0', '', '', '', '', '$0'],
# ['True Cost to Own \xc2\xae', '$10,536', '$5,546', '$5,146', '$6,202', '$5,926', '$33,356']
# ]
# write "text" list to csv
with open('ford_escape_2017.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(text)
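The snippet above looks like it was written for Python 2 (the output shows encoded byte strings). On Python 3, .encode('utf-8') returns bytes objects, which csv.writer would write as b'...' literals, so a variant without the encode step may be preferable. Here is a minimal sketch of that idea. It runs against a shortened copy of the markup from the question rather than the live page, and SAMPLE_HTML and extract_rows are names I made up for illustration:

```python
import csv
import io

from bs4 import BeautifulSoup

# Assumed sample markup, shortened from the question; not fetched from the live page.
SAMPLE_HTML = """
<ul id="tco_detail_data">
  <li>
    <ul class="list-title">
      <li class="first">&nbsp;</li>
      <li>Year 1</li>
      <li class="last">5 Yr Total</li>
    </ul>
  </li>
  <li class="first">
    <ul class="first">
      <li class="first">Depreciation</li>
      <li>$5,390</li>
      <li class="last">$10,961</li>
    </ul>
  </li>
</ul>
"""


def extract_rows(html):
    """Return one list of cell strings per nested <ul> (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    outer = soup.find('ul', id='tco_detail_data')
    rows = []
    # find_all('ul') searches descendants only, so the outer <ul> is not included.
    for ul in outer.find_all('ul'):
        rows.append([li.get_text(strip=True) for li in ul.find_all('li')])
    return rows


rows = extract_rows(SAMPLE_HTML)

# Write to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write to disk instead, opening the file with open('ford_escape_2017.csv', 'w', newline='', encoding='utf-8') is the usual pattern for the csv module on Python 3.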
Upvotes: 3
Reputation: 1357
I'm not sure if the output I have is what you have in mind, as you didn't provide a sample output.
Code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/').text
soup = BeautifulSoup(page, 'html.parser')
uls = soup.find_all('ul', id='tco_detail_data')
for ul in uls:
    newsoup = BeautifulSoup(str(ul), 'html.parser')
    lis = newsoup.find_all('li')
    for li in lis:
        print(li.text)
Output:
Year 1
Year 2
Year 3
Year 4
Year 5
5 Yr Total
Year 1
Year 2
Year 3
Year 4
Year 5
5 Yr Total
Depreciation
$5,219
$1,658
$1,459
$1,293
$1,161
$10,790
Depreciation
$5,219
$1,658
$1,459
$1,293
$1,161
$10,790
Taxes & Fees
$2,257
$195
$184
$175
$166
$2,977
Taxes & Fees
$2,257
$195
$184
$175
$166
$2,977
Financing
$1,051
$842
$620
$386
$139
$3,038
Financing
$1,051
$842
$620
$386
$139
$3,038
Fuel
$1,906
$1,963
$2,022
$2,083
$2,146
$10,120
Fuel
$1,906
$1,963
$2,022
$2,083
$2,146
$10,120
Insurance
$1,160
$1,201
$1,243
$1,286
$1,331
$6,221
Insurance
$1,160
$1,201
$1,243
$1,286
$1,331
$6,221
Maintenance
$274
$716
$447
$1,849
$1,637
$4,923
Maintenance
$274
$716
$447
$1,849
$1,637
$4,923
Repairs
$0
$0
$134
$318
$465
$917
Repairs
$0
$0
$134
$318
$465
$917
Tax Credit
$0
$0
Tax Credit
$0
$0
True Cost to Own ®
$11,867
$6,575
$6,109
$7,390
$7,045
$38,986
True Cost to Own ®
$11,867
$6,575
$6,109
$7,390
$7,045
$38,986
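Every row shows up twice in the output above. My best guess at the cause is that find_all('li') searches recursively, so it matches each outer li (whose .text is all of its children's text joined together) and then each inner li again. A rough sketch of one way around that, using recursive=False to stay on direct children, is below. The markup is a one-row sample I shortened from the question, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Assumed one-row sample based on the question's markup, not the live page.
SAMPLE_HTML = (
    '<ul id="tco_detail_data"><li><ul>'
    '<li class="first">Financing</li><li>$1,022</li><li class="last">$2,952</li>'
    '</ul></li></ul>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
outer = soup.find('ul', id='tco_detail_data')

# Recursive search (the default): the outer <li> plus every inner <li>.
all_lis = outer.find_all('li')

# Direct children only: just the outer <li> wrappers, one per table row.
row_lis = outer.find_all('li', recursive=False)

for row in row_lis:
    inner = row.find('ul')
    # Again restrict to direct children so each cell is visited once.
    cells = [li.get_text(strip=True) for li in inner.find_all('li', recursive=False)]
    print(cells)
```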
To save the results to a CSV file, I borrowed from cmaher's answer, which handles creating the file. My first snippet is only meant to show all the text between the li tags.
Note that I used pipes instead of commas as delimiters for the CSV file content, because the data contains commas.
Code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/').text
soup = BeautifulSoup(page, 'html.parser')
data = soup.find_all("ul", {"id": "tco_detail_data"})
lis = data[0].find_all('li')
find_ul = lambda x: x.find_all('ul')
uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]
text = [[li for li in ul[0].find_all('li')] for ul in uls]
with open('csvfile.csv', 'w') as file:
    for lis in text:
        temp = ''
        for li in lis:
            temp += li.text + '|'
        temp += '\n'
        file.write(temp)
Output:
|Year 1|Year 2|Year 3|Year 4|Year 5|5 Yr Total|
Depreciation|$5,219|$1,658|$1,459|$1,293|$1,161|$10,790|
Taxes & Fees|$2,257|$195|$184|$175|$166|$2,977|
Financing|$1,051|$842|$620|$386|$139|$3,038|
Fuel|$1,906|$1,963|$2,022|$2,083|$2,146|$10,120|
Insurance|$1,160|$1,201|$1,243|$1,286|$1,331|$6,221|
Maintenance|$274|$716|$447|$1,849|$1,637|$4,923|
Repairs|$0|$0|$134|$318|$465|$917|
Tax Credit|$0|||||$0|
True Cost to Own ®|$11,867|$6,575|$6,109|$7,390|$7,045|$38,986|
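As a side note on the delimiter choice: Python's csv module quotes any field that contains the delimiter, so commas inside values such as $5,219 should survive a normal comma-separated file, and csv.writer also accepts delimiter='|' if pipes are still preferred (without the trailing pipe that manual concatenation leaves on each line). A small sketch with two rows shortened from the output above, written to in-memory buffers rather than real files:

```python
import csv
import io

# Two rows shortened from the output above, for illustration only.
rows = [
    ['Depreciation', '$5,219', '$1,658', '$10,790'],
    ['Tax Credit', '$0', '', '$0'],
]

# Comma-delimited: fields containing commas are quoted automatically.
comma_buf = io.StringIO()
csv.writer(comma_buf).writerows(rows)

# Pipe-delimited: same writer, different delimiter, no trailing separator.
pipe_buf = io.StringIO()
csv.writer(pipe_buf, delimiter='|').writerows(rows)

# Reading the comma version back recovers the original cells.
comma_buf.seek(0)
round_trip = list(csv.reader(comma_buf))

print(comma_buf.getvalue())
print(pipe_buf.getvalue())
```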
Upvotes: 2