Reputation: 27
I am building my first web scraper using Python and BS4. I wanted to investigate time-trial data from the 2018 KONA Ironman World Championship. What is the best method for converting JSON to CSV?
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')

    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break

    table = BeautifulSoup(data, 'html.parser')

    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]

        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })

    return result

with open('data.json', 'w') as jsonfile:
    json.dump(parse_table(soup), jsonfile)

print(json.dumps(parse_table(soup), indent=3))
The JSON output contains the name of the athlete followed by their division, gender, and overall ranks, as well as their swim, bike, and run times:
{
"Avila, Anthony 2470": [ {
"div_rank": "138", "gender_rank": "1243", "overall_rank": "1565", "swim": "01:20:11", "bike": "05:27:59", "run": "04:31:56"
}
],
"Lindgren, Mikael 1050": [ {
"div_rank": "151", "gender_rank": "872", "overall_rank": "983", "swim": "01:09:06", "bike": "05:17:51", "run": "03:49:20"
}
],
"Umezawa, Kazuyoshi 1870": [ {
"div_rank": "229", "gender_rank": "1589", "overall_rank": "2186", "swim": "01:17:22", "bike": "06:14:45", "run": "07:16:21"
}
],
"Maric, Bojan 917": [ {
"div_rank": "162", "gender_rank": "923", "overall_rank": "1065", "swim": "01:03:22", "bike": "05:13:56", "run": "04:01:45"
}
],
"Nishioka, Maki 2340": [ {
"div_rank": "6", "gender_rank": "52", "overall_rank": "700", "swim": "00:58:40", "bike": "05:19:10", "run": "03:33:58"
}...
}
Upvotes: 2
Views: 604
Reputation: 28630
There are a few things you could look into. You could work with pandas and read the saved JSON back in, or, as I did, just iterate over the key/value pairs you already have and put them into a dataframe (full code below the sketch). Once you have a dataframe, writing a CSV is just a call to .to_csv().
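For the first route, here is a minimal sketch, assuming the data.json file your script writes (results.csv is just an example output name). The nested name -> list-of-records layout doesn't line up neatly with read_json's orients, so it is easier to load it with the json module and flatten it yourself:

import json
import pandas as pd

# Load the dict of name -> list of result records written by the scraper.
with open('data.json') as f:
    data = json.load(f)

# Flatten to one row per record, carrying the athlete name along.
# ('results.csv' is an example filename, not from the original script.)
rows = [dict(record, name=name)
        for name, records in data.items()
        for record in records]

pd.DataFrame(rows).to_csv('results.csv', index=False)

And here is the full version of the scrape-and-iterate route: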
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests
import pandas as pd

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')

    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break

    table = BeautifulSoup(data, 'html.parser')

    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]

        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })

    return result

jsonObj = parse_table(soup)

# Build one small dataframe per athlete, then combine them.
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job.)
frames = []
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    frames.append(temp_df)

result = pd.concat(frames).reset_index(drop=True)
result.to_csv('path/to/filename.csv', index=False)
Output:
print (result)
bike ... name
0 05:27:59 ... Avila, Anthony 2470
1 05:17:51 ... Lindgren, Mikael 1050
2 06:14:45 ... Umezawa, Kazuyoshi 1870
3 05:13:56 ... Maric, Bojan 917
4 05:19:10 ... Nishioka, Maki 2340
5 04:32:26 ... Rana, Ivan 18
6 04:49:08 ... Spalding, Joel 1006
7 04:50:10 ... Samuel, Mark 2479
8 06:45:57 ... Long, Felicia 1226
9 05:24:33 ... Mccarroll, Charles 1355
10 06:36:36 ... Freeman, Roger 154
11 --:--:-- ... Solis, Eduardo 1159
12 04:55:29 ... Schlohmann, Thomas 1696
13 05:39:18 ... Swinson, Renee 1568
14 04:40:41 ... Mechin, Antoine 2226
15 05:23:18 ... Hammond, Serena 1548
16 05:15:10 ... Hassel, Diana 810
17 06:15:59 ... Netto, Laurie-Ann 1559
18 --:--:-- ... Mazur, Maksym 1412
19 07:11:19 ... Weiskopf-Larson, Sue 870
20 05:49:02 ... Sosnowska, Aleksandra 1921
21 06:45:48 ... Wendel, Sandra 262
22 04:39:46 ... Oosterdijk, Tom 2306
23 06:03:01 ... Moss, Julie 358
24 06:24:58 ... Borgio, Alessandro 726
25 05:07:42 ... Newsome, Jason 1058
26 04:44:46 ... Wild, David 2008
27 04:46:06 ... Weitz, Matti 2239
28 04:41:05 ... Gyde, Sam 1288
29 05:27:55 ... Yamauchi, Akio 452
... ... ...
2442 04:38:36 ... Lunn, Paul 916
2443 05:27:27 ... Van Soest, John 1169
2444 06:07:56 ... Austin, John 194
2445 05:20:26 ... Mcgrath, Scott 1131
2446 04:53:27 ... Pike, Chris 1743
2447 05:23:20 ... Ball, Duncan 722
2448 05:33:26 ... Fauske Haferkamp, Cathrine 1222
2449 05:17:34 ... Vocking, Peter 641
2450 05:15:30 ... Temme, Travis 1010
2451 07:14:14 ... Sun, Shiyi 2342
2452 04:52:14 ... Li, Peng Cheng 2232
2453 06:26:26 ... Lloyd, Graham 148
2454 04:44:42 ... Bartelle, Daniel 1441
2455 04:51:58 ... Overmars, Koen Pieter 1502
2456 05:23:24 ... Muroya, Koji 439
2457 05:45:42 ... Brown, Ani De Leon 1579
2458 06:42:16 ... Peters, Nancy 370
2459 06:43:07 ... Albarote, Lezette 1575
2460 04:50:45 ... Mohr, Robert 1990
2461 07:17:40 ... Hirose, Roen 497
2462 05:12:10 ... Girardi, Bruno 312
2463 04:59:44 ... Cowan, Anthony 966
2464 06:03:59 ... Hoskens, Rudy 433
2465 04:32:20 ... Baker, Edward 1798
2466 05:11:07 ... Svetana, Ushakova 2343
2467 05:56:06 ... Peterson, Greg 779
2468 05:22:15 ... Wallace, Patrick 287
2469 05:53:14 ... Lott, Jon 914
2470 05:00:29 ... Goodlad, Martin 977
2471 04:34:59 ... Maley, Joel 1840
[2472 rows x 7 columns]
ADDITIONAL
Just also to point out: the data is already returned to you as a JSON structure. The only catch is that you'd need to work out the query parameters and iterate over the pages, which is much slower than your code, so there is a trade-off, unless you dig into the query parameters and find a way to return all 2,460 results at once instead of 30 per page. But it is another option for getting that JSON structure.
You can then take the JSON structure, normalize it into a dataframe, and save that as a CSV.
import requests
import pandas as pd

request_url = 'http://m.ironman.com/Handlers/EventLiveResultsMobile.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

params = {
    'year': '2018',
    'race': 'worldchampionship',
    'q': '',
    'p': '',
    'so': 'orank',
    'sd': ''}

# First request just tells us how many pages there are.
response = requests.get(request_url, headers=headers, params=params)
jsonObj = response.json()
lastPage = jsonObj['lastPage']

frames = []
# Include the last page (assuming lastPage is the final page number).
for page in range(1, lastPage + 1):
    print('Processed Page: ' + str(page))

    # Update the page query parameter on every iteration.
    params['p'] = str(page)
    response = requests.get(request_url, headers=headers, params=params)
    jsonObj = response.json()

    # json_normalize now lives on the top-level pandas namespace.
    frames.append(pd.json_normalize(jsonObj['records']))

result = pd.concat(frames).reset_index(drop=True)
result.to_csv('path/to/filename.csv', index=False)
Upvotes: 2