Reputation: 51
I have been trying to parse a table from the URL http://podaac.jpl.nasa.gov/ws/search/granule/index.html, but the table is so large that it takes a moment to load after the page itself has loaded. Since BeautifulSoup only sees the initial HTML returned by the server, it can't see the complete table, only the table's headings.
from bs4 import BeautifulSoup as bs
import requests

datasetIds = []
html = requests.get('http://podaac.jpl.nasa.gov/ws/search/granule/index.html')
soup = bs(html.text, 'html.parser')
table = soup.find("table", {"id": "tblDataset"})
print(table)
rows = table.find_all('tr')
rows.remove(rows[0])  # drop the header row
for row in rows:
    x = row.find_all('td')
    datasetIds.append(x[1].text.encode('utf-8'))
print(datasetIds)
The code should return the datasetIds from the first table, but it only returns the table headings. Thanks in advance for any help! :)
Upvotes: 3
Views: 1054
Reputation: 180401
The data is retrieved with an ajax request; you can do a GET on that endpoint directly, which returns nicely formatted JSON:
json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/?q=*:*&fl=Dataset-PersistentId,Dataset-ShortName-Full&rows=2147483647&fq=DatasetPolicy-AccessType-Full:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+DatasetPolicy-ViewOnline:Y&wt=json").json()
print(json)
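As a side note, the long query URL above is easier to read and tweak if you build it from a dict of parameters. This is just a sketch of the same request; the parameter names and values are copied from the URL above:

```python
from urllib.parse import urlencode

# The same Solr query, built from a params dict instead of one long URL.
params = {
    "q": "*:*",
    "fl": "Dataset-PersistentId,Dataset-ShortName-Full",
    "rows": 2147483647,
    "fq": "DatasetPolicy-AccessType-Full:(OPEN OR PREVIEW OR SIMULATED OR REMOTE)"
          " AND DatasetPolicy-ViewOnline:Y",
    "wt": "json",
}
url = "http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/?" + urlencode(params)
print(url)

# With requests you can skip urlencode and pass the dict directly:
#   requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/",
#                params=params).json()
```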
We just need to pull the data out using a couple of keys:
from pprint import pprint as pp
pp(json["response"]["docs"])
A snippet of the output:
[{'Dataset-PersistentId': 'PODAAC-MODST-M8D9N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODST-MAN4N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-MMO9N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODST-M1D9N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-GHMTG-2PN01',
'Dataset-ShortName-Full': 'NAVO-L2P-AVHRRMTA_G'},
{'Dataset-PersistentId': 'PODAAC-GHBDM-4FD01',
'Dataset-ShortName-Full': 'DMI-L4UHfnd-NSEABALTIC-DMI_OI'},
{'Dataset-PersistentId': 'PODAAC-GHGOY-4FE01',
'Dataset-ShortName-Full': 'EUR-L4HRfnd-GLOB-ODYSSEA'},
{'Dataset-PersistentId': 'PODAAC-GHMED-4FE01',
'Dataset-ShortName-Full': 'EUR-L4UHFnd-MED-v01'},
{'Dataset-PersistentId': 'PODAAC-NSGDR-L2X02',
'Dataset-ShortName-Full': 'NSCAT_LEVEL_2_V2'},
{'Dataset-PersistentId': 'PODAAC-MODST-M1D4N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-MMO4N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODST-MMO4N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-MAN9N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_ANNUAL_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-M8D4N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_8DAY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-M1D4N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-GOES3-24HOR',
'Dataset-ShortName-Full': 'GOES_L3_SST_6km_NRT_SST_24HOUR'},
That gives you all the pairs of Dataset ID and Short Name from the table without needing bs4 at all.
To get the ids, you just access each dict using the key Dataset-PersistentId:
for d in json["response"]["docs"]:
    print("ID for {Dataset-ShortName-Full} is {Dataset-PersistentId}".format(**d))
Some output:
ID for OSTM_L2_OST_OGDR_GPS is PODAAC-J2ODR-GPS00
ID for JPL-L4UHblend-NCAMERICA-RTO_SST_Ad is PODAAC-GHRAD-4FJ01
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-SEABY-ANBIM
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-SEABY-ANBML
ID for CCMP_MEASURES_ATLAS_L4_OW_L3_5A_5DAY_WIND_VECTORS_FLK is PODAAC-CCF35-01AD5
ID for QSCAT_BYU_L3_OW_SIGMA0_ARCTIC_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-QSBYU-ARBML
ID for MODIS_AQUA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME is PODAAC-MODSA-MAN4N
ID for UCLA_DEALIASED_SASS_L3 is PODAAC-SASSX-L3UCD
ID for NSCAT_LEVEL_1.7_V2 is PODAAC-NSSDR-17X02
ID for NSCAT_LEVEL_3_V2 is PODAAC-NSJPL-L3X02
ID for AVHRR_NAVOCEANO_L3_18km_MCSST_DAYTIME is PODAAC-NAVOC-318DY
ID for QSCAT_L3_OW_JPL_BROWSE_IMAGES is PODAAC-QSXXX-L3BI0
ID for QSCAT_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-QSBYU-ANBIM
ID for NAVO-L4HR1m-GLOB-K10_SST is PODAAC-GHK10-41N01
ID for NCDC-L4LRblend-GLOB-AVHRR_AMSR_OI is PODAAC-GHAOI-4BC01
ID for SEAWINDS_LEVEL_3_V2 is PODAAC-SEAXX-L3X02
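If all you want is the flat list of datasetIds the question asked for, a list comprehension over the docs is enough. Sketched here against a small hard-coded sample in the same shape as json["response"]["docs"], so it runs without a network request:

```python
# Sample docs in the same shape as json["response"]["docs"]
# (values taken from the output shown above).
docs = [
    {'Dataset-PersistentId': 'PODAAC-MODST-M8D9N',
     'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME'},
    {'Dataset-PersistentId': 'PODAAC-NSGDR-L2X02',
     'Dataset-ShortName-Full': 'NSCAT_LEVEL_2_V2'},
]

# The flat list of ids the question wanted:
datasetIds = [d['Dataset-PersistentId'] for d in docs]
print(datasetIds)  # ['PODAAC-MODST-M8D9N', 'PODAAC-NSGDR-L2X02']
```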
There is a second ajax request that returns further data:
json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/granule/select/?q=*&fq=Granule-AccessType:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+Granule-Status:ONLINE&facet=true&facet.field=Dataset-ShortName-Full&rows=0&facet.limit=-1&facet.mincount=1&wt=json").json()
from pprint import pprint as pp
pp(json)
You can also change some of the parameters to get different output.
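This second query is a facet query (rows=0), so the interesting part of the response is under facet_counts. By default Solr's JSON writer returns each facet.field as a flat list alternating value and count, which you can pair up like this. The sample data below is made up just to show the shape; the real values come from the request above:

```python
# Solr facet.field results come back as a flat alternating list:
# ["<short name>", <granule count>, "<short name>", <granule count>, ...]
facets = ["NSCAT_LEVEL_2_V2", 1234, "EUR-L4HRfnd-GLOB-ODYSSEA", 56]

# Pair each dataset short name with its granule count.
counts = dict(zip(facets[0::2], facets[1::2]))
print(counts)
```

With the real response you would read the list from json["facet_counts"]["facet_fields"]["Dataset-ShortName-Full"] first.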
Upvotes: 1
Reputation: 2188
As @Hassan Mehmood mentioned, you have to use selenium (or another headless-browser driver) for this, because the table is generated with JavaScript. BeautifulSoup does not evaluate JavaScript and cannot be used to get the desired data.
You can use this as a starting point:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
log = logging.getLogger(__name__)
logging.getLogger('selenium').setLevel(logging.WARNING)
logging.getLogger('requests').setLevel(logging.WARNING)
def test():
    url = 'http://podaac.jpl.nasa.gov/ws/search/granule/index.html'
    wait_for_element = 30
    # Note: PhantomJS is deprecated; headless Chrome or Firefox also works.
    s = webdriver.PhantomJS()
    s.set_window_size(1274, 826)
    s.set_page_load_timeout(45)
    s.get(url)
    # Wait until the JavaScript-generated table has actually appeared.
    WebDriverWait(s, wait_for_element).until(
        EC.presence_of_element_located((By.CLASS_NAME, "detailTABLE")))
    datasets = s.find_elements_by_class_name("detailTABLE")
    for item in datasets:
        print(item.text)

if __name__ == '__main__':
    test()
Upvotes: 0