Reputation: 47
The NIST dataset website contains some data of copper, how can I grab the table in the left (titled “HTML table format “) from the website using a script of python. And only perverse the numbers in the second and third columns as shown in picture below. And store all data into a .csv file. I tried codes below, but it failed to get the correct format of the table.
import pandas as pd
# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
# Read the table into a pandas dataframe
df = pd.read_html(url, header=0, index_col=0)[0]
# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)
Upvotes: 1
Views: 2133
Reputation: 25048
You could use:
.droplevel([0,1])
to remove the unwanted header rows.dropna(axis=1, how='all')
to remove the empty columns.iloc[:,1:]
to select only specific 3 columnsimport pandas as pd
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
df = pd.read_html(url, header=[0,1,2,3])[1].droplevel([0,1], axis=1).dropna(axis=1, how='all').iloc[:,1:]
df
Upvotes: 2
Reputation: 19
For parsing HTML documents BeautifulSoup is a great Python package to use, this with the requests library you can extract the data you want.
The code below should extract the desired data:
# import packages/libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
# define URL link variable, get the response and parse the HTML dom contents
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
# declare table variable and use soup to find table in HTML dom
table = soup.find('table')
# iterate over table rows (tr) and append table data (td) to rows list
rows = []
for i, row in enumerate(table.find_all('tr')):
# only append data if its after 3rd row -> (MeV),(cm2/g),(cm2/g)
if i > 3:
rows.append([value.text.strip() for value in row.find_all('td')])
# create DataFrame from the data appended to the rows list
df = pd.DataFrame(rows)
# export data to csv file called datafile
df.to_csv(r"datafile.csv")
Upvotes: 0