Reputation: 37
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime
def extract_source(url):
agent = {"User-Agent":"Mozilla/5.0"}
source=requests.get(url, headers=agent).text
return source
html_text = extract_source('https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')
for a in soup.find_all('a', class_ = 'button button--link button--fluid catalog-list-item__actions-primary-button', href=True):
# print ("Found the URL:", a['href'])
urlof = a['href']
html_text = extract_source(urlof)
soup = BeautifulSoup(html_text, 'lxml')
table_rows = soup.find_all('tr')
first_columns = []
third_columns = []
for row in table_rows:
# for row in table_rows[1:]:
first_columns.append(row.findAll('td')[0])
third_columns.append(row.findAll('td')[1])
for first, third in zip(first_columns, third_columns):
print(first.text, third.text)
Basically I am trying to scrape data from tables from multiple links of Website. And I want to insert that data in one excel csv file in following table format
SKU 07DE9922 Analyte / Target Corticosterone Base Catalog Number DE9922 Diagnostic Platforms EIA/ELISA Diagnostic Solutions Endocrinology Disease Screened Corticosterone Evaluation Quantitative Pack Size 96 Wells Sample Type Plasma, Serum Sample Volume 10 uL Species Reactivity Mouse, Rat Usage Statement For Research Use Only, not for use in diagnostic procedures.
To below format in excel file
SKU Analyte/Target Base Catalog Number Pack Size Sample Type data data data data data
I am facing difficulties while converting data in proper format
Upvotes: 1
Views: 123
Reputation: 14949
Try this -
import pandas as pd
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup
def extract_source(url):
agent = {"User-Agent": "Mozilla/5.0"}
return requests.get(url, headers=agent).text
html_text = extract_source(
'https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')
result = []
for index, a in enumerate(soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True)):
# if index >= 10:
# break
# print ("Found the URL:", a['href'])
urlof = a['href']
html_text = extract_source(urlof)
soup = BeautifulSoup(html_text, 'lxml')
table_rows = soup.find_all('tr')
first_columns = []
third_columns = []
for row in table_rows:
# for row in table_rows[1:]:
first_columns.append(row.findAll('td')[0])
third_columns.append(row.findAll('td')[1])
temp = {}
for first, third in zip(first_columns, third_columns):
third = str(third.text).strip('\n')
first = str(first.text).strip('\n')
temp[first] = third
result.append(temp)
df = pd.DataFrame(result)
# please drop useless column before saving the output to csv.
df.to_csv('out.csv', index=False)
This will give output -
SKU | Alternate Names | Base Catalog Number | CAS # | EC Number | Format | Molecular Formula | Molecular Weight | Personal Protective Equipment | Usage Statement | Application Notes | Beilstein Registry Number | Optical Rotation | Purity | UV Visible Absorbance | Hazard Statements | RTECS Number | Safety Symbol |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
02100078-CF | 2-Acetamido-5-Guanidinovaleric acid | 100078 | 210545-23-6 | 205-846-6 | Powder | C8H16N4O3· 2H2O | 216.241 g/mol | Eyeshields, Gloves, respirator filter | Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. | ||||||||
02100142-CF | Acetyltryptophan; DL-α-Acetylamino-3-indolepropionic acid | 100142 | 87-32-1 | 201-739-3 | Powder | C13H14N2O3 | 246.266 g/mol | Eyeshields, Gloves, respirator filter | Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. | N-acetyl-DL-tryptophan, is used as stabilizer in the human blood-derived therapeutic products normal serum albumin and plasma protein fraction. | 89478 | 0° ± 2° (c=1, 1N NaOH, 24 hrs.) | ~99% | λ max (water)=280 ± 2 nm | |||
02100421-CF | L-2,5-Diaminopentanoic acid; 2,5-Diaminopentanoic acid monohydrochloride | 100421 | 3184-13-2 | 221-678-6 | Powder | C5H12N2O2 • HCl | 168.621 g/mol | Eyeshields, Gloves, respirator filter | Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. | 3625847 | ~99% | H319 | RM2985000 | GHS07 |
Upvotes: 1
Reputation: 195408
I made small modification to your code. Instead of printing the data, I created a dictionary and added it to list. Then I used this list to create a DataFrame:
import pandas as pd
import requests
import time
from datetime import datetime
def extract_source(url):
agent = {"User-Agent": "Mozilla/5.0"}
source = requests.get(url, headers=agent).text
return source
html_text = extract_source(
"https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids"
)
soup = BeautifulSoup(html_text, "lxml")
data = []
for a in soup.find_all(
"a",
class_="button button--link button--fluid catalog-list-item__actions-primary-button",
href=True,
):
urlof = a["href"]
html_text = extract_source(urlof)
soup = BeautifulSoup(html_text, "lxml")
table_rows = soup.find_all("tr")
first_columns = []
third_columns = []
for row in table_rows:
first_columns.append(row.findAll("td")[0])
third_columns.append(row.findAll("td")[1])
# create dictionary with values and add to the list
d = {}
for first, third in zip(first_columns, third_columns):
d[first.get_text(strip=True)] = third.get_text(strip=True)
data.append(d)
df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
SKU Alternate Names Base Catalog Number CAS # EC Number Format Molecular Formula Molecular Weight Personal Protective Equipment Usage Statement Application Notes Beilstein Registry Number Optical Rotation Purity UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol Auto Ignition Biochemical Physiological Actions Density Melting Point pH pKa Solubility Vapor Pressure Grade Boiling Point Isoelectric Point
0 02100078-CF 2-Acetamido-5-Guanidinovaleric acid 100078 210545-23-6 205-846-6 Powder C8H16N4O3· 2H2O 216.241 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 02100142-CF Acetyltryptophan; DL-α-Acetylamino-3-indolepro... 100142 87-32-1 201-739-3 Powder C13H14N2O3 246.266 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... N-acetyl-DL-tryptophan, is used as stabilizer ... 89478 0° ± 2° (c=1, 1N NaOH, 24 hrs.) ~99% λ max (water)=280 ± 2 nm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 02100421-CF L-2,5-Diaminopentanoic acid; 2,5-Diaminopentan... 100421 3184-13-2 221-678-6 Powder C5H12N2O2 • HCl 168.621 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 3625847 NaN ~99% NaN H319 RM2985000 GHS07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 02100520-CF Phosphocreatine Disodium Salt Tetrahydrate; So... 100520 922-32-7 213-074-6 Powder C4H8N3Na2O5P·4H2O 255.077 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN ≥98% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 02100769-CF Vitamin C; Ascorbate; Sodium ascorbate; L-Xylo... 100769 50-81-7 200-066-2 NaN C6H8O6 176.124 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-Ascorbic Acid is used as an Antimicrobial an... 84272 +18° to +32° (c=1, water) ≥98% NaN NaN CI7650000 NaN 1220° F (NTP, 1992) Ascorbic Acid, also known as Vitamin C, is a s... 1.65 (NTP, 1992) 374 to 378° F (decomposes) (NTP, 1992) Between 2,4 and 2,8 (2 % aqueous solution) pK1: 4.17; pK2: 11.57 greater than or equal to 100 mg/mL at 73° F (... 9.28X10-11 mm Hg at 25 deg C (est) NaN NaN NaN
5 02101003-CF Lycine; Oxyneurine; (Carboxymethyl)trimethylam... 101003 107-43-7 203-490-6 Powder C5H11NO2 117.148 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Betaine is a reagent that is used in soldering... 3537113 NaN NaN NaN NaN DS5900000 NaN NaN End-product of oxidative metabolism of choline... NaN Decomposes around 293 deg C NaN 1.83 (Lit.) Solubility (g/100 g solvent): <a class="pubche... 1.36X10-8 mm Hg at 25 deg C (est) Anhydrous NaN NaN
6 02101806-CF (S)-2,5-Diamino-5-oxopentanoic acid; L-Glutami... 101806 56-85-9 200-292-1 NaN C5H10N2O3 146.146 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-glutamine is an essential amino acid, which ... 1723797 +30 ± 5° (c = 3.5, 1N HCl) ≥99% NaN NaN MA2275100 NaN NaN L-Glutamine is an essential amino acid that is... 1.364 g/cu cm 185.5 dec °C pH = 5-6 at 14.6 g/L at 25 deg C NaN Water Solubility41300 mg/L (at 25 °C) 1.9X10-8 mm Hg at 25 deg C (est) NaN NaN NaN
7 02102158-CF L-2-Amino-4-methylpentanoic acid; Leu; L; α-am... 102158 61-90-5 200-522-0 Powder C6H13NO2 131.175 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Leucine has been used as a molecular marker in... 1721722 +14.5 ° to +16.5 ° (Lit.) NaN NaN NaN OH2850000 NaN NaN NaN 1.293 g/cu cm at 18 deg C 293 °C NaN 2.33 (-COOH), 9.74 (-NH2)(Lit.) Water Solubility21500 mg/L (at 25 °C) 5.52X10-9 mm Hg at 25 deg C (est) NaN Sublimes at 145-148 deg C. Decomposes at 293-2... 6.04(Lit.)
8 02102576-CF 4-Hydroxycinnamic acid; 3-(4-Hydroxphenyl)-2-p... 102576 7400-08-0 231-000-0 Powder C9H8O3 164.16 g/mol Dust mask, Eyeshields, Gloves Unless specified otherwise, MP Biomedical's pr... p-Coumaric acid was used as a substrate to stu... NaN NaN ≥98% NaN NaN GD9094000 NaN NaN NaN NaN 211.5 °C NaN NaN NaN NaN NaN NaN NaN
9 02102868-CF DL-2-Amino-3-hydroxypropionic acid; (±)-2-Amin... 102868 302-84-1 206-130-6 Powder C3H7NO3 105.093 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 1721405 -1° to + 1° (c = 5, 1N HCl) ≥98% NaN NaN NaN NaN NaN NMDA agonist acting at the glycine site; precu... 1.6 g/cu cm @ 22 deg C 228 deg C (decomposes) NaN NaN SOL IN <a class="pubchem-internal-link CID-962... NaN NaN NaN NaN
...and so on.
And saves data.csv
(screenshot from LibreOffice):
Upvotes: 3