sumit shelar
sumit shelar

Reputation: 37

How to Insert following beautiful soup scraped data in excel?

from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime

def extract_source(url):
     agent = {"User-Agent":"Mozilla/5.0"}
     source=requests.get(url, headers=agent).text
     return source

html_text = extract_source('https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')

for a in soup.find_all('a', class_ = 'button button--link button--fluid catalog-list-item__actions-primary-button', href=True):
    # print ("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml') 
        
    table_rows = soup.find_all('tr')

    first_columns = []
    third_columns = []
    for row in table_rows:
#         for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])

    for first, third in zip(first_columns, third_columns):
        print(first.text, third.text)

Basically I am trying to scrape data from tables from multiple links of Website. And I want to insert that data in one excel csv file in following table format

SKU                           07DE9922
Analyte / Target           Corticosterone
Base Catalog Number        DE9922
Diagnostic Platforms       EIA/ELISA
Diagnostic Solutions       Endocrinology
Disease Screened           Corticosterone
Evaluation                 Quantitative
Pack Size                  96 Wells
Sample Type                Plasma, Serum
Sample Volume              10 uL
Species Reactivity         Mouse, Rat
Usage Statement            For Research Use Only, not for use in diagnostic procedures.

To below format in excel file

SKU  Analyte/Target  Base Catalog Number Pack Size Sample Type
data   data           data                 data      data

I am facing difficulties while converting data in proper format

Upvotes: 1

Views: 123

Answers (2)

Nk03
Nk03

Reputation: 14949

Try this -

import pandas as pd
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup


def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    return requests.get(url, headers=agent).text


html_text = extract_source(
    'https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')


result = []
for index, a in enumerate(soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True)):
    #  if index >= 10:
    #       break
    # print ("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')

    table_rows = soup.find_all('tr')

    first_columns = []
    third_columns = []
    for row in table_rows:
        #         for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])

    temp = {}
    for first, third in zip(first_columns, third_columns):
        third = str(third.text).strip('\n')
        first = str(first.text).strip('\n')
        temp[first] = third

    result.append(temp)
df = pd.DataFrame(result)
# please drop useless column before saving the output to csv.
df.to_csv('out.csv', index=False)

This will give output -

SKU Alternate Names Base Catalog Number CAS # EC Number Format Molecular Formula Molecular Weight Personal Protective Equipment Usage Statement Application Notes Beilstein Registry Number Optical Rotation Purity UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol
02100078-CF 2-Acetamido-5-Guanidinovaleric acid 100078 210545-23-6 205-846-6 Powder C8H16N4O3· 2H2O 216.241 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.
02100142-CF Acetyltryptophan; DL-α-Acetylamino-3-indolepropionic acid 100142 87-32-1 201-739-3 Powder C13H14N2O3 246.266 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. N-acetyl-DL-tryptophan, is used as stabilizer in the human blood-derived therapeutic products normal serum albumin and plasma protein fraction. 89478 0° ± 2° (c=1, 1N NaOH, 24 hrs.) ~99% λ max (water)=280 ± 2 nm
02100421-CF L-2,5-Diaminopentanoic acid; 2,5-Diaminopentanoic acid monohydrochloride 100421 3184-13-2 221-678-6 Powder C5H12N2O2 • HCl 168.621 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. 3625847 ~99% H319 RM2985000 GHS07

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195408

I made small modification to your code. Instead of printing the data, I created a dictionary and added it to list. Then I used this list to create a DataFrame:

import pandas as pd
import requests
import time
from datetime import datetime


def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source


html_text = extract_source(
    "https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids"
)
soup = BeautifulSoup(html_text, "lxml")

data = []
for a in soup.find_all(
    "a",
    class_="button button--link button--fluid catalog-list-item__actions-primary-button",
    href=True,
):
    urlof = a["href"]
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, "lxml")

    table_rows = soup.find_all("tr")

    first_columns = []
    third_columns = []
    for row in table_rows:
        first_columns.append(row.findAll("td")[0])
        third_columns.append(row.findAll("td")[1])

    # create dictionary with values and add to the list
    d = {}
    for first, third in zip(first_columns, third_columns):
        d[first.get_text(strip=True)] = third.get_text(strip=True)
    data.append(d)

df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)

Prints:

           SKU                                    Alternate Names Base Catalog Number        CAS #  EC Number  Format  Molecular Formula Molecular Weight           Personal Protective Equipment                                    Usage Statement                                  Application Notes Beilstein Registry Number                 Optical Rotation Purity     UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol         Auto Ignition                  Biochemical Physiological Actions                    Density                            Melting Point                                          pH                              pKa                                         Solubility                      Vapor Pressure      Grade                                      Boiling Point Isoelectric Point
0  02100078-CF                2-Acetamido-5-Guanidinovaleric acid              100078  210545-23-6  205-846-6  Powder    C8H16N4O3· 2H2O    216.241 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                       NaN                              NaN    NaN                       NaN               NaN          NaN           NaN                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
1  02100142-CF  Acetyltryptophan; DL-α-Acetylamino-3-indolepro...              100142      87-32-1  201-739-3  Powder         C13H14N2O3    246.266 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  N-acetyl-DL-tryptophan, is used as stabilizer ...                     89478  0° ± 2° (c=1, 1N NaOH, 24 hrs.)   ~99%  λ max (water)=280 ± 2 nm               NaN          NaN           NaN                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
2  02100421-CF  L-2,5-Diaminopentanoic acid; 2,5-Diaminopentan...              100421    3184-13-2  221-678-6  Powder    C5H12N2O2 • HCl    168.621 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                   3625847                              NaN   ~99%                       NaN              H319    RM2985000         GHS07                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
3  02100520-CF  Phosphocreatine Disodium Salt Tetrahydrate; So...              100520     922-32-7  213-074-6  Powder  C4H8N3Na2O5P·4H2O    255.077 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                       NaN                              NaN   ≥98%                       NaN               NaN          NaN           NaN                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
4  02100769-CF  Vitamin C; Ascorbate; Sodium ascorbate; L-Xylo...              100769      50-81-7  200-066-2     NaN             C6H8O6    176.124 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  L-Ascorbic Acid is used as an Antimicrobial an...                     84272        +18° to +32° (c=1, water)   ≥98%                       NaN               NaN    CI7650000           NaN  1220° F (NTP, 1992)  Ascorbic Acid, also known as Vitamin C, is a s...           1.65 (NTP, 1992)  374 to 378° F (decomposes) (NTP, 1992)  Between 2,4 and 2,8 (2 % aqueous solution)            pK1: 4.17; pK2: 11.57  greater than or equal to 100 mg/mL at 73° F (...  9.28X10-11 mm Hg at 25 deg C (est)        NaN                                                NaN               NaN
5  02101003-CF  Lycine; Oxyneurine; (Carboxymethyl)trimethylam...              101003     107-43-7  203-490-6  Powder           C5H11NO2    117.148 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  Betaine is a reagent that is used in soldering...                   3537113                              NaN    NaN                       NaN               NaN    DS5900000           NaN                   NaN  End-product of oxidative metabolism of choline...                        NaN              Decomposes around 293 deg C                                         NaN                      1.83 (Lit.)  Solubility (g/100 g solvent): <a class="pubche...   1.36X10-8 mm Hg at 25 deg C (est)  Anhydrous                                                NaN               NaN
6  02101806-CF  (S)-2,5-Diamino-5-oxopentanoic acid; L-Glutami...              101806      56-85-9  200-292-1     NaN          C5H10N2O3    146.146 g/mol  Eyeshields, Gloves,  respirator filter  Unless specified otherwise, MP Biomedical's pr...  L-glutamine is an essential amino acid, which ...                   1723797       +30 ± 5° (c = 3.5, 1N HCl)   ≥99%                       NaN               NaN    MA2275100           NaN                   NaN  L-Glutamine is an essential amino acid that is...              1.364 g/cu cm                             185.5 dec °C            pH = 5-6 at 14.6 g/L at 25 deg C                              NaN              Water Solubility41300 mg/L (at 25 °C)    1.9X10-8 mm Hg at 25 deg C (est)        NaN                                                NaN               NaN
7  02102158-CF  L-2-Amino-4-methylpentanoic acid; Leu; L; α-am...              102158      61-90-5  200-522-0  Powder           C6H13NO2    131.175 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  Leucine has been used as a molecular marker in...                   1721722        +14.5 ° to +16.5 ° (Lit.)    NaN                       NaN               NaN    OH2850000           NaN                   NaN                                                NaN  1.293 g/cu cm at 18 deg C                                   293 °C                                         NaN  2.33 (-COOH), 9.74 (-NH2)(Lit.)              Water Solubility21500 mg/L (at 25 °C)   5.52X10-9 mm Hg at 25 deg C (est)        NaN  Sublimes at 145-148 deg C. Decomposes at 293-2...        6.04(Lit.)
8  02102576-CF  4-Hydroxycinnamic acid; 3-(4-Hydroxphenyl)-2-p...              102576    7400-08-0  231-000-0  Powder             C9H8O3     164.16 g/mol           Dust mask, Eyeshields, Gloves  Unless specified otherwise, MP Biomedical's pr...  p-Coumaric acid was used as a substrate to stu...                       NaN                              NaN   ≥98%                       NaN               NaN    GD9094000           NaN                   NaN                                                NaN                        NaN                                 211.5 °C                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
9  02102868-CF  DL-2-Amino-3-hydroxypropionic acid; (±)-2-Amin...              102868     302-84-1  206-130-6  Powder            C3H7NO3    105.093 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                   1721405      -1° to + 1° (c = 5, 1N HCl)   ≥98%                       NaN               NaN          NaN           NaN                   NaN  NMDA agonist acting at the glycine site; precu...     1.6 g/cu cm @ 22 deg C                   228 deg C (decomposes)                                         NaN                              NaN  SOL IN <a class="pubchem-internal-link CID-962...                                 NaN        NaN                                                NaN               NaN

...and so on.

And saves data.csv (screenshot from LibreOffice):

enter image description here

Upvotes: 3

Related Questions