S.EB

Reputation: 2226

How to get the information of specific 'li' tag in web scraping (e.g., second <li> tag)?

I am trying to write a web scraper for this link, which has around 300 pages. Each page lists several records, and the data has to be taken from each record's own page. For example, the page for DRAMP00005 has several fields; I need to extract the required ones and save them to a pandas DataFrame.

I wrote the code:

from bs4 import BeautifulSoup
from lxml import html
import requests as rq
import re
import pandas as pd
import logging
import matplotlib.pyplot as plt
import pdb
import json
base_url='http://dramp.cpu-bioinfor.org/browse/All_Information.php?id='
url='{}DRAMP00005'.format(base_url)
page = rq.get(url)

htmlSoup = BeautifulSoup(page.content, "lxml")
divs_section=htmlSoup.select("div.bs-docs-section") # get the contents of all sections such as General Information, Activity Information, 
#print(len(divs_section))
print(divs_section[1])

new_table = pd.DataFrame(columns=range(0,14), index = [url])
new_table.columns=['DRAMP ID','Peptide Name','Source','Family','Gene', 'Sequence','Sequence Length','UniProt Entry','Protein Existence', 'Biological Activity','Target Organism','Hemolytic Activity','Cytotoxicity','Binding Target']

#Add all text from html elements td width=61%
row_marker = 0
for dv in range(0,1): #range(len(divs_section)): 
  container=divs_section[dv].find_all("ul", {'class':'list-inline'})  # the content inside each section
  #print(container)
  for row in range(len(container)):
    column_marker = 0
    columns =container[row].find_all('li')
    # print(len(columns))
    for col in columns:
      print(col.get_text())
      
      new_table.iat[row_marker,column_marker] = col.get_text()
      column_marker += 1
    #print(len(container[row].find_all('li')))
new_table

and the output looked like this:

6
DRAMP ID
DRAMP00005
Peptide Name
Epicidin 280 (Bacteriocin)
Source
Staphylococcus epidermidis BN 280 (Gram-positive bacteria)
Family
Belongs to the lantibiotic family (Class I bacteriocin)
Gene
eciA
Sequence
SLGPAIKATRQVCPKATRFVTVSCKKSDCQ
Sequence Length
30
UniProt Entry

O54220

Protein Existence
Protein level

DRAMP ID    Peptide Name    Source  Family  Gene    Sequence    Sequence Length UniProt Entry   Protein Existence   Biological Activity Target Organism Hemolytic Activity  Cytotoxicity    Binding Target
http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP00005  Protein Existence   Protein level   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

I am new to web scraping and practicing to learn. I want each value to land in its matching column of the table, for example DRAMP00005 in the column titled DRAMP ID, and so on. How can I correct this?

My other question is how to repeat the scraping:

  1. on the same page, for each of the 20 records (each record links to the page holding the information that needs to be extracted),
  2. across pages, moving one by one until the last page of records is reached (i.e., page 285).

Upvotes: 1

Views: 531

Answers (1)

HedgeHog

Reputation: 25048

You could use a dict comprehension and a more specific selection of elements to reach your goal: select every <li> that holds a column header (that is, every <li> directly followed by another <li>), and while iterating use its text as the key and the text of its find_next_sibling('li') as the value.

dict(
    (e.get_text(strip=True),e.find_next_sibling('li').get_text(strip=True)) 
    for e in soup_dp.select('.list-inline>li:has(+li)')
)
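As a minimal offline sketch of that selector, the snippet below runs it against a small hand-written HTML fragment. The fragment and the parse_fields helper are assumptions made for illustration: the markup only imitates the label/value <li> pairs seen in the question's output and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each ul.list-inline holds a label <li> and a value <li>.
SAMPLE_HTML = """
<ul class="list-inline"><li>DRAMP ID</li><li>DRAMP00005</li></ul>
<ul class="list-inline"><li>Peptide Name</li><li>Epicidin 280 (Bacteriocin)</li></ul>
<ul class="list-inline"><li>UniProt Entry</li><li> <a href="#">O54220</a> </li></ul>
"""


def parse_fields(html):
    """Return {label: value} for every <li> that is directly followed by an <li>."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        label.get_text(strip=True): label.find_next_sibling("li").get_text(strip=True)
        for label in soup.select(".list-inline > li:has(+ li)")
    }


record = parse_fields(SAMPLE_HTML)
print(record)
```

Because li:has(+ li) only matches an <li> with an immediately following sibling, the value items themselves are never picked up as keys, and get_text(strip=True) removes the stray whitespace and blank lines that show up around the UniProt entry in the question's output.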

To iterate over all result pages and their detail pages, use a while loop and break out of it once the check for a Next button fails.
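The loop logic can be tried without touching the server. The sketch below is only an illustration under assumptions: PAGES is a made-up two-page site held in a dict, the a.rec class and the crawl helper are hypothetical names, and fetch stands in for requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Hypothetical two-page listing; only the first page has a "Next >" link.
PAGES = {
    "/list?page=1": '<a class="rec" href="/r/1">r1</a><a href="/list?page=2">Next &gt;</a>',
    "/list?page=2": '<a class="rec" href="/r/2">r2</a>',
}


def crawl(start, fetch):
    """Collect record links page by page until no "Next >" link is found."""
    url, found = start, []
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        found.extend(a["href"] for a in soup.select("a.rec"))
        # The same check the real loop uses: stop when the link is missing.
        next_link = soup.select_one('a:-soup-contains("Next >")')
        url = next_link["href"] if next_link else None
    return found


records = crawl("/list?page=1", PAGES.__getitem__)
print(records)
```

Passing the fetch function in as a parameter keeps the stop condition testable on its own; for the real site it would be replaced by a function that downloads the page.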

Example

Note: for demonstration the example starts at &pageNow=284; you could set this to 1 to retrieve all results.

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'http://dramp.cpu-bioinfor.org/search/advanced_search.php?geneinfo_data%5B0%5D=&boo_gene%5B0%5D=And&geneinfo_data%5B1%5D=&boo_gene%5B1%5D=And&length=&boo_length=And&geneinfo_data%5B2%5D=&boo_gene%5B2%5D=And&geneinfo_data%5B3%5D=&boo_gene%5B3%5D=And&geneinfo_data%5B4%5D=&boo_gene%5B4%5D=And&ckbx1%5B%5D=&ckbx1%5B%5D=Antimicrobial&boo_act=And&activity%5B0%5D=&bool_cactivity%5B0%5D=And&comments%5B0%5D=&bool_comments%5B0%5D=And&comments%5B1%5D=&bool_comments%5B1%5D=And&db=&db_id=&end=285&begin=280&pageNow=284'

data = []
while True:
    # Fetch and parse the current results page
    soup = BeautifulSoup(requests.get(url).text)

    # The second column of the results table links to each record's detail page
    for a in soup.select('[summary="The Result Of Ser"] tr td:nth-of-type(2) a'):
        # The hrefs appear to be relative ('../browse/...'), so drop the leading dots
        dp = 'http://dramp.cpu-bioinfor.org' + a.get('href').lstrip('.')
        d = {'url': dp}
        soup_dp = BeautifulSoup(requests.get(dp).text)
        # Map each label <li> to the text of the <li> that follows it
        d.update(
            dict(
                (e.get_text(strip=True), e.find_next_sibling('li').get_text(strip=True))
                for e in soup_dp.select('.list-inline>li:has(+li)')
            )
        )
        data.append(d)

    # Follow the "Next >" link until there is none left
    next_link = soup.select_one('a:-soup-contains("Next >")')
    if next_link is None:
        break
    url = 'http://dramp.cpu-bioinfor.org' + next_link.get('href')

pd.DataFrame(data)

Output

url DRAMP ID Peptide Name Source Family Gene Sequence Sequence Length UniProt Entry Protein Existence Biological Activity Target Organism Hemolytic Activity Cytotoxicity Binding Target Linear/Cyclic N-terminal Modification C-terminal Modification Nonterminal Modifications and Unusual Amino Acids Stereochemistry Structure Structure Description PDB ID Formula Absent Amino Acids Common Amino Acids Mass PI Basic Residues Acidic Residues Hydrophobic Residues Net Charge Boman Index Hydrophobicity Aliphatic Index Half Life Extinction Coefficient Cystines Absorbance 280nm Polar Residues Function Title Pubmed ID Reference Author Mechanism
0 http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29276&dataset= DRAMP29276 E1P41-1 Synthetic construct Not found Not found KWESEFWRWTEQLASNYW 18 No entry found Not found Antimicrobial,Antiviral [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=66.7 ± 20.2 μM). No hemolysis information or data found in the reference(s) presented in this entry No cytotoxicity information found in the reference(s) presented gp41 Linear Free Free None L Not found Not found None C117H152N28O31 CDGHIMPV W 2446.66 4.79 2 3 7 -1 -4356 -1.372 27.22 Mammalian:1.3 hourYeast:3 minE.coli:2 min 23490 1381.76 5 Antiviral activity against Influenza virus. Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. 26905802 Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. nan
1 http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29277 DRAMP29277 E1P41-2 Synthetic construct Not found Not found WESEFWRWTEQLASNYWI 18 No entry found Not found Antimicrobial,Antiviral [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=22.0 ± 0.0 μM). No hemolysis information or data found in the reference(s) presented in this entry No cytotoxicity information found in the reference(s) presented gp41 Linear Free Free None L Not found Not found None C117H151N27O31 CDGHKMPV W 2431.65 4.25 1 3 8 -2 -3309 -0.906 48.89 Mammalian:2.8 hourYeast:3 minE.coli:2 min 23490 1381.76 5 Antiviral activity against Influenza virus. Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. 26905802 Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. nan
2 http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29278&dataset= DRAMP29278 E1P42 Synthetic construct Not found Not found ESEFWRWTEQLASNYWIL 18 No entry found Not found Antimicrobial,Antiviral [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=50.0 ± 8.7 μM). No hemolysis information or data found in the reference(s) presented in this entry No cytotoxicity information found in the reference(s) presented gp41 Linear Free Free None L Not found Not found None C112H152N26O31 CDGHKMPV EW 2358.59 4.25 1 3 8 -2 -3050 -0.644 70.56 Mammalian:1 hourYeast:30 minE.coli:>10 hour 17990 1058.24 5 Antiviral activity against Influenza virus. Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. 26905802 Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. nan
3 http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29279 DRAMP29279 E1P42-1 Synthetic construct Not found Not found SEFWRWTEQLASNYWILE 18 No entry found Not found Antimicrobial,Antiviral [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=31.0 ± 3.5 μM). No hemolysis information or data found in the reference(s) presented in this entry No cytotoxicity information found in the reference(s) presented gp41 Linear Free Free None L Not found Not found None C112H152N26O31 CDGHKMPV EW 2358.59 4.25 1 3 8 -2 -3050 -0.644 70.56 Mammalian:1.9 hourYeast:>20 hourE.coli:>10 hour 17990 1058.24 5 Antiviral activity against Influenza virus. Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. 26905802 Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. nan
4 http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29280&dataset= DRAMP29280 E1P42-2 Synthetic construct Not found Not found EFWRWTEQLASNYWILEY 18 No entry found Not found Antimicrobial,Antiviral [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50>125 μM). No hemolysis information or data found in the reference(s) presented in this entry No cytotoxicity information found in the reference(s) presented gp41 Linear Free Free None L Not found Not found None C118H156N26O31 CDGHKMPV EW 2434.69 4.25 1 3 8 -2 -2724 -0.672 70.56 Mammalian:1 hourYeast:30 minE.coli:>10 hour 19480 1145.88 5 Antiviral activity against Influenza virus. Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. 26905802 Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. nan

...

Upvotes: 1
