Reputation: 2226
I am trying to write a web scraper for this link, which has around 300 pages. Each page lists links to records, and the data has to be taken from each record's own page. For example, the record DRAMP00005 has several fields whose values I need to extract and save to a pandas DataFrame.
I wrote this code:
from bs4 import BeautifulSoup
from lxml import html
import requests as rq
import re
import pandas as pd
import logging
import matplotlib.pyplot as plt
import pdb
import json
base_url='http://dramp.cpu-bioinfor.org/browse/All_Information.php?id='
url='{}DRAMP00005'.format(base_url)
page = rq.get(url)
htmlSoup = BeautifulSoup(page.content, "lxml")
divs_section=htmlSoup.select("div.bs-docs-section") # get the contents of all sections such as General Information, Activity Information,
#print(len(divs_section))
print(divs_section[1])
new_table = pd.DataFrame(columns=range(0,14), index = [url])
new_table.columns=['DRAMP ID','Peptide Name','Source','Family','Gene', 'Sequence','Sequence Length','UniProt Entry','Protein Existence', 'Biological Activity','Target Organism','Hemolytic Activity','Cytotoxicity','Binding Target']
#Add all text from html elements td width=61%
row_marker = 0
for dv in range(0,1): #range(len(divs_section)):
    container=divs_section[dv].find_all("ul", {'class':'list-inline'}) # the content inside each section
    #print(container)
    for row in range(len(container)):
        column_marker = 0
        columns =container[row].find_all('li')
        # print(len(columns))
        for col in columns:
            print(col.get_text())
            new_table.iat[row_marker,column_marker] = col.get_text()
            column_marker += 1
        #print(len(container[row].find_all('li')))
new_table
and the output looked like this:
6
DRAMP ID
DRAMP00005
Peptide Name
Epicidin 280 (Bacteriocin)
Source
Staphylococcus epidermidis BN 280 (Gram-positive bacteria)
Family
Belongs to the lantibiotic family (Class I bacteriocin)
Gene
eciA
Sequence
SLGPAIKATRQVCPKATRFVTVSCKKSDCQ
Sequence Length
30
UniProt Entry
O54220
Protein Existence
Protein level
| | DRAMP ID | Peptide Name | Source | Family | Gene | Sequence | Sequence Length | UniProt Entry | Protein Existence | Biological Activity | Target Organism | Hemolytic Activity | Cytotoxicity | Binding Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP00005 | Protein Existence | Protein level | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
I am new to web scraping and am practicing to learn. I want each value to end up in its matching column of the table, for example DRAMP00005 in the column titled DRAMP ID, and so on.
How can I correct this?
My other question is how to repeat the web scraping across all pages.
Upvotes: 1
Views: 531
Reputation: 25048
You could build a dict (here with dict() over a generator of key/value pairs, in the spirit of a dict comprehension) and use a more specific selection of elements to reach your goal: select every <li> that holds a column header, and while iterating use its text as the key and the text of its find_next_sibling('li') as the value.
dict(
    (e.get_text(strip=True),e.find_next_sibling('li').get_text(strip=True))
    for e in soup_dp.select('.list-inline>li:has(+li)')
)
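To see why the loop in the question ends up with only "Protein Existence" and "Protein level" in the first two columns, here is a minimal sketch that uses plain lists as a stand-in for the scraped <li> texts (the values are copied from the printed output in the question; the variable names are only illustrative). It assumes every ul.list-inline holds exactly one label and one value.

```python
# Stand-in for the <li> texts of each ul.list-inline on one record page.
rows = [
    ["DRAMP ID", "DRAMP00005"],
    ["Peptide Name", "Epicidin 280 (Bacteriocin)"],
    ["Gene", "eciA"],
    ["Sequence Length", "30"],
]

# Approach from the question: column_marker restarts at 0 for every <ul>,
# so each label/value pair overwrites cells 0 and 1 of the same table row.
cells = [None] * 4
for pair in rows:
    column_marker = 0
    for text in pair:
        cells[column_marker] = text
        column_marker += 1

# Dict approach: the first <li> is the key, its sibling is the value.
record = dict((label, value) for label, value in rows)

print(cells)               # ['Sequence Length', '30', None, None]
print(record["DRAMP ID"])  # DRAMP00005
```

Only the last pair survives in the first approach, while the dict keeps every label with its value, which is what the columns of the DataFrame need.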
To iterate over all result pages and their detail pages, use a while loop and break out of it once the check for a "Next" button fails.
Note: for demonstration the example starts at &pageNow=284; you could set this to 1 to retrieve all results.
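The control flow of that loop can be tried without any network access. The following is only a sketch: FAKE_SITE, fetch, crawl and the example.org URLs are hypothetical stand-ins for requests.get plus parsing, and the relative href shapes are an assumption about what a site might return. It also uses urljoin from the standard library to resolve relative links, as a possible alternative to string concatenation.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": each result page lists relative links
# to detail pages and, unless it is the last page, a link to the next page.
FAKE_SITE = {
    "http://example.org/search/results.php?pageNow=1": {
        "records": ["../browse/info.php?id=A1", "../browse/info.php?id=A2"],
        "next": "/search/results.php?pageNow=2",
    },
    "http://example.org/search/results.php?pageNow=2": {
        "records": ["../browse/info.php?id=A3"],
        "next": None,
    },
}


def fetch(url):
    """Stand-in for downloading and parsing one result page."""
    return FAKE_SITE[url]


def crawl(start_url):
    """Follow "next" links until none is left, collecting detail-page URLs."""
    detail_urls = []
    url = start_url
    while True:
        page = fetch(url)
        # Resolve each relative href against the page it was found on.
        detail_urls.extend(urljoin(url, href) for href in page["records"])
        if page["next"] is None:
            break  # no "Next" link: this was the last page
        url = urljoin(url, page["next"])
    return detail_urls


urls = crawl("http://example.org/search/results.php?pageNow=1")
print(urls)
```

Starting from the first page this collects the three detail-page URLs in order; starting from the last page the loop runs once and stops.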
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'http://dramp.cpu-bioinfor.org/search/advanced_search.php?geneinfo_data%5B0%5D=&boo_gene%5B0%5D=And&geneinfo_data%5B1%5D=&boo_gene%5B1%5D=And&length=&boo_length=And&geneinfo_data%5B2%5D=&boo_gene%5B2%5D=And&geneinfo_data%5B3%5D=&boo_gene%5B3%5D=And&geneinfo_data%5B4%5D=&boo_gene%5B4%5D=And&ckbx1%5B%5D=&ckbx1%5B%5D=Antimicrobial&boo_act=And&activity%5B0%5D=&bool_cactivity%5B0%5D=And&comments%5B0%5D=&bool_comments%5B0%5D=And&comments%5B1%5D=&bool_comments%5B1%5D=And&db=&db_id=&end=285&begin=280&pageNow=284'
data = []

while True:
    # Load the current result page
    soup = BeautifulSoup(requests.get(url).text)

    # The second column of the result table links to each record's detail page
    for dp in ['http://dramp.cpu-bioinfor.org'+a.get('href').strip('..') for a in soup.select('[summary="The Result Of Ser"] tr td:nth-of-type(2) a')]:
        d={'url':dp}
        soup_dp = BeautifulSoup(requests.get(dp).text)
        # Pair every label <li> with the <li> that follows it
        d.update(
            dict(
                (e.get_text(strip=True),e.find_next_sibling('li').get_text(strip=True))
                for e in soup_dp.select('.list-inline>li:has(+li)')
            )
        )
        data.append(d)

    # Continue with the next result page, or stop when there is no "Next >" link
    if soup.select_one('a:-soup-contains("Next >")'):
        url='http://dramp.cpu-bioinfor.org'+soup.select_one('a:-soup-contains("Next >")').get('href')
    else:
        break

pd.DataFrame(data)
| | url | DRAMP ID | Peptide Name | Source | Family | Gene | Sequence | Sequence Length | UniProt Entry | Protein Existence | Biological Activity | Target Organism | Hemolytic Activity | Cytotoxicity | Binding Target | Linear/Cyclic | N-terminal Modification | C-terminal Modification | Nonterminal Modifications and Unusual Amino Acids | Stereochemistry | Structure | Structure Description | PDB ID | Formula | Absent Amino Acids | Common Amino Acids | Mass | PI | Basic Residues | Acidic Residues | Hydrophobic Residues | Net Charge | Boman Index | Hydrophobicity | Aliphatic Index | Half Life | Extinction Coefficient Cystines | Absorbance 280nm | Polar Residues | Function | Title | Pubmed ID | Reference | Author | Mechanism |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29276&dataset= | DRAMP29276 | E1P41-1 | Synthetic construct | Not found | Not found | KWESEFWRWTEQLASNYW | 18 | No entry found | Not found | Antimicrobial,Antiviral | [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=66.7 ± 20.2 μM). | No hemolysis information or data found in the reference(s) presented in this entry | No cytotoxicity information found in the reference(s) presented | gp41 | Linear | Free | Free | None | L | Not found | Not found | None | C117H152N28O31 | CDGHIMPV | W | 2446.66 | 4.79 | 2 | 3 | 7 | -1 | -4356 | -1.372 | 27.22 | Mammalian:1.3 hourYeast:3 minE.coli:2 min | 23490 | 1381.76 | 5 | Antiviral activity against Influenza virus. | Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. | 26905802 | Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. | Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. | nan |
1 | http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29277 | DRAMP29277 | E1P41-2 | Synthetic construct | Not found | Not found | WESEFWRWTEQLASNYWI | 18 | No entry found | Not found | Antimicrobial,Antiviral | [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=22.0 ± 0.0 μM). | No hemolysis information or data found in the reference(s) presented in this entry | No cytotoxicity information found in the reference(s) presented | gp41 | Linear | Free | Free | None | L | Not found | Not found | None | C117H151N27O31 | CDGHKMPV | W | 2431.65 | 4.25 | 1 | 3 | 8 | -2 | -3309 | -0.906 | 48.89 | Mammalian:2.8 hourYeast:3 minE.coli:2 min | 23490 | 1381.76 | 5 | Antiviral activity against Influenza virus. | Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. | 26905802 | Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. | Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. | nan |
2 | http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29278&dataset= | DRAMP29278 | E1P42 | Synthetic construct | Not found | Not found | ESEFWRWTEQLASNYWIL | 18 | No entry found | Not found | Antimicrobial,Antiviral | [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=50.0 ± 8.7 μM). | No hemolysis information or data found in the reference(s) presented in this entry | No cytotoxicity information found in the reference(s) presented | gp41 | Linear | Free | Free | None | L | Not found | Not found | None | C112H152N26O31 | CDGHKMPV | EW | 2358.59 | 4.25 | 1 | 3 | 8 | -2 | -3050 | -0.644 | 70.56 | Mammalian:1 hourYeast:30 minE.coli:>10 hour | 17990 | 1058.24 | 5 | Antiviral activity against Influenza virus. | Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. | 26905802 | Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. | Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. | nan |
3 | http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29279 | DRAMP29279 | E1P42-1 | Synthetic construct | Not found | Not found | SEFWRWTEQLASNYWILE | 18 | No entry found | Not found | Antimicrobial,Antiviral | [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50=31.0 ± 3.5 μM). | No hemolysis information or data found in the reference(s) presented in this entry | No cytotoxicity information found in the reference(s) presented | gp41 | Linear | Free | Free | None | L | Not found | Not found | None | C112H152N26O31 | CDGHKMPV | EW | 2358.59 | 4.25 | 1 | 3 | 8 | -2 | -3050 | -0.644 | 70.56 | Mammalian:1.9 hourYeast:>20 hourE.coli:>10 hour | 17990 | 1058.24 | 5 | Antiviral activity against Influenza virus. | Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. | 26905802 | Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. | Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. | nan |
4 | http://dramp.cpu-bioinfor.org/browse/All_Information.php?id=DRAMP29280&dataset= | DRAMP29280 | E1P42-2 | Synthetic construct | Not found | Not found | EFWRWTEQLASNYWILEY | 18 | No entry found | Not found | Antimicrobial,Antiviral | [Ref.26905802]Virus:HIV-1 NL4-3:inhibition of virus infection in TZM-bl cells(IC50>125 μM). | No hemolysis information or data found in the reference(s) presented in this entry | No cytotoxicity information found in the reference(s) presented | gp41 | Linear | Free | Free | None | L | Not found | Not found | None | C118H156N26O31 | CDGHKMPV | EW | 2434.69 | 4.25 | 1 | 3 | 8 | -2 | -2724 | -0.672 | 70.56 | Mammalian:1 hourYeast:30 minE.coli:>10 hour | 19480 | 1145.88 | 5 | Antiviral activity against Influenza virus. | Definition of an 18-mer Synthetic Peptide Derived from the GB virus C E1 Protein as a New HIV-1 Entry Inhibitor. | 26905802 | Biochim Biophys Acta. 2016 Jun;1860(6):1139-48. | Gómara MJ, Sánchez-Merino V, Paús A, Merino-Mansilla A, Gatell JM, Yuste E, Haro I. | nan |
...
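The nan entries in the last column appear because not every record page has every field: when a list of dicts is turned into a DataFrame, keys missing from a record are filled with NaN. A rough plain-Python sketch of that behaviour, with made-up records (the "Mechanism" text and the u1/u2 URLs are hypothetical), could look like this:

```python
# Two hypothetical scraped records; only the second has a "Mechanism" field.
records = [
    {"url": "u1", "DRAMP ID": "DRAMP29276", "Sequence Length": "18"},
    {"url": "u2", "DRAMP ID": "DRAMP29277", "Sequence Length": "18",
     "Mechanism": "hypothetical text"},
]

# Collect the column names in order of first appearance across all records.
columns = []
for rec in records:
    for key in rec:
        if key not in columns:
            columns.append(key)

# Build one row per record; a missing key becomes None
# (pandas would show NaN in that cell instead).
table = [[rec.get(col) for col in columns] for rec in records]

print(columns)   # ['url', 'DRAMP ID', 'Sequence Length', 'Mechanism']
print(table[0])  # ['u1', 'DRAMP29276', '18', None]
```

This is why building one dict per record and appending it to a list is convenient here: the column set does not have to be known in advance.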
Upvotes: 1