Reputation: 3853
I am trying to scrape some data from a website whose HTML looks like this:
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
<div class="field-label">Also Known As</div>
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
<div class="field-item">Mycology Tests</div>
<div class="field-item">Fungal Molecular Tests</div>
<div class="field-item">Potassium Hydroxide Preparation</div>
<div class="field-item">Calcofluor White Stain</div>
</div>
</div>
The output I want is KOH Prep, Fungal Smear, Culture, Antigen and Antibody Tests, Mycology Tests, Fungal Molecular Tests...
But I don't get any output. My code is as follows.
def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    if(soup.find('div', class_='field-label')!= None):
        other_names = [
            tag.next.next.get_text(strip=True, separator='|').split('|')
            for tag in soup.find('div', class_='field-label')
        ]
        return (other_names[0])
    else:
        return None
The actual link for the web page is this
Upvotes: 0
Views: 67
Reputation: 1449
I tested the core of your scraping code against your sample HTML as follows:
from bs4 import BeautifulSoup
html_content='''
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
<div class="field-label">Also Known As</div>
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
<div class="field-item">Mycology Tests</div>
<div class="field-item">Fungal Molecular Tests</div>
<div class="field-item">Potassium Hydroxide Preparation</div>
<div class="field-item">Calcofluor White Stain</div>
</div>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html_content, 'html.parser')
if(soup.find('div', class_='field-label')!= None):
    other_names = [
        tag.next.next.get_text(strip=True, separator='|').split('|')
        for tag in soup.find('div', class_='field-label')
    ]
    print (other_names)
The content of other_names is:
[['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']]
That matches your target result.
Since your code produces the expected output on this HTML, the problem is probably elsewhere, for example in the sub_url being passed in.
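To narrow down where it goes wrong, it may help to check what the request actually returned before parsing. The following is only a sketch: diagnose_page is a hypothetical helper name, and it assumes you pass it the response text and status code yourself.

```python
from bs4 import BeautifulSoup


def diagnose_page(html, status_code=200):
    """Return a short message describing why the synonyms block may be missing.

    Hypothetical helper for debugging; not part of the original code.
    """
    # A non-200 status usually means the server never sent the real page.
    if status_code != 200:
        return f"request failed with HTTP status {status_code}"

    soup = BeautifulSoup(html, "html.parser")

    # The original code bails out with None when this element is absent.
    if soup.select_one("div.field-label") is None:
        return "no div.field-label in the fetched HTML"

    if soup.select_one("div.field-items") is None:
        return "div.field-label found, but no div.field-items"

    return "ok"
```

Inside get_similar_names you could call it as diagnose_page(response.text, response.status_code) and print the result. If it reports a missing element even though the browser shows it, the server may be sending different HTML to the script than to the browser.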
Upvotes: 1
Reputation: 25048
There are different approaches to get the names.
#1 - Get all names joined as a string, as in your expected output:
soup.select_one('div.field-items').get_text(',',strip=True)
Output -> KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
#2 - Get all names as a list:
[name.get_text() for name in soup.select('div.field-items > div')]
Output -> ['KOH Prep','Fungal Smear, Culture, Antigen and Antibody Tests','Mycology Tests','Fungal Molecular Tests','Potassium Hydroxide Preparation','Calcofluor White Stain']
#3 - Get only the first name, as in your code:
soup.select_one('div.field-items > div').get_text()
Output -> KOH Prep
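One thing to keep in mind when choosing between #1 and #2: some of the names contain commas themselves, so a comma-joined string cannot be split back into the original names reliably. A small sketch on a shortened copy of the sample HTML (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question.
html = """
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
<div class="field-item">Mycology Tests</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.select_one("div.field-items")

# Approach #1: one string, joined with commas.
joined = container.get_text(",", strip=True)

# Approach #2: one list entry per div.field-item.
names = [div.get_text(strip=True) for div in container.select("div.field-item")]

# Splitting the joined string breaks the second name apart.
naive_split = joined.split(",")

print(len(names))        # 3
print(len(naive_split))  # 5
```

If you need to process the names individually later, the list form is the safer choice.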
Example
def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    other_names = soup.select_one('div.field-items').get_text(',',strip=True)
    return other_names
Output
KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
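If the real page has more than one div.field-items block, the first match may not be the "Also Known As" one. A slightly more defensive sketch is shown below. It assumes the wrapper class field-name-field-test-synonyms from your sample HTML is also present on the live page, and parse_similar_names is a hypothetical helper that works on already-fetched HTML.

```python
from bs4 import BeautifulSoup


def parse_similar_names(html):
    """Return the 'Also Known As' names as a list, or None if the block is absent.

    Assumes the synonyms wrapper carries the class
    'field-name-field-test-synonyms', as in the sample HTML.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Scope the search to the synonyms wrapper so other fields are ignored.
    items = soup.select("div.field-name-field-test-synonyms div.field-item")
    if not items:
        return None

    return [item.get_text(strip=True) for item in items]


sample = """
<div class="field-wrapper field field-name-field-test-synonyms">
<div class="field-label">Also Known As</div>
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Mycology Tests</div>
</div>
</div>
"""

print(parse_similar_names(sample))            # ['KOH Prep', 'Mycology Tests']
print(parse_similar_names("<p>no data</p>"))  # None
```

In get_similar_names you would pass response.content (or response.text) to this function instead of calling select_one directly, which also avoids an AttributeError when the block is missing.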
Upvotes: 2