udi

Reputation: 3853

Web Scraping does not give the desired results

I am trying to scrape some data from a website. The relevant HTML looks like this:

<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
      <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">KOH Prep</div>
          <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
          <div class="field-item">Mycology Tests</div>
          <div class="field-item">Fungal Molecular Tests</div>
          <div class="field-item">Potassium Hydroxide Preparation</div>
          <div class="field-item">Calcofluor White Stain</div>
      </div>
</div>

The output I want is: KOH Prep, Fungal Smear, Culture, Antigen and Antibody Tests, Mycology Tests, Fungal Molecular Tests...

But I don't get any output. My code is as follows.

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    if(soup.find('div', class_='field-label')!= None):
        other_names = [
            tag.next.next.get_text(strip=True, separator='|').split('|')
            for tag in soup.find('div', class_='field-label')
        ]
        return (other_names[0])
    else:
        return None

The actual link for the web page is this

Upvotes: 0

Views: 67

Answers (2)

Nour-Allah Hussein

Reputation: 1449

I tested the core of your scraping code against the HTML from your question, as follows:

from bs4 import BeautifulSoup

html_content='''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
      <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">KOH Prep</div>
          <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
          <div class="field-item">Mycology Tests</div>
          <div class="field-item">Fungal Molecular Tests</div>
          <div class="field-item">Potassium Hydroxide Preparation</div>
          <div class="field-item">Calcofluor White Stain</div>
      </div>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html_content, 'html.parser')
if soup.find('div', class_='field-label') is not None:
    other_names = [
        tag.next.next.get_text(strip=True, separator='|').split('|')
        for tag in soup.find('div', class_='field-label')
    ]
    print(other_names)

The content of other_names is:

[['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']]

That matches your target result.

Since your code produces the target result on this HTML, the problem is probably elsewhere, for example in the sub_url that you pass to the function.
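
One way to narrow this down is to look at what the request actually returned before parsing it. The following is only a minimal sketch: diagnose_fetch is a hypothetical helper name, and the default marker 'field-label' is simply the class from the HTML in the question. It assumes you pass in response.status_code and response.text from your own requests.get(sub_url) call.

```python
def diagnose_fetch(status_code, html, marker='field-label'):
    """Return a short explanation of why scraping might yield nothing."""
    if status_code != 200:
        # The server did not return the page normally.
        return f'request failed with HTTP status {status_code}'
    if marker not in html:
        # The page came back, but without the markup the scraper expects.
        return f'{marker!r} is missing from the returned HTML'
    return 'the expected markup is present'


# Hypothetical usage with the question's function argument:
#   response = requests.get(sub_url)
#   print(diagnose_fetch(response.status_code, response.text))
print(diagnose_fetch(200, '<div class="field-label">Also Known As</div>'))
print(diagnose_fetch(403, ''))
```

If the marker is missing even though the status is 200, the HTML served to the script likely differs from what you see in the browser, so the parsing code is not the part to fix.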

Upvotes: 1

HedgeHog

Reputation: 25048

There are different approaches to get the names.

#1 - Get all names joined as a single string, as in your expected output:

soup.select_one('div.field-items').get_text(',',strip=True)

Output -> KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain

#2 - Get all names as a list:

[name.get_text() for name in soup.select('div.field-items > div')]

Output -> ['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']

#3 - Get only the first name, as in your code:

soup.select_one('div.field-items > div').get_text()

Output -> KOH Prep

Example

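import requests
from bs4 import BeautifulSoup

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Join the text of every child of div.field-items with commas
    other_names = soup.select_one('div.field-items').get_text(',',strip=True)

    return other_names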

Output

KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
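
Note that select_one() returns None when nothing matches, so calling get_text() on its result raises an AttributeError on pages without this block. A hedged variant of approach #2 that keeps the None fallback from the question might look like the sketch below. extract_names is a hypothetical name, the sample HTML is a shortened copy of the snippet in the question, and the function takes the HTML text directly so that it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def extract_names(html):
    """Return the synonym names as a list, or None if the block is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.select_one('div.field-items')
    if container is None:
        # No "Also Known As" block on this page.
        return None
    return [item.get_text(strip=True) for item in container.select('div.field-item')]


sample = '''
<div class="field-label">Also Known As</div>
<div class="field-items">
  <div class="field-item">KOH Prep</div>
  <div class="field-item">Mycology Tests</div>
</div>
'''
print(extract_names(sample))                # ['KOH Prep', 'Mycology Tests']
print(extract_names('<p>no synonyms</p>'))  # None
```

Returning a list rather than a comma-joined string also avoids ambiguity for names that contain commas themselves, such as "Fungal Smear, Culture, Antigen and Antibody Tests".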

Upvotes: 2
