munir.aygun

Reputation: 442

Beautiful Soup is not returning the actual div tag

I am trying to scrape Urban Dictionary with Python, but I am running into some issues.

First, I decided to scrape the

<div class = def-panel ...

div tags, which contain each word's information, such as the meaning, examples, and contributor. These def-panel div tags are inside

<div id = "content" ...

or XPath

//*[@id="content"]

This is my class for simple operations on that website:

import requests
from bs4 import BeautifulSoup


class UrbanDict:
    URL = "https://www.urbandictionary.com/"
    search_form = "define.php?term={}"

    def __init__(self):
        pass

    def get_soup_response(self, link):
        # Download the page and parse it with the built-in html.parser
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        return soup

    def search(self, word):
        # Build the search URL for the word and return the parsed page
        soup = self.get_soup_response(self.URL + self.search_form.format(word))
        return soup

And this is the code I use to test the UrbanDict class:

if __name__ == "__main__":
    urban = UrbanDict()  # Creating the object
    soup = urban.search("world")  # Getting the page for the word "world"
    defpanels = soup.find("div", {"id": "content"}).find_all("div", {"class": "def-panel"})  # Getting the panel divs
    for defpanel in defpanels:  # Iterating over the panel divs
        word = defpanel.find("div", {"class": "def-header"}).text  # Checking that this is the correct div
        if word.lower() == "world":
            print("=" * 64)
            meaning_div = defpanel.find("div", {"class": "meaning"})  # Getting the meaning div of the word
            example_div = defpanel.find("div", {"class": "example"})  # Getting the example div of the word

            print(meaning_div)

When I print the divs, I see a mismatch: the div markup printed without prettify is not the same as the markup printed with prettify. This is the code I wrote to check that:

if __name__ == "__main__":
    urban = UrbanDict()  # Creating the object
    soup = urban.search("world")  # Getting the page for the word "world"
    defpanels = soup.find("div", {"id": "content"}).find_all("div", {"class": "def-panel"})  # Getting the panel divs
    print("Prettify used \n")

    print(defpanels[2].find("div", {"class": "meaning"}).prettify(encoding="utf-8").decode("utf-8"))

    print("=" * 48)
    print("\nPrettify NOT used \n")

    print(defpanels[2].find("div", {"class": "meaning"}))

And the output:

Prettify used 

<div class="meaning">
 A language, derived from English (or English-English, American-English etc. etc. ad nauseam).
 <br/>
 This is the de facto language of international commerce, finance, shipping, aviation, the web, etc.
 <br/>
 It has many dialects.
 <br/>
 Chinglish, Singlish,
 <a class="autolink" href="/define.php?term=Franglais" onclick="ga('send', 'event', 'Autolink', 'Click', &quot;Franglais&quot;);">
  Franglais
 </a>
 and Spanglish spring to mind.
 <br/>
 Acccents include Canadian - which might be boring,
 <a class="autolink" href="/define.php?term=Strine" onclick="ga('send', 'event', 'Autolink', 'Click', &quot;Strine&quot;);">
  Strine
 </a>
 , Kiwi,
 <a class="autolink" href="/define.php?term=Estuary" onclick="ga('send', 'event', 'Autolink', 'Click', &quot;Estuary&quot;);">
  Estuary
 </a>
 , Scouse, Cockney and Hindglish.
 <br/>
 There is one recognised speech impediment
 <br/>
 - this is known as geordie
</div>

================================================

Prettify NOT used 

<br/>Chinglish, Singlish, <a class="autolink" href="/define.php?term=Franglais" onclick="ga('send', 'event', 'Autolink', 'Click', &quot;Franglais&quot;);">Franglais</a<br/>Acccents include Canadian - which might be boring, <a class="autolink" href="/define.php?term=Strine" onclick="ga('send', 'event', 'Autolink', 'Click', &quot;Strine&quot;);">Strine</a>, Kiwi, <a class="autolink" href="/define.php?term=Estuary" onclick="ga('send', 'event', 'Autolink', 'Click', &quot;Estuary&quot;);">Estuary</a>,<br/>- this is known as geordie</div>mpediment

As you can see, there is a mismatch. Why does this happen?

Upvotes: 1

Views: 214

Answers (1)

Andrej Kesely

Reputation: 195418

The problem is the parser being used. html.parser and lxml parse the tags found in the page incorrectly. Use html5lib (it has to be installed separately, e.g. with pip install html5lib) to obtain the best results:

import requests
from bs4 import BeautifulSoup


term = 'world'
url = 'https://www.urbandictionary.com/define.php'
soup = BeautifulSoup(requests.get(url, params={'term': term}).content, 'html5lib')  # <-- use html5lib

for r, m, e in zip(soup.select('.ribbon'),
                   soup.select('.meaning'),
                   soup.select('.example')):
    if 'Word of the Day' in r.text:
        continue
    print(m.text)
    print()
    print(e.text)
    print('-' * 120)

Prints:

the f***d off place where we live

violence, death, corruption- tis the world we live in
------------------------------------------------------------------------------------------------------------------------
A language, derived from English (or English-English, American-English etc. etc. ad nauseam).
This is the de facto language of international commerce, finance, shipping, aviation, the web, etc.
It has many dialects. 
Chinglish, Singlish, Franglais and Spanglish spring to mind.
Acccents include Canadian - which might be boring, Strine, Kiwi, Estuary, Scouse, Cockney and Hindglish.
There is one recognised speech impediment
- this is known as geordie

If you understand this, you understand World.
------------------------------------------------------------------------------------------------------------------------
A word that needs to be defined. Basically its a sphere, floating in space. It has both land and Sea...oh and some air too. it is also called earth. Its pretty tight.

Urban Dictionary. Define your world.
------------------------------------------------------------------------------------------------------------------------
What I rocked last night.

Yeah, he rocked my world.
------------------------------------------------------------------------------------------------------------------------
An alternate term for Earth, the planet we live on, the third from the sun.

Generally, "world" does not describe the physical planet, but rather the community within it.

The world is starting to lose it...
All around the world, people have McDonalds.
What a wonderful world.
------------------------------------------------------------------------------------------------------------------------
a messed up circle where a bunch of crazy kids live. people from marz jupiter venus ect. like to call these crazy kids humans.

"so, where do you live?"
"i live in that circle looking thing, we name that fucked up place the world"
"oh dear"
------------------------------------------------------------------------------------------------------------------------
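
To compare what each parser actually returns, independent of how the terminal displays it, you can print the repr() of the tag's string instead of the string itself. This is only a minimal sketch, assuming the raw markup might contain control characters such as carriage returns. The sample string and the helper names show_raw and normalize_newlines are made up for illustration and are not taken from the page:

```python
def show_raw(text: str) -> str:
    """Return the text with control characters shown as escape sequences."""
    return repr(text)


def normalize_newlines(text: str) -> str:
    """Turn Windows (\\r\\n) and bare carriage-return (\\r) line endings into \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical stand-in for str(tag); NOT real output from Urban Dictionary
sample = "first line\r<br/>second line\r\n</div>"

print(sample)                      # a terminal may draw this with overlapping text
print(show_raw(sample))            # every character is visible, including \r and \n
print(normalize_newlines(sample))  # safe to print line by line
```

With a real page you would pass str(tag) (for example the "meaning" div) to show_raw for each parser and compare the results.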

Upvotes: 1
