Ajax

Reputation: 99

Extract entities with spaCy

I have a Python file for web scraping, scrapper.py:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://en.wikipedia.org/wiki/Willis').text
soup = BeautifulSoup(source, 'lxml')

def my_function():
    heading = soup.find('h1', {'id': 'firstHeading'}).text
    print(heading)
    print()

    for item in soup.select("#mw-content-text"):
        required_data = [p_item.text for p_item in item.select("p")][1:3]
        print('\n'.join(required_data).encode('utf-8'))

    Willis = soup.find("caption", {"class": "fn org"}).text
    print(Willis)
    print()

I want to use spaCy to extract entities from the text scraped by scrapper.py, in pyspacy.py:

import spacy
import scrapper

entity_list = []

nlp = spacy.load("en_core_web_sm")


doc = nlp(scrapper.my_function())

for entity in doc.ents:
    entity_list.append((entity.text, entity.label_))
print(entity_list)

It just prints the scraped data to the terminal, along with this error:

Traceback (most recent call last):
  File "hakuna_spacy.py", line 12, in <module>
    doc = nlp(printwo.pubb())
  File "C:\Users\Hp\AppData\Local\Programs\Python\Python37\lib\site-packages\spacy\language.py", line 423, in __call__
    if len(text) > self.max_length:
TypeError: object of type 'NoneType' has no len()

What am I doing wrong? Can someone please explain?

Upvotes: 0

Views: 379

Answers (1)

w08r

Reputation: 1804

In your initial code snippet, the problem was that pubb prints text to stdout but does not return a value. Try instead:

def pubb():
    return 'hello, world'
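
With a return value in place, the spaCy call has a real string to measure, so the TypeError goes away. A minimal sketch along the lines of your pyspacy.py, assuming en_core_web_sm is installed:

import spacy

nlp = spacy.load("en_core_web_sm")

def pubb():
    return 'hello, world'

# nlp() now receives a str, so len(text) works inside spaCy's pipeline
doc = nlp(pubb())
print([(ent.text, ent.label_) for ent in doc.ents])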

[Edit]:

In the edited version, there are some other issues I can see.

The fetch works, so:

>>> source = requests.get('https://en.wikipedia.org/wiki/Willis').text
>>> len(source)
36836

bs4 correctly finds the heading too:

>>> soup = BeautifulSoup(source,'lxml')
>>> soup.find('h1',{'id':'firstHeading'}).text
'Willis'

bs4 also finds an item in the content section (just 1):

>>> len(soup.select("#mw-content-text"))
1

The trouble then is that it doesn't find any content per se:

>>> soup.select("#mw-content-text")[0].select("p")[1:3]
[]

And it doesn't find the caption:

>>> soup.find("caption",{"class":"fn org"})                                                                                                                                                                   
>>>

You also have the pre-existing issue that my_function does not return any text, so the call that passes its return value into spaCy is handed None, which raises the exception. What do you want my_function to return?
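
For example, if you want the heading plus the first couple of paragraphs, one rough sketch (assuming those selectors actually match on the page you fetch) is to collect the pieces and return them instead of printing:

def my_function():
    # gather the pieces rather than printing them
    heading = soup.find('h1', {'id': 'firstHeading'}).text
    paragraphs = [p.text for p in soup.select("#mw-content-text p")][1:3]
    # hand back a single string so the caller gets a str, not None
    return '\n'.join([heading] + paragraphs)

Then doc = nlp(scrapper.my_function()) in pyspacy.py has text to work with and doc.ents can be populated.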

Upvotes: 1
