Reputation: 99
I have a Python file for web scraping, scrapper.py:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://en.wikipedia.org/wiki/Willis').text
soup = BeautifulSoup(source, 'lxml')
def my_function():
    heading = soup.find('h1', {'id': 'firstHeading'}).text
    print(heading)
    print()
    for item in soup.select("#mw-content-text"):
        required_data = [p_item.text for p_item in item.select("p")][1:3]
        print('\n'.join(required_data).encode('utf-8'))
    Willis = soup.find("caption", {"class": "fn org"}).text
    print(Willis)
    print()
I want to use spaCy to extract entities from the text scraped by scrapper.py. This is pyspacy.py:
import spacy
import scrapper
entity_list = []
nlp = spacy.load("en_core_web_sm")
doc = nlp(scrapper.my_function())
for entity in doc.ents:
    entity_list.append((entity.text, entity.label_))
print(entity_list)
In the terminal it just prints the scraped data, followed by this error:
Traceback (most recent call last):
  File "hakuna_spacy.py", line 12, in <module>
    doc = nlp(printwo.pubb())
  File "C:\Users\Hp\AppData\Local\Programs\Python\Python37\lib\site-packages\spacy\language.py", line 423, in __call__
    if len(text) > self.max_length:
TypeError: object of type 'NoneType' has no len()
What am I doing wrong? Can someone please explain?
Upvotes: 0
Views: 379
Reputation: 1804
In your initial code snippet, the problem was that pubb prints text to stdout but does not return a value. You could try instead:
def pubb():
    return 'hello, world'
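To make the difference concrete, here is a small sketch that needs neither spaCy nor the scraper. The function names prints_only and returns_text are made up for illustration. It shows that a function which only prints hands None back to its caller, and that calling len() on None raises the same TypeError as in the traceback.

```python
def prints_only():
    # Writes to stdout, then falls off the end of the function,
    # so the caller receives None.
    print('hello, world')


def returns_text():
    # Hands the string back to the caller instead of printing it.
    return 'hello, world'


result = prints_only()

try:
    # spaCy's language.py runs len(text) on its input (see the traceback),
    # which is where a None argument blows up.
    len(result)
except TypeError as exc:
    error_message = str(exc)

text_length = len(returns_text())
```

Here result is None, so len(result) fails, while the value from returns_text() can be measured and passed on without trouble.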
[Edit]:
In the edited version, there are some other issues I can see.
The fetch works, so:
>>> source = requests.get('https://en.wikipedia.org/wiki/Willis').text
>>> len(source)
36836
bs4 correctly finds the heading too:
>>> soup = BeautifulSoup(source,'lxml')
>>> soup.find('h1',{'id':'firstHeading'}).text
'Willis'
bs4 also finds an item in the content section (just 1):
>>> len(soup.select("#mw-content-text"))
1
The trouble then is that the paragraph slice you take from that section ([1:3]) comes back empty:
>>> soup.select("#mw-content-text")[0].select("p")[1:3]
[]
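An empty result here does not raise anything, which is probably why it went unnoticed. The sketch below uses a plain list as a hypothetical stand-in for item.select("p"), assuming the section holds at most one paragraph (the paragraph text is made up). Slicing past the end of a list quietly gives an empty list, so the join and encode steps in your loop end up printing an empty bytes object.

```python
# Hypothetical stand-in for the text of item.select("p"):
# a section with a single paragraph.
paragraphs = ['Willis may refer to:']

# Slicing beyond the end of a list never raises; it just yields [].
selected = paragraphs[1:3]

# Joining an empty list gives '', and encoding that gives b''.
joined = '\n'.join(selected)
encoded = joined.encode('utf-8')
```

If that assumption holds for the page, a slice starting at 0, or no slice at all, would keep the one paragraph that is there.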
And it doesn't find the caption:
>>> soup.find("caption",{"class":"fn org"})
>>>
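Since find returns None when nothing matches, calling .text on that result fails with an AttributeError. One way to guard against it is a small helper like the one below. The name tag_text and its default parameter are my own invention, not part of bs4, and None stands in for the failed caption lookup so the sketch runs without a network fetch.

```python
def tag_text(tag, default=''):
    """Return tag.text, or `default` when the lookup found nothing."""
    if tag is None:
        return default
    return tag.text


# soup.find("caption", {"class": "fn org"}) gave None above,
# so None stands in for that result here.
missing_caption = None
caption_text = tag_text(missing_caption)
```

With the real soup object you would write something like tag_text(soup.find("caption", {"class": "fn org"})) and get an empty string instead of a crash.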
You also have the pre-existing issue that you are not returning any text from my_function, so the spacy call receives None, which gives you the exception. What do you want my_function to return?
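If the answer is "the heading plus the paragraph text as one string", a rough sketch could look like this. Several things here are assumptions on my part: my_function takes the soup object as a parameter instead of using the module-level one, build_text is a made-up helper, and all non-empty paragraphs in #mw-content-text are kept rather than the [1:3] slice.

```python
def build_text(heading, paragraphs):
    """Join the scraped pieces into one string, skipping empty ones."""
    pieces = [heading] + list(paragraphs)
    return '\n'.join(p.strip() for p in pieces if p and p.strip())


def my_function(soup):
    """Return the scraped text instead of printing it.

    `soup` is expected to be a BeautifulSoup object like the one
    built in scrapper.py.
    """
    heading_tag = soup.find('h1', {'id': 'firstHeading'})
    # find() returns None when there is no match, so guard before .text
    heading = heading_tag.text if heading_tag is not None else ''
    # Assumption: keep every paragraph inside the content section.
    paragraphs = [p.text for p in soup.select('#mw-content-text p')]
    return build_text(heading, paragraphs)
```

In pyspacy.py the call would then become doc = nlp(scrapper.my_function(scrapper.soup)), and nlp receives a str rather than None.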
Upvotes: 1