Reputation: 519
I am web scraping abstracts from Pubmed.gov and while I'm able to get the text I need, the word "abstract" is being combined with the first word of the abstract. Here's a sample abstract: https://www.ncbi.nlm.nih.gov/pubmed/30470520
For example, the first word becomes "AbstractBACKGROUND:"
The problem is that the fused word could be "AbstractBACKGROUND", "AbstractINTRODUCTION", or something else I can't predict, so I can't simply replace one fixed string. It will always start with "Abstract", though.
I would prefer either to strip "Abstract" from the word or to put a line break between "Abstract" and the first word, like this:
Abstract
INTRODUCTION:
I know using the replace command won't work, but I wanted to demonstrate that as a n00b, I at least tried. I appreciate any help to make this work! Here's my code below:
import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520',
                'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for l in listofa_urls:
    response = requests.get(l)
    soup = BeautifulSoup(response.content, 'html.parser')
    x = soup.find(class_='abstr').get_text()
    x = x.replace('abstract','abstract: ')
    print(x)
Upvotes: 0
Views: 762
Reputation: 82765
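Your str.replace call has no effect because it is case-sensitive: it looks for 'abstract', but the text starts with a capitalized "Abstract". Use re.sub with the re.I flag so the match ignores case.
For example: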
import re

import requests
from bs4 import BeautifulSoup

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520',
                'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for l in listofa_urls:
    response = requests.get(l)
    soup = BeautifulSoup(response.content, 'html.parser')
    x = soup.find(class_='abstr').get_text()
    print(x.encode("utf-8"))  # before: "Abstract" is fused to the first word
    # Remove only the first "abstract" (any case); count=1 leaves later uses of the word alone.
    x = re.sub(r"\babstract", "", x, count=1, flags=re.I)
    print(x.encode("utf-8"))  # after: heading removed
Output:
b'AbstractBACKGROUND: The amount of insulin needed to...
b'BACKGROUND: The amount of insulin needed to ....
b'AbstractCirrhosis is morbid and increasingly prevalent - ...
b'Cirrhosis is morbid and increasingly prevalent -...
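If you would rather keep the heading on its own line (the second option in the question), the same idea can be wrapped in a small helper. This is only a sketch: split_abstract_heading and its keep_heading parameter are names I made up for illustration, and it assumes the scraped text starts with "Abstract" (optionally preceded by whitespace), as in the output above. It runs on hard-coded sample strings, so no request to PubMed is needed to try it.

```python
import re

# Hypothetical helper, not part of requests/bs4: matches a leading "Abstract"
# heading (any case), allowing for whitespace before it.
_HEADING = re.compile(r"^\s*abstract", flags=re.IGNORECASE)


def split_abstract_heading(text, keep_heading=False):
    """Strip the leading 'Abstract', or put it on its own line."""
    replacement = "Abstract\n" if keep_heading else ""
    # The ^ anchor plus count=1 means later uses of the word are untouched.
    return _HEADING.sub(replacement, text, count=1)


# Sample strings shaped like the scraped text shown in the output above.
samples = [
    "AbstractBACKGROUND: The amount of insulin needed",
    "AbstractCirrhosis is morbid and increasingly prevalent",
]
for s in samples:
    print(split_abstract_heading(s))                     # heading removed
    print(split_abstract_heading(s, keep_heading=True))  # heading on its own line
```

In the loop from the answer you would call it as x = split_abstract_heading(x, keep_heading=True) right after get_text(). Text that does not start with "Abstract" is returned unchanged.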
Upvotes: 3