Python web scraping PubMed abstract - "Abstract" is fused with the first word of the abstract (e.g., "AbstractINTRODUCTION:")

Question

I am web scraping abstracts from Pubmed.gov and while I'm able to get the text I need, the word "abstract" is being combined with the first word of the abstract. Here's a sample abstract: https://www.ncbi.nlm.nih.gov/pubmed/30470520

For example, the first word becomes "AbstractBACKGROUND:"

The problem is that an abstract sometimes could be "AbstractBACKGROUND", "AbstractINTRODUCTION" or another word (I won't know). Nevertheless, it will always have "Abstract" in the beginning. Otherwise, I would just run a replace command and take out the abstract part.

I would prefer either to strip "Abstract" from the word or to have a line break between "Abstract" and the first word, like this:

Abstract

INTRODUCTION:

I know using the replace command won't work, but I wanted to demonstrate that as a n00b, I at least tried. I appreciate any help to make this work! Here's my code below:

import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520', 
'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for l in listofa_urls:
   response = requests.get(l)
   soup = BeautifulSoup(response.content, 'html.parser')
   x = soup.find(class_='abstr').get_text()
   x = x.replace('abstract','abstract: ')
   print(x)

Rakesh · Accepted Answer

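Use re.sub with the re.I flag so the match is case-insensitive. Your str.replace call searches for lowercase 'abstract', which never matches the capitalized "Abstract" in the page text.

Example: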

import re

import requests
from bs4 import BeautifulSoup

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520',
                'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for url in listofa_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    x = soup.find(class_='abstr').get_text()
    print(x.encode("utf-8"))  # before: starts with b'Abstract...'
    # Remove only the leading "Abstract" label (case-insensitive).
    # count=1 with the ^ anchor leaves later uses of the word untouched.
    x = re.sub(r"^\s*abstract", "", x, count=1, flags=re.I)
    print(x.encode("utf-8"))  # after: label removed

Output:

b'AbstractBACKGROUND: The amount of insulin needed to...
b'BACKGROUND: The amount of insulin needed to ....

b'AbstractCirrhosis is morbid and increasingly prevalent - ...
b'Cirrhosis is morbid and increasingly prevalent -...
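
The question also asked for the option of keeping "Abstract" on its own line instead of deleting it. Here is a minimal sketch of that variant, assuming the scraped text always begins with the label as in the output above. The helper name separate_heading and its keep_heading parameter are made up for illustration, and the sample string is copied from the output rather than fetched live.

```python
import re

# Hypothetical helper, not part of the original answer.
# Matches the label only at the very start of the text, ignoring case.
_HEADING = re.compile(r"^\s*(abstract)", flags=re.IGNORECASE)


def separate_heading(text, keep_heading=True):
    """Detach a leading 'Abstract' label that is fused to the first word."""
    # Either put the label on its own line or drop it entirely.
    replacement = r"\1\n\n" if keep_heading else ""
    return _HEADING.sub(replacement, text, count=1)


sample = "AbstractBACKGROUND: The amount of insulin needed to..."
print(separate_heading(sample))                      # label on its own line
print(separate_heading(sample, keep_heading=False))  # label removed
```

Text that does not start with the label is returned unchanged. It may also be worth trying get_text(separator="\n") in the scraping loop: if the heading and the abstract body sit in separate elements on the page, that should keep them on separate lines without any regex.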
