Reputation: 131
I have an XML file that contains information about a Google Scholar record. In particular, I want to extract the abstract under the tag "jats:abstract". I used the call
xml_abstract = soup.find_all("jats:abstract")
on the whole XML file and got this:
<class 'bs4.element.ResultSet'>
[<jats:abstract xml:lang="en" xmlns:jats="http://www.ncbi.nlm.nih.gov/JATS1">
<jats:title>Summary</jats:title>
<jats:p>Established primary prevention strategies of cardiovascular diseases are based on understanding of risk factors, but whether the same risk factors are associated with atrial fibrillation (AF) remains unclear. We conducted a systematic review and field synopsis of the associations of 23 cardiovascular risk factors and incident AF, which included 84 reports based on 28 consented and four electronic health record cohorts of 20,420,175 participants and 576,602 AF events. We identified 3-19 reports per risk factor and heterogeneity in AF definition, quality of reporting, and adjustment. We extracted relative risks (RR) and 95 % confidence intervals [CI] and visualised the number of reports with inverse (RR [CI]<1.00), or direct (RR [CI]>1.00) associations. For hypertension (13/17 reports) and obesity (19/19 reports), there were direct associations with incident AF, as there are for coronary heart disease (CHD). There were inverse associations for non-White ethnicity (5/5 reports, with RR from 0.35 to 0.84 [0.82–0.85]), total cholesterol (4/13 reports from 0.76 [0.59–0.98] to 0.94 [0.90–0.97]; 8/13 reports with non-significant inverse associations), and diastolic blood pressure (2/11 reports from 0.87 [0.78–0.96] to 0.92 [0.85–0.99]; 5/11 reports with non-significant inverse associations), and direct associations for taller height (7/10 reports from 1.03 [1.02–1.05] to 1.92 [1.38–2.67]), which are in the opposite direction of known associations with CHD. A systematic evaluation of the available evidence suggests similarities as well as important differences in the risk factors for incidence of AF as compared with other cardiovascular diseases, which has implications for the primary prevention strategies for atrial fibrillation.</jats:p>
</jats:abstract>]
How can I extract only the text of the abstract, without tags (like jats:abstract) or other characters?
Upvotes: 1
Views: 2364
Reputation: 1724
As maloney13 already mentioned, find_all() returns a list-like ResultSet, so you need to iterate over xml_abstract and handle each element in a loop, rather than trying to extract the text from the whole result at once.
for item in soup.find_all("jats:abstract"):
    # text of the <jats:p> paragraph inside each abstract
    text = item.find("jats:p").text
Or using a list comprehension:
# "\n".join() will return each text element on a new line
text = "\n".join([item.find("jats:p").text for item in soup.find_all("jats:abstract")])
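For comparison, here is a minimal sketch of the same extraction using only the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is not the approach from the question, just an alternative that runs without extra packages. The sample record, its <record> wrapper, the nested <jats:italic> tag and the function name extract_abstracts are assumptions made for illustration; only the namespace URI is taken from the output shown in the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:jats attribute in the question's output.
JATS_URI = "http://www.ncbi.nlm.nih.gov/JATS1"
NS = {"jats": JATS_URI}

# Hypothetical, shortened sample record; the <record> wrapper is an assumption.
SAMPLE_XML = """<record xmlns:jats="http://www.ncbi.nlm.nih.gov/JATS1">
  <jats:abstract xml:lang="en">
    <jats:title>Summary</jats:title>
    <jats:p>Established primary prevention strategies are based on
    understanding of <jats:italic>risk factors</jats:italic>.</jats:p>
  </jats:abstract>
</record>"""


def extract_abstracts(xml_text):
    """Return one plain-text string per <jats:abstract>, paragraphs joined by newlines."""
    root = ET.fromstring(xml_text)
    abstracts = []
    # iter() with the {uri}tag form also matches the root element itself.
    for abstract in root.iter("{" + JATS_URI + "}abstract"):
        paragraphs = []
        for p in abstract.findall("jats:p", NS):
            # itertext() also picks up text inside nested inline tags;
            # split()/join() collapses newlines and repeated spaces.
            paragraphs.append(" ".join("".join(p.itertext()).split()))
        abstracts.append("\n".join(paragraphs))
    return abstracts


print(extract_abstracts(SAMPLE_XML))
```

The title is skipped because only <jats:p> children are read, and the whitespace normalisation removes the stray newlines that the question calls "other characters".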
Have a look at the examples in the online IDE about parsing Google Scholar (using bs4 and requests, plus an API alternative), or at a blog post I wrote about how to parse data from Google Scholar using Python.
Alternatively, if you want to parse data directly from Google Scholar, you can use the Google Scholar API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to reinvent the wheel or figure out how to bypass blocks from Google or other search engines, so you can focus on the data you want to extract instead. There's also no need to maintain the parser over time if something in the HTML changes.
Example code to parse organic results data:
from serpapi import GoogleSearch
import os, json  # json just for pretty output

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar",
    "q": "Samsung",  # search query
}

search = GoogleSearch(params)
results = search.get_dict()

# temporarily store the extracted data
data = []

for result in results['organic_results']:
    data.append({
        'result_id': result['result_id'],  # result id is needed if you need to parse Cite results
        'title': result['title'],
        'link': result['link'],
        'publication_info': result['publication_info']['summary'],
        'snippet': result['snippet'],
        'cited_by': result['inline_links']['cited_by']['link'],
        'related_versions': result['inline_links']['related_pages_link'],
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
# part of the output
'''
[
  {
    "result_id": "U8bh6Ca9uwQJ",
    "title": "“What? I thought Samsung was Japanese”: accurate or not, perceived country of origin matters",
    "link": "https://www.emerald.com/insight/content/doi/10.1108/02651331111167589/full/html",
    "publication_info": "P Magnusson, SA Westjohn… - International Marketing …, 2011 - emerald.com",
    "snippet": "Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …",
    "cited_by": "https://scholar.google.com/scholar?cites=341074171610121811&as_sdt=2005&sciodt=0,5&hl=en",
    "related_versions": "https://scholar.google.com/scholar?q=related:U8bh6Ca9uwQJ:scholar.google.com/&scioq=samsung&hl=en&as_sdt=0,5"
  }
  # other results
]
'''
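The loop above indexes nested keys such as result['inline_links']['cited_by']['link'] directly, which raises a KeyError if a result happens to lack one of them. Whether a given response omits those keys is an assumption here, not something the example output confirms. A minimal sketch of a defensive lookup follows; the helper name dig and the sample dictionaries are hypothetical and not part of the serpapi package.

```python
def dig(mapping, *keys, default=None):
    """Follow keys through nested dicts, returning default if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical result dicts shaped like the output above; the second one
# deliberately lacks "inline_links" to show the fallback.
sample_results = [
    {"title": "A", "inline_links": {"cited_by": {"link": "https://example.com/cites"}}},
    {"title": "B"},
]

data = [
    {"title": r.get("title"), "cited_by": dig(r, "inline_links", "cited_by", "link")}
    for r in sample_results
]
print(data)
```

With this helper, a missing field ends up as None in the collected data instead of stopping the loop partway through the results.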
Example code to parse cite data:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),  # or just copy-paste your API key without os.getenv()
    "engine": "google_scholar_cite",
    "q": "FDc6HiktlqEJ"  # unique id from organic results
}

search = GoogleSearch(params)
results = search.get_dict()

for cite in results['citations']:
    print(f'Title: {cite["title"]}\nSnippet: {cite["snippet"]}\n')
# part of the output
'''
Title: MLA
Snippet: Schwertmann, U. T. R. M., and Reginald M. Taylor. "Iron oxides." Minerals in soil environments 1 (1989): 379-438.
Title: APA
Snippet: Schwertmann, U. T. R. M., & Taylor, R. M. (1989). Iron oxides. Minerals in soil environments, 1, 379-438.
'''
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 235
I think you would do:
for item in xml_abstract:
    p_tag = item.find('jats:p')
    text = p_tag.text
Does that work for you?
Upvotes: 1