Reputation: 608
I have a list of the following authors for Google Scholar papers: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print the titles of the papers that are listed for at least 3 of these authors.
You can get a dictionary of paper info from each author using Scholarly:
from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)
The output looks something like this (just a small excerpt from what you'd get for one author):
{'bib': {'cites': '69',
'title': 'Chalearn looking at people and faces of the world: Face '
'analysis workshop and challenge 2016',
'year': '2016'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
'source': 'citations'},
{'bib': {'cites': '21',
'title': 'The NoXi database: multimodal recordings of mediated '
'novice-expert interactions',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
'source': 'citations'},
{'bib': {'cites': '11',
'title': 'Automatic habitat classification using image analysis and '
'random forest',
'year': '2014'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
'source': 'citations'},
{'bib': {'cites': '10',
'title': 'AutoRoot: open-source software employing a novel image '
'analysis approach to support fully-automated plant '
'phenotyping',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
'source': 'citations'}
How can I collect the bib, and specifically the title, for papers which are present for three or more out of the four authors?
EDIT: in fact, it's been pointed out that id_citations is not unique for each paper, my mistake. Better to just use title itself.
Upvotes: 1
Views: 383
Reputation: 2791
Expanding on my comment, you can achieve this using Pandas groupby:
import pandas as pd
from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
frames = []
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with this author's publications
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())
# joining all author DataFrames; ignore_index gives every row a unique label,
# so the bib columns assigned below line up with the right rows
df = pd.concat(frames, axis=0, ignore_index=True)
# taking bib dict into separate columns, selected by name so the key order in bib does not matter
df[['title', 'cites', 'year']] = pd.DataFrame(df.bib.to_list())[['title', 'cites', 'year']]
# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()
# locating the unique titles for all publications with n_authors >= 2
# (use >= 3 for the "at least 3 authors" threshold in the question)
output = n_authors[n_authors >= 2].index
This finds 202 papers which have 2 or more of the authors in that list (out of 774 total papers). Here is an example of the output:
Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
'722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
'A Luminescent One-Dimensional Copper (I) Polymer',
'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
...
'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
'Virginia review of sociology',
'Wildlife use of log landings in the White Mountain National Forest',
'Yttrium 1995',
'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
'[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
dtype='object', name='title', length=202)
Since all of the data is in Pandas, you can also explore which authors are attached to each paper, as well as all of the other information available in the author.publications array coming in from scholarly.
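To illustrate that kind of exploration without hitting Google Scholar, here is a minimal sketch on made-up rows. The titles, the "Author 1" style names, and the `summary` and `shared` variable names are placeholders rather than anything returned by scholarly; the only assumption about the real data is that it has one row per (title, author) pair, as in the concatenated DataFrame.

```python
import pandas as pd

# Made-up (title, author) rows standing in for the concatenated scholarly data.
rows = [
    ("Paper A", "Author 1"),
    ("Paper A", "Author 2"),
    ("Paper A", "Author 3"),
    ("Paper A", "Author 1"),  # duplicate listing, counted once by nunique
    ("Paper B", "Author 1"),
    ("Paper B", "Author 2"),
    ("Paper C", "Author 3"),
]
df = pd.DataFrame(rows, columns=["title", "author"])

# One row per title: how many distinct authors list it, and who they are.
summary = df.groupby("title")["author"].agg(
    n_authors="nunique",
    authors=lambda s: ", ".join(sorted(set(s))),
)

# Keep only titles listed by at least 3 distinct authors.
shared = summary[summary["n_authors"] >= 3]
print(shared)
```

Using nunique rather than a plain count means a paper that shows up twice on one author's profile is not mistaken for a paper shared by two authors.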
Upvotes: 3
Reputation: 24691
First, let's convert this into a more friendly format. You say that id_citations is unique for each paper, so it would be the natural hashtable/dict key; given the edit to the question, which notes that it is not, the title is the safer key. We can then map each key to the bib dict and author(s) it appears for, as a list of tuples (bib, author_name).
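from scholarly import scholarly
author_list = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
bibs = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    # take the first matching profile and load its publication list
    author = next(search_query).fill()
    for pub in author.publications:
        bib = pub.__dict__  # dict with 'bib', 'id_citations', ... as printed in the question
        # keyed on title, because id_citations differs between authors (see the question's edit)
        bibs.setdefault(bib['bib']['title'], []).append((bib, author_name))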
Thereafter, we can sort the keys in bibs based on how many authors are attached to them:
most_cited = sorted(bibs.items(), key=lambda k: len(k[1]))
# most_cited is now a list of (key, value) tuples, ordered from fewest to most authors,
# where each tuple is (dict key, [(bib1, author1), (bib2, author2), ...])
and/or filter that list down to papers that appear three or more times:
cited_enough = [tup[1][0][0] for tup in most_cited if len(tup[1]) >= 3]
# Using index [0] in the middle is arbitrary. It can be any position in the
# list, provided the bib objects are identical, but index 0 is guaranteed
# to be there.
# The first index [1] grabs the list rather than the dict key,
# and the last index [0] grabs the bib rather than the author_name.
and now we can retrieve the titles of the papers from there:
paper_titles = [bib['bib']['title'] for bib in cited_enough]
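The steps above can be tried end to end without network access. The following is only a sketch under stated assumptions: the `titles_shared_by` helper, the `mock_pubs` data and the "Author 1" style names are hypothetical, and the case and whitespace normalisation assumes the same paper may be typed slightly differently on different profiles. It also stores authors in a set, so a paper listed twice on one profile counts once.

```python
from collections import defaultdict


def titles_shared_by(pubs_by_author, min_authors=3):
    """Return titles listed under at least `min_authors` distinct authors."""
    authors_by_title = defaultdict(set)
    display_title = {}
    for author_name, titles in pubs_by_author.items():
        for title in titles:
            # Normalise case and spacing so near-identical titles share a key.
            key = " ".join(title.lower().split())
            authors_by_title[key].add(author_name)
            # Remember the first spelling seen, for display.
            display_title.setdefault(key, title)
    # Most widely shared titles first; ties broken alphabetically by key.
    ranked = sorted(authors_by_title.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [display_title[key] for key, authors in ranked if len(authors) >= min_authors]


# Made-up records standing in for what scholarly would return.
mock_pubs = {
    "Author 1": ["Paper A", "Paper B", "Paper C"],
    "Author 2": ["paper a", "Paper B"],
    "Author 3": ["Paper A ", "Paper B", "Paper D"],
    "Author 4": ["Paper A", "Paper D"],
}

shared = titles_shared_by(mock_pubs)
print(shared)
```

With real data, `pubs_by_author` would be built by collecting each author's publication titles inside the loop over author_list.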
Upvotes: 1