Python for matching paper IDs in Scholarly

Question

I have the following list of authors with Google Scholar profiles: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print the titles of the papers that appear under at least three of these authors.

You can get a dictionary of paper info from each author using Scholarly:

from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)

The output looks something like this (just a small excerpt of what you'd get for one author):

                  {'bib': {'cites': '69',
         'title': 'Chalearn looking at people and faces of the world: Face '
                  'analysis workshop and challenge 2016',
         'year': '2016'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
 'source': 'citations'},
                  {'bib': {'cites': '21',
         'title': 'The NoXi database: multimodal recordings of mediated '
                  'novice-expert interactions',
         'year': '2017'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
 'source': 'citations'},
                  {'bib': {'cites': '11',
         'title': 'Automatic habitat classification using image analysis and '
                  'random forest',
         'year': '2014'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
 'source': 'citations'},
                  {'bib': {'cites': '10',
         'title': 'AutoRoot: open-source software employing a novel image '
                  'analysis approach to support fully-automated plant '
                  'phenotyping',
         'year': '2017'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
 'source': 'citations'}

How can I collect the bib, and specifically the title, for papers that are present for three or more of the four authors?

EDIT: It's been pointed out that id_citations is not unique for each paper, my mistake. It's better to just use the title itself.
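
Since the edit settles on the title as the matching key, here is a minimal sketch of counting distinct authors per title using only the standard library. It assumes the same paper may appear with small differences in capitalisation or spacing under different authors, which is worth checking against your own data. The helper names normalize_title and titles_shared_by, the input shape, and the sample data are all made up for illustration; none of them come from scholarly.

```python
from collections import defaultdict


def normalize_title(title):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def titles_shared_by(papers_by_author, min_authors=3):
    """Return sorted normalised titles listed under at least min_authors authors.

    papers_by_author maps an author name to a list of title strings
    (a hypothetical input shape; build it from whatever scholarly returns).
    """
    authors_per_title = defaultdict(set)
    for author, titles in papers_by_author.items():
        for title in titles:
            authors_per_title[normalize_title(title)].add(author)
    return sorted(t for t, a in authors_per_title.items() if len(a) >= min_authors)


# Made-up sample data, not real Google Scholar results.
sample = {
    "Author A": ["Paper One", "Paper Two"],
    "Author B": ["paper one", "Paper Three"],
    "Author C": ["Paper  One", "Paper Two"],
    "Author D": ["Paper Four"],
}
shared = titles_shared_by(sample, min_authors=3)
print(shared)
```

Collecting authors into a set per title means a paper listed twice under one author still counts that author only once.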

Philip Ciunkiewicz · Accepted Answer

Expanding on my comment, you can achieve this using Pandas groupby:

import pandas as pd
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
frames = []

for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with authors
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())

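# joining all author DataFrames; ignore_index gives every row a unique label,
# so the bib columns below line up with the right rows
df = pd.concat(frames, axis=0, ignore_index=True)

# taking bib dict into separate columns, selected by key name
bib = pd.DataFrame(df.bib.to_list())
df[['title', 'cites', 'year']] = bib[['title', 'cites', 'year']]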

# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()
# locating the unique titles for all publications with n_authors >= 2
output = n_authors[n_authors >= 2].index

This finds 202 papers that have two or more of the authors in that list (out of 774 papers in total). To require three or more, as the question asks, change the threshold to n_authors >= 3. Here is an example of the output:

Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
       '722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
       'A Luminescent One-Dimensional Copper (I) Polymer',
       'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
       'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
       'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
       'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
       'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
       'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
       'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
       ...
       'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
       'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
       'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
       'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
       'Virginia review of sociology',
       'Wildlife use of log landings in the White Mountain National Forest',
       'Yttrium 1995',
       'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
       '[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
       'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
      dtype='object', name='title', length=202)

Since all of the data is in Pandas, you can also explore which authors are attached to each paper, as well as any of the other information available in the author.publications array returned by scholarly.
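
As a rough illustration of that kind of exploration, the sketch below lists the attached authors next to each title and then keeps the titles shared by at least three of them. The DataFrame here is made-up sample data standing in for the concatenated df from the code above, and the column names authors and n_authors are chosen for this example only.

```python
import pandas as pd

# Made-up rows standing in for the concatenated DataFrame (title + author columns).
df = pd.DataFrame(
    {
        "title": ["Paper One", "Paper One", "Paper One",
                  "Paper Two", "Paper Two", "Paper Three"],
        "author": ["Author A", "Author B", "Author C",
                   "Author A", "Author C", "Author D"],
    }
)

# For each title, join the distinct author names and count them.
summary = df.groupby("title")["author"].agg(
    authors=lambda s: ", ".join(sorted(s.unique())),
    n_authors="nunique",
)

# Keep only titles shared by at least three of the authors.
shared = summary[summary["n_authors"] >= 3]
print(shared)
```

The same summary table can be filtered at any threshold, so switching between two-or-more and three-or-more authors only changes the final comparison.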
