Vijay Selvaraju
Vijay Selvaraju

Reputation: 1

Combining split cells in Openrefine

I've been trying to build an excel sheet about all the papers published by staffs and students of my university. I used Scopus API to retrieve all the information like Author, Title and publish dates and it worked perfectly.

Since the retrieved data was a JSON File I had to convert it to Excel, So I chose OpenRefine and when I converted the file it created multiple rows if the paper had more than one writers

For example Like Sample Scopus

And My JSON response looks like

{
abstracts-retrieval-response: {
coredata: {
citedby-count: 0,
prism:volume: 430-431,
prism:pageRange: 240-246,
prism:coverDate: 2018-03-01,
dc:title: Solving the 3-COL problem by using tissue P systems without environment and proteins on cells,
prism:aggregationType: Journal,
prism:doi: 10.1016/j.ins.2017.11.022,
prism:publicationName: Information Sciences
},
authors: {
author: [
{
ce:given-name: Daniel,
preferred-name: {
ce:given-name: Daniel,
ce:initials: D.,
ce:surname: Díaz-Pernil,
ce:indexed-name: Díaz-Pernil D.
},
@seq: 1,
ce:initials: D.,
@_fa: true,
affiliation: {
@id: 60033284,
@href: http://api.elsevier.com/content/affiliation/affiliation_id/60033284
},
ce:surname: Díaz-Pernil,
@auid: 16645195100,
author-url: http://api.elsevier.com/content/author/author_id/16645195100,
ce:indexed-name: Diaz-Pernil D.
},
{
ce:given-name: Hepzibah A.,
preferred-name: {
ce:given-name: Hepzibah A.,
ce:initials: H.A.,
ce:surname: Christinal,
ce:indexed-name: Christinal H.
},
@seq: 2,
ce:initials: H.A.,
@_fa: true,
affiliation: {
@id: 60100082,
@href: http://api.elsevier.com/content/affiliation/affiliation_id/60100082
},
ce:surname: Christinal,
@auid: 57197875639,
author-url: http://api.elsevier.com/content/author/author_id/57197875639,
ce:indexed-name: Christinal H.A.
},
{
ce:given-name: Miguel A.,
preferred-name: {
ce:given-name: Miguel A.,
ce:initials: M.A.,
ce:surname: Gutiérrez-Naranjo,
ce:indexed-name: Gutiérrez-Naranjo M.
},
@seq: 3,
ce:initials: M.A.,
@_fa: true,
affiliation: {
@id: 60033284,
@href: http://api.elsevier.com/content/affiliation/affiliation_id/60033284
},
ce:surname: Gutiérrez-Naranjo,
@auid: 6506630834,
author-url: http://api.elsevier.com/content/author/author_id/6506630834,
ce:indexed-name: Gutierrez-Naranjo M.A.
}
]
}
}
}

So how do I combine all the authors into a single cell according to the Title?

Upvotes: 0

Views: 522

Answers (1)

Owen Stephens
Owen Stephens

Reputation: 1560

After importing the JSON into OpenRefine, you need to organise the project into Records. See http://kb.refinepro.com/2012/03/difference-between-record-and-row.html for an explanation of the difference between 'rows' and 'records' in OpenRefine.

To get the project into records you need to move a column containing information that will only appear once in each record (e.g. the title column - which maybe labelled something like "_ - abstracts-retrieval-response - coredata - dc:title" based on the JSON you've pasted here) to the start of the project. See http://kb.refinepro.com/2012/06/create-records-in-google-refine.html for more information on creating records in OpenRefine.

Once you have done this, switch to the 'records' view (click the 'records' link towards the top left of the data table) and then do as @Ettore-Rizza mentions in his comment - pick the column containing the names you want to use (e.g. "_ - abstracts-retrieval-response - authors - author - _ - ce:indexed-name" column) and use the Edit cells -> Join Multi-valued Cells option from the drop down menu at the top of the column.

Because each author related to the article is described in the JSON with multiple fields including various name forms plus a URL) you'll need to either remove the other columns containing author info, or merge the multiple values into a single field using the 'Join Multi-value cells option on all the affected columns (unless you need to retain this information, it is much easier to remove the unwanted columns)

Once this is done, and assuming there are no other fields which have repeated data in the record, you should have a single row per title.

Upvotes: 2

Related Questions