nad

Reputation: 2850

Series of JSON objects to dataframe conversion

I have downloaded a sample dataset from here that is a series of JSON objects, one per line. According to the website, each JSON object looks like this:

{
  "id": "4cd223df721b722b1c40689caa52932a41fcc223",
  "title": "Knowledge-rich, computer-assisted composition of Chinese couplets",
  "paperAbstract": "Recent research effort in poem composition has focused on the use of automatic language generation...",
  "entities": [
    "Conformance testing",
    "Natural language generation",
    "Natural language processing",
    "Parallel computing",
    "Stochastic grammar",
    "Web application"
  ],
  "s2Url": "https://semanticscholar.org/paper/4cd223df721b722b1c40689caa52932a41fcc223",
  "s2PdfUrl": "",
  "pdfUrls": [
    "https://doi.org/10.1093/llc/fqu052"
  ],
  "authors": [
    {
      "name": "John Lee",
      "ids": [
        "3362353"
      ]
    },
    "..."
  ],
  "inCitations": [
    "c789e333fdbb963883a0b5c96c648bf36b8cd242"
  ],
  "outCitations": [
    "abe213ed63c426a089bdf4329597137751dbb3a0",
    "..."
  ],
  "year": 2016,
  "venue": "DSH",
  "journalName": "DSH",
  "journalVolume": "31",
  "journalPages": "152-163",
  "sources": [
    "DBLP"
  ],
  "doi": "10.1093/llc/fqu052",
  "doiUrl": "https://doi.org/10.1093/llc/fqu052",
  "pmid": ""
}

Eventually I need to work with the paperAbstract field only. I am loading the file into a pandas dataframe like this:

import pandas as pd

filename = "sample-S2-records"
df = pd.read_json(filename, lines=True)
df.head()

This shows the doi and doiUrl columns as entirely empty.

Also, if I select only the abstract column and check its head, 2 of the 5 rows are empty:

abstract = df['paperAbstract']
abstract.head()

0                                                     
1    The search for new administrators in complex s...
2    The human N-formyl peptide receptor (FPR) is a...
3    Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
4                                                     
Name: paperAbstract, dtype: object

It looks like the way I created the dataframe is not the right approach. I am fairly confident the records don't have any missing fields.

What am I missing? Any suggestion?

Upvotes: 0

Views: 53

Answers (1)

Liudvikas Akelis

Reputation: 1323

I looked into a sample of your data, and I think you're getting correct results. If we were to parse the JSON by hand:

import json
filename = "sample-S2-records"
with open(filename, 'r') as f:
    d = [json.loads(x) for x in f]

If we then inspect the list of dictionaries, here's what we see:

>>> d[0]['paperAbstract']
''

So it really looks like the first line's paperAbstract field is an empty string in the source file itself, not something pandas dropped.
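
If you want to see how many records are affected and work only with the non-empty abstracts, something along these lines might help. It is only a sketch: the three inline records are a made-up stand-in for the real file, and the names empty_mask, n_empty and abstracts are mine, not anything from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-record stand-in for the real file (one JSON object per line).
sample = io.StringIO(
    '{"id": "a", "paperAbstract": "", "doi": ""}\n'
    '{"id": "b", "paperAbstract": "Some abstract text.", "doi": "10.1000/x"}\n'
    '{"id": "c", "paperAbstract": "", "doi": ""}\n'
)
df = pd.read_json(sample, lines=True)

# The blank fields are empty strings, so compare against "" after stripping whitespace.
empty_mask = df["paperAbstract"].str.strip() == ""
n_empty = int(empty_mask.sum())

# Keep only the rows that actually carry an abstract.
abstracts = df.loc[~empty_mask, "paperAbstract"].reset_index(drop=True)
```

With the real data you would pass the filename to pd.read_json instead of the StringIO object; the mask and the filtering stay the same.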

P.S.: I think the question should be closed; I doubt it's going to help anyone else.

Upvotes: 1
