Reputation: 2850
I have downloaded a sample dataset from here that consists of a series of JSON objects, one per line. According to the website, each JSON object looks like this:
{
  "id": "4cd223df721b722b1c40689caa52932a41fcc223",
  "title": "Knowledge-rich, computer-assisted composition of Chinese couplets",
  "paperAbstract": "Recent research effort in poem composition has focused on the use of automatic language generation...",
  "entities": [
    "Conformance testing",
    "Natural language generation",
    "Natural language processing",
    "Parallel computing",
    "Stochastic grammar",
    "Web application"
  ],
  "s2Url": "https://semanticscholar.org/paper/4cd223df721b722b1c40689caa52932a41fcc223",
  "s2PdfUrl": "",
  "pdfUrls": [
    "https://doi.org/10.1093/llc/fqu052"
  ],
  "authors": [
    {
      "name": "John Lee",
      "ids": [
        "3362353"
      ]
    },
    "..."
  ],
  "inCitations": [
    "c789e333fdbb963883a0b5c96c648bf36b8cd242"
  ],
  "outCitations": [
    "abe213ed63c426a089bdf4329597137751dbb3a0",
    "..."
  ],
  "year": 2016,
  "venue": "DSH",
  "journalName": "DSH",
  "journalVolume": "31",
  "journalPages": "152-163",
  "sources": [
    "DBLP"
  ],
  "doi": "10.1093/llc/fqu052",
  "doiUrl": "https://doi.org/10.1093/llc/fqu052",
  "pmid": ""
}
Eventually I need to work with the paperAbstract field only. I am loading the file into a pandas DataFrame like this:
import pandas as pd

filename = "sample-S2-records"
df = pd.read_json(filename, lines=True)
df.head()
This shows the doi and doiUrl columns as empty in every row.
Also, if I select only the paperAbstract column and check its head, 2 of the 5 rows are empty:
abstract = df['paperAbstract']
abstract.head()
0
1 The search for new administrators in complex s...
2 The human N-formyl peptide receptor (FPR) is a...
3 Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
4
Name: paperAbstract, dtype: object
It looks like the way I created the DataFrame is not the right approach. I am pretty confident the data doesn't have any missing columns.
What am I missing? Any suggestion?
Upvotes: 0
Views: 53
Reputation: 1323
I looked into a sample of your data, and I think you're getting correct results. If we were to parse the JSON by hand:
import json
filename = "sample-S2-records"
with open(filename, 'r') as f:
    # One JSON object per line, so parse each line separately
    d = [json.loads(x) for x in f]
If we then inspect the list of dictionaries, here's what we see:
>>> d[0]['paperAbstract']
''
So it really looks like the first line's paperAbstract field is empty in the file itself: the key is present, but its value is an empty string.
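If you want to confirm how many records are affected, or skip the blank ones, something along these lines might help. This is only a sketch: the three records below are hypothetical stand-ins for the real file (the ids and the abstract text are made up), and it assumes every record is a single JSON object on its own line.

```python
import io
import json

# Hypothetical stand-in for the real file: three JSON Lines records,
# two of which carry an empty-string abstract.
sample = io.StringIO(
    '{"id": "a", "paperAbstract": ""}\n'
    '{"id": "b", "paperAbstract": "placeholder abstract text"}\n'
    '{"id": "c", "paperAbstract": ""}\n'
)

# Parse one JSON object per non-blank line.
records = [json.loads(line) for line in sample if line.strip()]

# An empty string is a present-but-blank value, not a missing key.
empty_ids = [r["id"] for r in records if r.get("paperAbstract", "") == ""]

# Keep only the records that actually have abstract text.
non_empty = [r for r in records if r.get("paperAbstract")]

print(len(empty_ids), "of", len(records), "records have an empty abstract")
print([r["id"] for r in non_empty])
```

On the DataFrame from the question, the same count should be available as (df['paperAbstract'] == '').sum(), assuming the blanks were loaded as empty strings rather than NaN.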
P.S.: I think the question should be closed; I doubt it's going to help anyone else.
Upvotes: 1