Reputation: 1460
Here I have some data in JSON format. I want to pull particular values out as columns, together with their respective values.
DATA:
{
"552783667052167168": {
"552783667052167168": {
"contributors": null,
"truncated": false,
"text": "France: 10 people dead after shooting at HQ of satirical weekly newspaper #CharlieHebdo, according to witnesses ",
"in_reply_to_status_id": null,
"id": 552783667052167168,
}
"552785374507175936": {
"contributors": null,
"truncated": false,
"text": "MT @euronews France: 10 dead after shooting at HQ of satirical weekly #CharlieHebdo. If Zionists/Jews did this they'd be nuking Israel",
"in_reply_to_status_id": 552783667052167168,
"id": 552785374507175936,
}
"552786226546495488": {
"contributors": null,
"truncated": false,
"text": "@j0nathandavis They who? Stupid and partial opinions like this one only add noise to any debate.",
"in_reply_to_status_id": 552785374507175936,
"id": 552786226546495488
}
}
"552791196247269378": {
"552791196247269378": {
"contributors": null,
"truncated": false,
"text": "BREAKING: At least 10 killed in shooting at French satirical newspaper Charlie Hebdo, Paris prosecutor's office says. ,
"in_reply_to_status_id": null,
"id": 552791196247269378
}
"552791516360765440": {
"contributors": null,
"truncated": false,
"text": "@cnni 11 Killed now",
"in_reply_to_status_id": 552791196247269378,
"id": 552791516360765440
}
"552791567401238529": {
"contributors": null,
"truncated": false,
"text": "@cnni 11 died",
"in_reply_to_status_id": 552791196247269378,
"id": 552791567401238529
}
}
I want the main ID and its text as columns.
One complication: the first ID, 552783667052167168, also has a text of its own. If you look at the format,
{
"552783667052167168": {
"552783667052167168": {
the same ID appears at both levels. So this will be my main (parent) ID, with its text as the parent text, and two more columns hold the children.
output:
ParentID            parentText                          ChildID             childText
552783667052167168  "France: 10 people dead ..."        552785374507175936  "MT @euronews France: 10 dead after ..."
552783667052167168  "France: 10 people dead ..."        552786226546495488  "@j0nathandavis They who? ..."
552791196247269378  "BREAKING: At least 10 killed ..."  552791516360765440  "@cnni 11 Killed now"
552791196247269378  "BREAKING: At least 10 killed ..."  552791567401238529  "@cnni 11 died"
Here, "in_reply_to_status_id" is null when the tweet is a parent. I guess we can use this as a rule.
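To make that rule concrete, here is a minimal sketch. It assumes the data has been parsed with `json.loads`, so a JSON null arrives as Python `None`. The sample thread is shortened and made up, and `find_parent` is a hypothetical helper name.

```python
import json

# Tiny hypothetical thread in the same shape as the data above (texts shortened).
raw = """{
  "1": {
    "1": {"text": "parent tweet", "in_reply_to_status_id": null, "id": 1},
    "2": {"text": "first reply", "in_reply_to_status_id": 1, "id": 2},
    "3": {"text": "second reply", "in_reply_to_status_id": 2, "id": 3}
  }
}"""

threads = json.loads(raw)  # JSON null becomes Python None


def find_parent(tweets):
    """Return (tweet_id, tweet) for the entry whose reply field is None."""
    for tweet_id, tweet in tweets.items():
        if tweet["in_reply_to_status_id"] is None:
            return tweet_id, tweet
    raise ValueError("no parent tweet found in thread")


for thread_id, tweets in threads.items():
    parent_id, parent = find_parent(tweets)
    print(parent_id, parent["text"])
```

In this layout the parent found this way should be the same tweet as the outer key, so either check could be used to pick it out.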
Edit one:
I was able to code it up to this point, but the source tweet's text is still being printed.
for sourceTweet, tweets in dataTrain.items():
    #print(sourceTweet)
    for tweet, tweetContent in tweets.items():
        #print(tweet)
        for iTweet, iTweetContent in tweets.items():
            #print(iTweet)
            if (sourceTweet==iTweet):
                sourceTweetContent = iTweetContent
                sourceTweetText = iTweetContent["text"]
                break
        for jTweet, jTweetContent in tweets.items():
            #print(jTweet)
            if (tweetContent["in_reply_to_status_id"]==jTweet):
                replyToTweetContent = jTweetContent
                replyToTweetText = jTweetContent["text"]
                print(replyToTweetText)
                break
Upvotes: 0
Views: 102
Reputation: 42906
This works. It may not be the most elegant way, but it's a solution. Hope it helps:
import numpy as np
import pandas as pd

# `json` is the nested dict holding the data (see the EDIT at the end of this answer);
# note that this name shadows the standard-library json module
# get the parent keys
parentkeys = list(json.keys())
# create lists to fill for columns later
parentids = []
childids = []
contributors = []
truncated = []
text = []
in_reply_to_status_id = []
ids = []
# get the data out of the json
for parentkey in parentkeys:
    for child in json[parentkey]:
        parentids.append(parentkey)
        childids.append(child)
        contributors.append(json[parentkey][child]['contributors'])
        truncated.append(json[parentkey][child]['truncated'])
        text.append(json[parentkey][child]['text'])
        in_reply_to_status_id.append(json[parentkey][child]['in_reply_to_status_id'])
        ids.append(json[parentkey][child]['id'])
# create the dataframe out of the lists
df = pd.DataFrame({'ParentID': parentids,
                   'ChildID': childids,
                   'contributors': contributors,
                   'truncated': truncated,
                   'text': text,
                   'in_reply_to_status_id': in_reply_to_status_id,
                   'id': ids})
So now we have to transform the dataframe into the format you asked for:
# copy the text as parent text when the row is a parent (its in_reply_to_status_id is 'null')
df['parentText'] = np.where(df.in_reply_to_status_id == 'null', df.text, None)
# forward fill the parent text down the rows until the next parent row
df['parentText'] = df['parentText'].ffill()
# drop the rows where the parent ID and the child ID are the same
df = df[df.ParentID != df.ChildID]
# rename the column to the name that was asked for
df = df.rename(columns={'text': 'childText'})
# select the 4 columns which are needed
df = df[['ParentID', 'parentText', 'ChildID', 'childText']]
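The mark-then-forward-fill step is the least obvious part, so here it is in isolation on a toy frame. This is only a sketch: the three rows are made up, and it assumes the parent row always comes before its replies.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: one parent followed by two replies.
toy = pd.DataFrame({
    "text": ["parent", "reply a", "reply b"],
    "in_reply_to_status_id": ["null", 1, 1],
})

# Parent rows get their own text; reply rows get None for now.
toy["parentText"] = np.where(toy["in_reply_to_status_id"] == "null", toy["text"], None)

# Forward fill copies the last seen parent text down into the reply rows.
toy["parentText"] = toy["parentText"].ffill()

print(toy["parentText"].tolist())
```

If a reply ever appeared before its parent, the forward fill would pick up the previous thread's text, so the row order matters here.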
Output
ParentID parentText ChildID childText
1 552783667052167168 France: 10 people dead after shooting at HQ of... 552785374507175936 MT @euronews France: 10 dead after shooting at...
2 552783667052167168 France: 10 people dead after shooting at HQ of... 552786226546495488 @j0nathandavis They who? Stupid and partial op...
4 552791196247269378 BREAKING: At least 10 killed in shooting at Fr... 552791516360765440 @cnni 11 Killed now
5 552791196247269378 BREAKING: At least 10 killed in shooting at Fr... 552791567401238529 @cnni 11 died
EDIT
Your JSON gave errors in my console. I cleaned it up for you (as a Python dict, with null and false written as strings); please use this to test:
json = {
"552783667052167168": {
"552783667052167168": {
"contributors": "null",
"truncated": "false",
"text": "France: 10 people dead after shooting at HQ of satirical weekly newspaper #CharlieHebdo, according to witnesses",
"in_reply_to_status_id": "null",
"id": 552783667052167168
},
"552785374507175936": {
"contributors": "null",
"truncated": "false",
"text": "MT @euronews France: 10 dead after shooting at HQ of satirical weekly #CharlieHebdo. If Zionists/Jews did this they'd be nuking Israel",
"in_reply_to_status_id": 552783667052167168,
"id": 552785374507175936
},
"552786226546495488": {
"contributors": "null",
"truncated": "false",
"text": "@j0nathandavis They who? Stupid and partial opinions like this one only add noise to any debate.",
"in_reply_to_status_id": 552785374507175936,
"id": 552786226546495488
}
},
"552791196247269378": {
"552791196247269378": {
"contributors": "null",
"truncated": "false",
"text": "BREAKING: At least 10 killed in shooting at French satirical newspaper Charlie Hebdo, Paris prosecutor's office says.",
"in_reply_to_status_id": "null",
"id": 552791196247269378
},
"552791516360765440": {
"contributors": "null",
"truncated": "false",
"text": "@cnni 11 Killed now",
"in_reply_to_status_id": 552791196247269378,
"id": 552791516360765440
},
"552791567401238529": {
"contributors": "null",
"truncated": "false",
"text": "@cnni 11 died",
"in_reply_to_status_id": 552791196247269378,
"id": 552791567401238529
}
}
}
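One caveat with the cleaned-up dict: the `'null'` comparison only matches because null is written as a string here. If the data is instead loaded from real JSON with `json.loads`, null becomes `None`, and a check with `isna()` would be the closer fit. A minimal sketch, with a made-up two-tweet sample and hypothetical variable names:

```python
import json
import pandas as pd

# Hypothetical, shortened sample: valid JSON with a real null.
raw = """{
  "1": {
    "1": {"text": "parent", "in_reply_to_status_id": null},
    "2": {"text": "reply", "in_reply_to_status_id": 1}
  }
}"""
threads = json.loads(raw)

# One row per tweet, tagged with the thread (parent) key and its own key.
rows = [
    {"ParentID": parent_id, "ChildID": child_id, **tweet}
    for parent_id, tweets in threads.items()
    for child_id, tweet in tweets.items()
]
frame = pd.DataFrame(rows)

# None (or NaN) marks a parent tweet; the string 'null' never appears.
is_parent = frame["in_reply_to_status_id"].isna()
print(is_parent.tolist())
```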
Upvotes: 1
Reputation: 3770
Try this!!
a = """{
"552783667052167168": {
"552783667052167168": {
"contributors": null,
"truncated": false,
"text": "France: 10 people dead after shooting at HQ of satirical weekly newspaper #CharlieHebdo, according to witnesses",
"in_reply_to_status_id": null,
"id": 552783667052167168
},
"552785374507175936": {
"contributors": null,
"truncated": false,
"text": "MT @euronews France: 10 dead after shooting at HQ of satirical weekly #CharlieHebdo. If Zionists/Jews did this they'd be nuking Israel",
"in_reply_to_status_id": 552783667052167168,
"id": 552785374507175936
},
"552786226546495488": {
"contributors": null,
"truncated": false,
"text": "@j0nathandavis They who? Stupid and partial opinions like this one only add noise to any debate.",
"in_reply_to_status_id": 552785374507175936,
"id": 552786226546495488
}
},
"552791196247269378": {
"552791196247269378": {
"contributors": null,
"truncated": false,
"text": "BREAKING: At least 10 killed in shooting at French satirical newspaper Charlie Hebdo, Paris prosecutor's office says." ,
"in_reply_to_status_id": null,
"id": 552791196247269378
},
"552791516360765440": {
"contributors": null,
"truncated": false,
"text": "@cnni 11 Killed now",
"in_reply_to_status_id": 552791196247269378,
"id": 552791516360765440
},
"552791567401238529": {
"contributors": null,
"truncated": false,
"text": "@cnni 11 died",
"in_reply_to_status_id": 552791196247269378,
"id": 552791567401238529
}
}
}"""
Code
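import json
import pandas as pd

data = json.loads(a)
df = pd.DataFrame(columns=['ParentId','parentText','ChildId','childText'])
pos = 0
for parent_id in data:
    # the parent tweet is stored under its own ID inside the thread
    parent_text = data[parent_id][parent_id]['text']
    for child_id in data[parent_id]:
        if child_id == parent_id:
            continue  # skip the parent itself
        # one row per reply: parent ID and text, then child ID and text
        df.loc[pos] = [parent_id, parent_text, child_id, data[parent_id][child_id]['text']]
        pos += 1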
Output
ParentId parentText \
0 552783667052167168 France: 10 people dead after shooting at HQ of...
1 552783667052167168 France: 10 people dead after shooting at HQ of...
2 552791196247269378 BREAKING: At least 10 killed in shooting at Fr...
3 552791196247269378 BREAKING: At least 10 killed in shooting at Fr...
ChildId childText
0 552785374507175936 MT @euronews France: 10 dead after shooting at...
1 552786226546495488 @j0nathandavis They who? Stupid and partial op...
2 552791516360765440 @cnni 11 Killed now
3 552791567401238529 @cnni 11 died
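As a variation, the rows can be collected in a plain list and turned into a DataFrame once, which avoids growing the frame row by row. This is a minimal sketch, assuming the input has the same nested shape as the parsed JSON above; `build_rows` and the shortened sample are hypothetical.

```python
import pandas as pd


def build_rows(data):
    """Pair every reply in a thread with the thread's source tweet."""
    rows = []
    for parent_id, tweets in data.items():
        parent_text = tweets[parent_id]["text"]
        for child_id, tweet in tweets.items():
            if child_id != parent_id:
                rows.append((parent_id, parent_text, child_id, tweet["text"]))
    return rows


# Shortened, hypothetical sample in the same shape as the data above.
sample = {
    "10": {
        "10": {"text": "source tweet"},
        "11": {"text": "first reply"},
        "12": {"text": "second reply"},
    }
}

result = pd.DataFrame(
    build_rows(sample),
    columns=["ParentId", "parentText", "ChildId", "childText"],
)
print(result)
```

Like the output shown above, this pairs each reply with the thread's source tweet, not with the tweet it directly replies to.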
Upvotes: 1