mOna
mOna

Reputation: 2459

Problem when using Pandas json_normalize to flatten the nested JSON

I have a JSON file with the following structure (below is the content of retweets[:2]):

[{'lang': 'en',
  'author_id': '1076979440372965377',
  'reply_settings': 'everyone',
  'entities': {'mentions': [{'start': 3,
     'end': 17,
     'username': 'Terry81987010',
     'url': '',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    {'start': 19,
     'end': 32,
     'username': 'DineshDSouza',
     'location': 'United States',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
      'description': {'urls': [{'start': 80,
         'end': 103,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
     'created_at': '2009-11-22T22:32:41.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 1748832,
      'following_count': 5355,
      'tweet_count': 65674,
      'listed_count': 6966},
     'name': "Dinesh D'Souza",
     'verified': True,
     'pinned_tweet_id': '1393309917239562241',
     'id': '91882544',
     'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
     'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}]},
  'conversation_id': '1253462541881106433',
  'created_at': '2020-04-23T23:15:32.000Z',
  'id': '1253462541881106433',
  'possibly_sensitive': False,
  'referenced_tweets': [{'type': 'retweeted',
    'id': '1253052684489437184',
    'in_reply_to_user_id': '91882544',
    'attachments': {'media_keys': ['3_1253052312144293888',
      '3_1253052620937277442'],
     'media': [{}, {}]},
    'entities': {'annotations': [{'start': 126,
       'end': 128,
       'probability': 0.514,
       'type': 'Organization',
       'normalized_text': 'CDC'},
      {'start': 145,
       'end': 146,
       'probability': 0.5139,
       'type': 'Place',
       'normalized_text': 'NY'}],
     'mentions': [{'start': 0,
       'end': 13,
       'username': 'DineshDSouza',
       'location': 'United States',
       'entities': {'url': {'urls': [{'start': 0,
           'end': 23,
           'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
           'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
        'description': {'urls': [{'start': 80,
           'end': 103,
           'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
           'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
       'created_at': '2009-11-22T22:32:41.000Z',
       'protected': False,
       'public_metrics': {'followers_count': 1748832,
        'following_count': 5355,
        'tweet_count': 65674,
        'listed_count': 6966},
       'name': "Dinesh D'Souza",
       'verified': True,
       'pinned_tweet_id': '1393309917239562241',
       'id': '91882544',
       'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe:  ",
       'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}],
     'urls': [{'start': 187,
       'end': 210,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
       'display_url': 'pic.twitter.com/H4NpN5ZMkW'},
      {'start': 187,
       'end': 210,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
       'display_url': 'pic.twitter.com/H4NpN5ZMkW'}]},
    'lang': 'en',
    'author_id': '1091471553437593605',
    'reply_settings': 'everyone',
    'conversation_id': '1253050942716551168',
    'created_at': '2020-04-22T20:06:55.000Z',
    'possibly_sensitive': False,
    'referenced_tweets': [{'type': 'replied_to', 'id': '1253050942716551168'}],
    'public_metrics': {'retweet_count': 208,
     'reply_count': 57,
     'like_count': 402,
     'quote_count': 38},
    'source': 'Twitter Web App',
    'text': "@DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according to the CDC. And of course NY getting paid for each cv death $60,000",
    'context_annotations': [{'domain': {'id': '10',
       'name': 'Person',
       'description': 'Named people in the world like Nelson Mandela'},
      'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
     {'domain': {'id': '35',
       'name': 'Politician',
       'description': 'Politicians in the world, like Joe Biden'},
      'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
    'author': {'url': '',
     'username': 'Terry81987010',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    'in_reply_to_user': {'username': 'DineshDSouza',
     'location': 'United States',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
      'description': {'urls': [{'start': 80,
         'end': 103,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
     'created_at': '2009-11-22T22:32:41.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 1748832,
      'following_count': 5355,
      'tweet_count': 65674,
      'listed_count': 6966},
     'name': "Dinesh D'Souza",
     'verified': True,
     'pinned_tweet_id': '1393309917239562241',
     'id': '91882544',
     'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe:  ",
     'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}}],
  'public_metrics': {'retweet_count': 208,
   'reply_count': 0,
   'like_count': 0,
   'quote_count': 0},
  'source': 'Twitter for iPhone',
  'text': "RT @Terry81987010: @DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according t…",
  'context_annotations': [{'domain': {'id': '10',
     'name': 'Person',
     'description': 'Named people in the world like Nelson Mandela'},
    'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
   {'domain': {'id': '35',
     'name': 'Politician',
     'description': 'Politicians in the world, like Joe Biden'},
    'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
  'author': {'url': '',
   'username': 'set1952',
   'location': 'Etats-Unis',
   'created_at': '2018-12-23T23:14:42.000Z',
   'protected': False,
   'public_metrics': {'followers_count': 103,
    'following_count': 44,
    'tweet_count': 44803,
    'listed_count': 0},
   'name': 'SunSet1952',
   'verified': False,
   'id': '1076979440372965377',
   'description': '',
   'profile_image_url': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'},
  '__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
   'version': '2.0.8',
   'retrieved_at': '2021-05-17T17:13:17+00:00'}},
 {'attachments': {'media_keys': ['3_1253366603061878792'],
   'media': [{'media_key': '3_1253366603061878792',
     'type': 'photo',
     'height': 1920,
     'width': 1080,
     'url': 'https://pbs.twimg.com/media/EWTaSOzXYAgBEqO.jpg'}]},
  'lang': 'en',
  'author_id': '157054921',
  'reply_settings': 'everyone',
  'entities': {'mentions': [{'start': 3,
     'end': 17,
     'username': 'Terry81987010',
     'url': '',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    {'start': 19, 'end': 31, 'username': 'mitchellvii'}],
   'urls': [{'start': 60,
     'end': 83,
     'expanded_url': 'https://twitter.com/Terry81987010/status/1253366701112070147/photo/1',
     'display_url': 'pic.twitter.com/mQLE5TZviE'}]},
  'conversation_id': '1253446311753453576',
  'created_at': '2020-04-23T22:11:03.000Z',
  'id': '1253446311753453576',
  'possibly_sensitive': False,
  'referenced_tweets': [{'type': 'retweeted',
    'id': '1253366701112070147',
    'in_reply_to_user_id': '17980523',
    'attachments': {'media_keys': ['3_1253366603061878792'],
     'media': [{'media_key': '3_1253366603061878792',
       'type': 'photo',
       'height': 1920,
       'width': 1080,
       'url': 'https://pbs.twimg.com/media/EWTaSOzXYAgBEqO.jpg'}]},
    'lang': 'en',
    'author_id': '1091471553437593605',
    'reply_settings': 'everyone',
    'entities': {'mentions': [{'start': 0,
       'end': 12,
       'username': 'mitchellvii'}],
     'urls': [{'start': 41,
       'end': 64,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253366701112070147/photo/1',
       'display_url': 'pic.twitter.com/mQLE5TZviE'}]},
    'conversation_id': '1253364124416978951',
    'created_at': '2020-04-23T16:54:42.000Z',
    'possibly_sensitive': False,
    'referenced_tweets': [{'type': 'replied_to', 'id': '1253364124416978951'}],
    'public_metrics': {'retweet_count': 4,
     'reply_count': 1,
     'like_count': 4,
     'quote_count': 0},
    'source': 'Twitter Web App',
    'text': '@mitchellvii Is it really a coincidence?',
    'context_annotations': [{'domain': {'id': '10',
       'name': 'Person',
       'description': 'Named people in the world like Nelson Mandela'},
      'entity': {'id': '1138048048972517376',
       'name': 'Bill Mitchell',
       'description': 'Professor'}}],
    'author': {'url': '',
     'username': 'Terry81987010',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    'in_reply_to_user': {}}],
  'public_metrics': {'retweet_count': 4,
   'reply_count': 0,
   'like_count': 0,
   'quote_count': 0},
  'source': 'Twitter for iPhone',
  'text': 'RT @Terry81987010: @mitchellvii Is it really a coincidence?',
  'context_annotations': [{'domain': {'id': '10',
     'name': 'Person',
     'description': 'Named people in the world like Nelson Mandela'},
    'entity': {'id': '1138048048972517376',
     'name': 'Bill Mitchell',
     'description': 'Professor'}}],
  'author': {'url': '',
   'username': 'BillyBoysDaddy',
   'entities': {'description': {'hashtags': [{'start': 93,
       'end': 98,
       'tag': 'MAGA'}]}},
   'created_at': '2010-06-18T18:29:28.000Z',
   'protected': False,
   'public_metrics': {'followers_count': 10510,
    'following_count': 9001,
    'tweet_count': 444455,
    'listed_count': 171},
   'name': 'Sonlight 🇺🇸',
   'verified': False,
   'id': '157054921',
   'description': 'The way that it is ... is ... the way that it is ...... and it could not be otherwise ...... #MAGA',
   'profile_image_url': 'https://pbs.twimg.com/profile_images/3464880142/bb34db1c7344eca5bd8ce7196ff0e003_normal.jpeg'},
  '__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
   'version': '2.0.8',
   'retrieved_at': '2021-05-17T17:13:17+00:00'}}]

What I need: I have to import the JSON file into the Pandas dataframe in a way that there would be a separate column for each field (i.e., completely flatten the nested lists). For example, instead of just one column for referenced_tweets which includes type, id, etc., there should be separate columns for each of them.

Here is my code:

retweets = []
for line in open('Data/usersRetweetsFlatten_sample.json', 'r'):
    retweets.append(json.loads(line))

df_retweets = json_normalize(
    retweets, 
    record_path=['referenced_tweets'], 
    meta_prefix=".", 
    meta= ["text", "created_at", "author_id", "conversation_id", "id"], 
    errors='ignore'
)

df_retweets[['author.username', 'type', 'id', '.id', 'referenced_tweets']].head()

PROBLEM: When I run the last line of code (df_retweets.head()), I don't know why the values of id and type columns don't match the corresponding values of type and id in the referenced_tweets column. For example, the value of the column type is always retweeted while the type in the referenced_tweets column is replied_to!!!! Please see the attached screenshot of the resulting dataframe.

enter image description here

Upvotes: 0

Views: 213

Answers (1)

not_speshal
not_speshal

Reputation: 23146

The output is as expected and in line with the structure of your json, which is quite convoluted (multiple levels).

The "type" column is deriving it's values from the one level.

>>> retweets[0]["referenced_tweets"][0]["type"]
"retweeted"

Whilst the "referenced_tweets" column is coming from a more nested (i.e., deeper) level:

>>> retweets[0]["referenced_tweets"][0]["referenced_tweets"][0]["type"]
'replied_to'

Upvotes: 1

Related Questions