farheen
farheen

Reputation: 111

Convert JSON to CSV with pandas

I have a JSON file which contains 46k+ tweets in english and other languages as well which I want to save as csv file. Below is a part of json file.

    [{"user_id": 938118866135343104, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…", "tweet_id": 976166125502427136}
{"user_id": 959235642, "date_time": "03/20/2018 18:38:35", "tweet_content": "At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. \n\n#PZvQG \n#ABC", "tweet_id": 976166125535973378}
{"user_id": 395163528, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali\nWatch ball by ball highlights at (link removed)\n\n#PZvQG #HBLPSL #PSL2018 @_crici…", "tweet_id": 976166126202839040}
{"user_id": 3117825702, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.", "tweet_id": 976166126483902466}
{"user_id": 3310967346, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…", "tweet_id": 976166126559354880}
{"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35", "tweet_content": "I wanted a super over😭\n#PZvQG", "tweet_id": 976166126836178944}
{"user_id": 347132028, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG", "tweet_id": 976166126685171713}
{"user_id": 3461853618, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @walterMiitty: It's harder than I thought to tell the truth\nIt's gonna leave you in pieces\nAll alone with your demons\nAnd I know that we…", "tweet_id": 976166126924201986}]

I followed this solution to convert it into CSV but got invalid syntax error on an urdu tweet. I also tried this:

    import json
with open("PeshVsQuetta.json") as f:
all_tweets = []
for line in f:
    text_dict = json.loads(line)
    all_tweets.append(text_dict)

print(all_tweets[0]['tweet_content'])

which gives me following error.

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 148: character maps to <undefined>

I even saved json file as txt file and tried this:

    import pandas as pd
    from ast import literal_eval
    columns = ['Tweet ID','Author ID','Tweet','Time']
    df1 = pd.DataFrame(columns = columns)
    f = open('PeshvsQuetta.txt',encoding = 'utf-8')
    counter = 1
    for line in f:
         if(counter != 1):
             s1 = literal_eval(line)
             ser = pd.Series([s1['tweet_id'],s1['user_id'],s1['tweet_content'],s1["date_time"]],index=['Tweet ID','Author ID','Tweet','Time'])
             df1 = df1.append(ser,ignore_index=True)
    counter = counter + 1
    df1.to_csv('PeshVsQuetta1.csv', encoding='utf-8',index=False,columns = columns)

But the resulting csv file has each series saved in one cell and it has lots of empty rows and some tweets are saved over multiple rows. Below is the image.

enter image description here Any help would be truly appreciated.

Upvotes: 0

Views: 39080

Answers (1)

Martin Evans
Martin Evans

Reputation: 46759

You should just be able to use Pandas as follows:

import pandas as pd

with open('PeshVsQuetta.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input)

df.to_csv('PeshVsQuetta.csv', encoding='utf-8', index=False)

This assumes that your JSON file contains a BOM at the start. For the data you have given above, this produces the following CSV file:

date_time,tweet_content,tweet_id,user_id
2018-03-20 18:38:35,RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…,976166125502427136,938118866135343104
2018-03-20 18:38:35,"At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. 

#PZvQG 
#ABC",976166125535973378,959235642
2018-03-20 18:38:35,"RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali
Watch ball by ball highlights at (link removed)

#PZvQG #HBLPSL #PSL2018 @_crici…",976166126202839040,395163528
2018-03-20 18:38:35,RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.,976166126483902466,3117825702
2018-03-20 18:38:35,RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…,976166126559354880,3310967346
2018-03-20 18:38:35,"I wanted a super over😭
#PZvQG",976166126836178944,701494826194354179
2018-03-20 18:38:35,RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG,976166126685171713,347132028
2018-03-20 18:38:35,"RT @walterMiitty: It's harder than I thought to tell the truth
It's gonna leave you in pieces
All alone with your demons
And I know that we…",976166126924201986,3461853618

Note: some of your fields contain newlines, so the output might look a bit odd. An application reading this though will handle it correctly (as long as you tell it the encoding is UTF-8 when importing it)

Upvotes: 6

Related Questions