broepke
broepke

Reputation: 71

How to Read a Text File of Dictionaries into a DataFrame

I have a text file from Kaggle of Clash Royale stats. It's in a format of Python Dictionaries. I am struggling to find out how to read that into a file in a meaningful way. Curious what the best way is to do this. It's a fairly complex Dict with Lists.

Original Dataset here: https://www.kaggle.com/s1m0n38/clash-royale-matches-dataset

{'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '@LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}

Upvotes: 2

Views: 2926

Answers (3)

Trenton McKinney
Trenton McKinney

Reputation: 62403

  • The other answers only work with the toy data, as presented in the OP. This answer deals with the actual file from Kaggle, and how to clean it.
  • The Kaggle file, matches.txt, is rows of nested dicts
    • Within the file, each row has 4 top level keys, ['players', 'type', 'result', 'time']
  • Read the file in, which will make each row a str type
  • Convert it from str to dict type with ast.literal_eval
    • Some of the rows are not correctly formatted, and will result in a SyntaxError
  • The data can be converted to a dataframe with pandas.json_normalize

Imports

import pandas as pd
from ast import literal_eval

Clean the File

# store the data
data = list()

# store the broken rows
broken_row = list()

# read in the file
with open('matches.txt', 'r', encoding='utf-8') as f:  
    
    # read the rows
    rows = f.readlines()
    for row in rows:
        
        # try to convert a row from a string to dict
        try:
            row = literal_eval(row)
            data.append(row)
        except SyntaxError:
            broken_row.append(row)
            continue

Convert data to a long DataFrame

  • For each match, each 'players.right.deck', 'players.left.deck' gets a separate row.
# convert data to a dataframe
players = pd.json_normalize(data)

# add a unique id for each row, which can be used to identify players for a particular game
df['id'] = df.index

# split the list of lists in right.deck and left.deck to separate rows
players = df[['id', 'players.right.deck', 'players.left.deck']].apply(pd.Series.explode).reset_index(drop=True)

# drop the original columns
df.drop(columns=['players.right.deck', 'players.left.deck'], inplace=True)

# right.deck and left.deck are still a list with two values, which need to have separate columns
players[['right.deck.name', 'right.deck.number']] = pd.DataFrame(players.pop('players.right.deck').values.tolist())
players[['left.deck.name', 'left.deck.number']] = pd.DataFrame(players.pop('players.left.deck').values.tolist())

# separate the result column into two columns
df[['right.result', 'left.result']] = pd.DataFrame(df.pop('result').values.tolist())

# merge df with players
df = df.merge(players, on='id')

df.head(8)

     type        time players.right.trophy players.right.clan players.right.name players.left.trophy players.left.clan players.left.name  id right.result left.result right.deck.name right.deck.number left.deck.name left.deck.number
0  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0     Mega Minion                 9       Fireball                9
1  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0  Electro Wizard                 3        Archers               12
2  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0          Arrows                11        Goblins               12
3  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0       Lightning                 5        Minions               11
4  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0       Tombstone                 9         Bomber               12
5  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0         The Log                 2        The Log                2
6  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0           Giant                 9     Barbarians               12
7  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0          Bowler                 5    Royal Giant               13

Convert data to a wide DataFrame

  • This option uses the flatten_json function.
  • For each match, each 'players.right.deck', 'players.left.deck' gets a separate column.
# convert data to a wide dataframe
df = pd.DataFrame([flatten_json(x) for x in data])

# display(df.head(3))
  players_right_deck_0_0 players_right_deck_0_1 players_right_deck_1_0 players_right_deck_1_1 players_right_deck_2_0 players_right_deck_2_1 players_right_deck_3_0 players_right_deck_3_1 players_right_deck_4_0 players_right_deck_4_1 players_right_deck_5_0 players_right_deck_5_1 players_right_deck_6_0 players_right_deck_6_1 players_right_deck_7_0 players_right_deck_7_1 players_right_trophy players_right_clan players_right_name players_left_deck_0_0 players_left_deck_0_1 players_left_deck_1_0 players_left_deck_1_1 players_left_deck_2_0 players_left_deck_2_1 players_left_deck_3_0 players_left_deck_3_1 players_left_deck_4_0 players_left_deck_4_1 players_left_deck_5_0 players_left_deck_5_1 players_left_deck_6_0 players_left_deck_6_1 players_left_deck_7_0 players_left_deck_7_1 players_left_trophy players_left_clan players_left_name    type result_0 result_1        time
0            Mega Minion                      9         Electro Wizard                      3                 Arrows                     11              Lightning                      5              Tombstone                      9                The Log                      2                  Giant                      9                 Bowler                      5                 4258         TwoFiveOne           gpa raid              Fireball                     9               Archers                    12               Goblins                    12               Minions                    11                Bomber                    12               The Log                     2            Barbarians                    12           Royal Giant                    13                4325          battusai             Supr4  ladder        2        0  2017-07-12
1             Ice Spirit                     10               Valkyrie                      9              Hog Rider                      9          Inferno Tower                      9                Goblins                     12              Musketeer                      9                    Zap                     12               Fireball                      9                 4237         The Wolves              TITAN           Royal Giant                    13            Ice Wizard                     2                Bomber                    12                Knight                    12              Fireball                     9            Barbarians                    12               The Log                     2               Archers                    12                4296          battusai             Supr4  ladder        1        0  2017-07-12
2                  Miner                      3              Ice Golem                      9          Spear Goblins                     12           Minion Horde                     12          Inferno Tower                      8                The Log                      2          Skeleton Army                      6               Fireball                     10                 4300    @LA PERLA NEGRA             Victor           Royal Giant                    13            Ice Wizard                     2                Bomber                    12                Knight                    12              Fireball                     9            Barbarians                    12               The Log                     2               Archers                    12                4267          battusai             Supr4  ladder        0        1  2017-07-12

Upvotes: 0

Jab
Jab

Reputation: 27495

I saved your data to .json files, then just needed to loop through each line and treat it as it's own JSON file, then I used pandas.json_normalize to load it into a DataFrame and I made some guesses at how you wanted the df to look but I came up with this:

note: proper JSON needs to have double quotes not single so I used replace to work around this. Be careful that no data inside is destroyed using this.

note: The way I got this to work, I had to merge 'right' and 'left' so you are losing this data. If this is needed you could use a dict comp as a workaround

import json
import pandas as pd

with open('cr.json', 'r') as f:
    df = None
    for line in f:
        data = json.loads(line.replace("'", '"'))
        #needed to put the right and left keys together, maybe you can find a way around this, I wasn't
        df1 = pd.json_normalize([data['players']['right'], data['players']['left']],
                     'deck',
                     ['name', 'trophy', 'clan'],
                     meta_prefix='player.',
                     errors='ignore')
        df = pd.concat([df, df1])
    df.rename(columns={0: 'player.troop.name', 1: 'player.troop.level'}, 
              inplace=True)
    print(df)

This prints:

   player.troop.name player.troop.level player.name      player.clan  \
0        Mega Minion                  9    gpa raid       TwoFiveOne   
1     Electro Wizard                  3    gpa raid       TwoFiveOne   
2             Arrows                 11    gpa raid       TwoFiveOne   
3          Lightning                  5    gpa raid       TwoFiveOne   
4          Tombstone                  9    gpa raid       TwoFiveOne   
5            The Log                  2    gpa raid       TwoFiveOne   
6              Giant                  9    gpa raid       TwoFiveOne   
7             Bowler                  5    gpa raid       TwoFiveOne   
8           Fireball                  9       Supr4         battusai   
9            Archers                 12       Supr4         battusai   
10           Goblins                 12       Supr4         battusai   
11           Minions                 11       Supr4         battusai   
12            Bomber                 12       Supr4         battusai   
13           The Log                  2       Supr4         battusai   
14        Barbarians                 12       Supr4         battusai   
15       Royal Giant                 13       Supr4         battusai   
0         Ice Spirit                 10       TITAN       The Wolves   
1           Valkyrie                  9       TITAN       The Wolves   
2          Hog Rider                  9       TITAN       The Wolves   
3      Inferno Tower                  9       TITAN       The Wolves   
4            Goblins                 12       TITAN       The Wolves   
5          Musketeer                  9       TITAN       The Wolves   
6                Zap                 12       TITAN       The Wolves   
7           Fireball                  9       TITAN       The Wolves   
8        Royal Giant                 13       Supr4         battusai   
9         Ice Wizard                  2       Supr4         battusai   
10            Bomber                 12       Supr4         battusai   
11            Knight                 12       Supr4         battusai   
12          Fireball                  9       Supr4         battusai   
13        Barbarians                 12       Supr4         battusai   
14           The Log                  2       Supr4         battusai   
15           Archers                 12       Supr4         battusai   
0              Miner                  3      Victor  @LA PERLA NEGRA   
1          Ice Golem                  9      Victor  @LA PERLA NEGRA   
2      Spear Goblins                 12      Victor  @LA PERLA NEGRA   
3       Minion Horde                 12      Victor  @LA PERLA NEGRA   
4      Inferno Tower                  8      Victor  @LA PERLA NEGRA   
5            The Log                  2      Victor  @LA PERLA NEGRA   
6      Skeleton Army                  6      Victor  @LA PERLA NEGRA   
7           Fireball                 10      Victor  @LA PERLA NEGRA   
8        Royal Giant                 13       Supr4         battusai   
9         Ice Wizard                  2       Supr4         battusai   
10            Bomber                 12       Supr4         battusai   
11            Knight                 12       Supr4         battusai   
12          Fireball                  9       Supr4         battusai   
13        Barbarians                 12       Supr4         battusai   
14           The Log                  2       Supr4         battusai   
15           Archers                 12       Supr4         battusai   

   player.trophy  
0           4258  
1           4258  
2           4258  
3           4258  
4           4258  
5           4258  
6           4258  
7           4258  
8           4325  
9           4325  
10          4325  
11          4325  
12          4325  
13          4325  
14          4325  
15          4325  
0           4237  
1           4237  
2           4237  
3           4237  
4           4237  
5           4237  
6           4237  
7           4237  
8           4296  
9           4296  
10          4296  
11          4296  
12          4296  
13          4296  
14          4296  
15          4296  
0           4300  
1           4300  
2           4300  
3           4300  
4           4300  
5           4300  
6           4300  
7           4300  
8           4267  
9           4267  
10          4267  
11          4267  
12          4267  
13          4267  
14          4267  
15          4267

And df.iloc[0] is as follows:

player.troop.name Mega Minion
player.troop.level         9
player.name         gpa raid
player.trophy           4258
player.clan       TwoFiveOne
Name: 0, dtype: object

You can rework the json_normalize parameters how you see fit, but I hope this is more than enough to get you going

Upvotes: 2

James Dellinger
James Dellinger

Reputation: 1261

According to this dataset's synopsis on kaggle, each dictionary represents a match between two players. I felt it would make sense to have each row in the dataframe represent all the characteristics of a single match.

This can be accomplished in a few short steps.

  1. Store all the match dictionaries (each row of the dataset from kaggle) inside one list:
matches = [
{'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'},
{'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'},
{'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '@LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}
]
  1. Create a dataframe from the above list, which will automatically populate columns that contain info for the type, time, and result of the match:
df = pd.DataFrame(matches)
  1. Then, use some simple logic to populate columns containing info on the deck, trophy, clan, and name of both the left and right players in the match:
sides = ['right', 'left']
player_keys = ['deck', 'trophy', 'clan', 'name']

for side in sides:
    for key in player_keys:
        for i, row in df.iterrows():
            df[side + '_' + key] = df['players'].apply(lambda x: x[side][key])

df = df.drop('players', axis=1) # no longer need this after populating the other columns

df = df.iloc[:, ::-1] # made sense to display columns in order of player info from left to right,
                      # followed by general match info at the far right of the dataframe

The resulting dataframe looks like this:

    left_name   left_clan   left_trophy   left_deck                                           right_name    right_clan  right_trophy    right_deck                                          type    time         result
0   Supr4       battusai           4325   [[Fireball, 9], [Archers, 12], [Goblins, 12], ...   gpa raid      TwoFiveOne          4258    [[Mega Minion, 9], [Electro Wizard, 3], [Arrow...   ladder  2017-07-12   [2, 0]
1   Supr4       battusai           4296   [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ...   TITAN The     Wolves              4237    [[Ice Spirit, 10], [Valkyrie, 9], [Hog Rider, ...   ladder  2017-07-12   [1, 0]
2   Supr4       battusai           4267   [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ...   Victor        @LA PERLA NEGRA     4300    [[Miner, 3], [Ice Golem, 9], [Spear Goblins, 1...   ladder  2017-07-12   [0, 1]

Upvotes: 2

Related Questions