Shivam

Reputation: 171

Reading multiple files in a folder and creating a pandas DataFrame

I am reading large pickle files into pandas DataFrames. I loaded one of them and it loads the way I need. However, I have a folder of 40 pickle files named imdbnames0.pkl, imdbnames1.pkl, imdbnames2.pkl, ..., imdbnames40.pkl.

I want to load them all in the same way as below and merge them into a single pandas DataFrame.

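import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older: from pandas.io.json import json_normalize

# Open in binary mode; the with block closes the file afterwards
with open("ethnicity_files/imdbnames1.pkl", 'rb') as fh:
    d = pickle.load(fh)
df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
df.head()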



    names         ethnicity        score  best
0   !Gubi Tietie  Asian            0.03   GreaterEuropean
1   !Gubi Tietie  GreaterAfrican   0.01   GreaterEuropean
2   !Gubi Tietie  GreaterEuropean  0.96   GreaterEuropean
3   !Gubi Tietie  British          0.17   WestEuropean
4   !Gubi Tietie  Jewish           0.13   WestEuropean
5   !Gubi Tietie  WestEuropean     0.65   WestEuropean
6   !Gubi Tietie  EastEuropean     0.05   WestEuropean
7   !Gubi Tietie  Nordic           0.00   Italian
8   !Gubi Tietie  Italian          0.69   Italian
9   !Gubi Tietie  Hispanic         0.12   Italian
10  !Gubi Tietie  French           0.16   Italian
11  !Gubi Tietie  Germanic         0.02   Italian
12  $2 Tony       Asian            0.00   GreaterEuropean
13  $2 Tony       GreaterAfrican   0.00   GreaterEuropean
14  $2 Tony       GreaterEuropean  1.00   GreaterEuropean
15  $2 Tony       British          0.00   WestEuropean
16  $2 Tony       Jewish           0.00   WestEuropean
17  $2 Tony       WestEuropean     1.00   WestEuropean
18  $2 Tony       EastEuropean     0.00   WestEuropean
19  $2 Tony       Nordic           0.00   Italian

One of the files is available here: https://drive.google.com/file/d/10cjsoWFJ46w-2lEsxh6hmuRZlLunatf-/view?usp=sharing

I just want to add them all in one pandas dataframe.

Upvotes: 0

Views: 7671

Answers (2)

O.Suleiman

Reputation: 918

I think you need os.listdir():

# Be careful: this might raise a MemoryError if you
# don't have enough RAM for all your files.
# Make sure the folder contains only the files you want to read.
import os
import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

files = os.listdir('ethnicity_files/')

list_of_dfs = []
for file in files:
    # pickle.load needs an open binary file object, not a path string
    with open(os.path.join('ethnicity_files/', file), 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    list_of_dfs.append(df)
# ignore_index=True resets the index of big_df
big_df = pd.concat(list_of_dfs, ignore_index=True)
big_df.head()
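
The warning about stray files in the folder can also be handled in code by filtering on the file name. The following is only a sketch under some assumptions: the helper names `load_pickle_as_frame` and `load_folder` are made up for illustration, the `imdbnames` prefix is taken from the question, and `pd.json_normalize` assumes pandas 1.0 or newer.

```python
import os
import pickle

import pandas as pd


def load_pickle_as_frame(path):
    """Load one pickle file and flatten it the same way the question does."""
    with open(path, "rb") as fh:
        d = pickle.load(fh)
    df = pd.concat({k: pd.json_normalize(v, "scores", ["best"]) for k, v in d.items()})
    return df.reset_index(level=1, drop=True).rename_axis("names").reset_index()


def load_folder(folder, prefix="imdbnames", suffix=".pkl"):
    """Concatenate every matching pickle file in `folder` into one DataFrame."""
    # Keep only names such as imdbnames3.pkl; anything else in the folder is skipped.
    names = sorted(
        f for f in os.listdir(folder) if f.startswith(prefix) and f.endswith(suffix)
    )
    frames = [load_pickle_as_frame(os.path.join(folder, f)) for f in names]
    # pd.concat raises ValueError if no file matched, since `frames` is then empty.
    return pd.concat(frames, ignore_index=True)
```

It could be called as `big_df = load_folder('ethnicity_files/')`. All frames are still held in memory at once, so the memory caveat above still applies.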

Upvotes: 2

Pratik Kumar

Reputation: 2231

You can use glob.glob to iterate over all the files in the current folder that have a specific extension (.pkl in your case):

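import os
import glob
import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

cd = os.getcwd()
os.chdir('path_to_your_folder')

dfs = []  # collect one DataFrame per file instead of overwriting df
for file in glob.glob("*.pkl"):
    with open(file, 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    dfs.append(df)
os.chdir(cd)  # restore the original working directory
df = pd.concat(dfs, ignore_index=True)
print(df.head())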
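
A variation on the same idea avoids changing the working directory by globbing through `pathlib`, and reads the files in numeric order (imdbnames2.pkl before imdbnames10.pkl). This is a rough sketch, not a tested drop-in: `numeric_suffix` and `iter_frames` are hypothetical helper names, and `pd.json_normalize` assumes pandas 1.0 or newer.

```python
import pickle
import re
from pathlib import Path

import pandas as pd


def numeric_suffix(path):
    """Return the trailing number in a stem such as 'imdbnames12', or -1 if none."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else -1


def iter_frames(folder, pattern="*.pkl"):
    """Yield one flattened DataFrame per pickle file, in numeric file order."""
    for path in sorted(Path(folder).glob(pattern), key=numeric_suffix):
        with path.open("rb") as fh:
            d = pickle.load(fh)
        df = pd.concat({k: pd.json_normalize(v, "scores", ["best"]) for k, v in d.items()})
        yield df.reset_index(level=1, drop=True).rename_axis("names").reset_index()
```

The combined frame would then be `pd.concat(list(iter_frames('path_to_your_folder')), ignore_index=True)`. Row order only matters if later steps depend on it, so the numeric sort is optional.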

Upvotes: 1
