Jennifer Hellemann

Reputation: 23

Iterate over dataframes with for loop

For performance reasons I had to split my data into several dataframes. Each frame has 300 columns and 800,000 rows.

The dataframes are named df0, df1 ... df29.

I have created them like this:

for j in range (0,30):
    globals()['df' + str(j)] = pd.read_parquet('C:\\Users\\helle\\Documents\\Jenny_Analytics\\train.parquet', columns=chunks[j])

Now I want to check data quality for all data frames and I want to do something like this:

df_describeAll = pd.DataFrame(columns=["Dataframe","Count"])
df_describeAll.head()

for j in range (0,29):
   cnt=df{j}.count()

    for i in range (0,cnt.size):
         if cnt[i] <800000:
              df_describeAll["Dataframe"]='df' + str(j)
              df_describeAll["Count"]=df.count()

My current problem is the line cnt=df{j}.count(). I also tried ['df' + str(j)].count(), but neither is recognized as a dataframe. If I call df0.count() or df10.count() directly, it returns a Series as expected.

So what I want to do is iterate over all dataframes and add an entry to df_describeAll whenever a column has a count below 800,000.

I think I am misusing the global variable and I really would appreciate any help! Thanks in advance

Upvotes: 0

Views: 1094

Answers (1)

Adam.Er8

Reputation: 13393

Just load them into a list, then access them by index:

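import pandas as pd

# Load each column chunk into a list instead of creating df0 ... df29 via globals()
dfs = [pd.read_parquet('C:\\Users\\helle\\Documents\\Jenny_Analytics\\train.parquet', columns=chunks[j]) for j in range(30)]

# range(len(dfs)) covers all 30 frames; range(0, 29) would stop at df28
for j in range(len(dfs)):
    cnt = dfs[j].count()
    ...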
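
As for why the original attempts fail: df{j} is not valid Python syntax, and ['df' + str(j)] builds a one-element list containing the string 'df0', so .count() there is list.count, not DataFrame.count. With a list of frames no name lookup is needed at all. Below is a rough sketch of how the rest of the check could look. It is only an illustration under assumptions: the three tiny frames stand in for your parquet chunks, EXPECTED_ROWS stands in for 800,000, and the extra "Column" column is my addition so you can see which column is short.

```python
import pandas as pd

# Hypothetical stand-ins for the parquet chunks: three small frames.
# EXPECTED_ROWS plays the role of 800000 in the question.
EXPECTED_ROWS = 4
dfs = [
    pd.DataFrame({"a": [1, 2, 3, 4], "b": [1.0, None, 3.0, 4.0]}),
    pd.DataFrame({"c": [1, 2, 3, 4], "d": [5, 6, 7, 8]}),
    pd.DataFrame({"e": [None, None, 1.0, 2.0], "f": [1, 2, 3, 4]}),
]

rows = []
for j, df in enumerate(dfs):
    cnt = df.count()  # Series with the non-null count of every column
    incomplete = cnt[cnt < EXPECTED_ROWS]  # keep only the short columns
    for column, count in incomplete.items():
        # "Column" is an added field (assumption), not in the original frame
        rows.append({"Dataframe": f"df{j}", "Column": column, "Count": int(count)})

# Build the summary once at the end instead of assigning into it in the loop
df_describeAll = pd.DataFrame(rows, columns=["Dataframe", "Column", "Count"])
print(df_describeAll)
```

Collecting plain dicts and building the DataFrame once at the end avoids overwriting the whole "Dataframe" and "Count" columns on every iteration, which is what the assignments in the question's loop would do.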

Upvotes: 1
