Reputation: 23
For performance reasons I had to split my data into several dataframes. Each frame has 300 columns and 800,000 rows.
The dataframes are named df0, df1 ... df29.
I have created them like this:
for j in range(30):
    globals()['df' + str(j)] = pd.read_parquet('C:\\Users\\helle\\Documents\\Jenny_Analytics\\train.parquet', columns=chunks[j])
Now I want to check data quality for all data frames and I want to do something like this:
df_describeAll = pd.DataFrame(columns=["Dataframe","Count"])
df_describeAll.head()
for j in range(0, 29):
    cnt = df{j}.count()
    for i in range(0, cnt.size):
        if cnt[i] < 800000:
            df_describeAll["Dataframe"] = 'df' + str(j)
            df_describeAll["Count"] = df.count()
My current problem is the line cnt=df{j}.count()
. I also tried ['df' + str(j)].count()
, but it never recognizes the variable as a dataframe. If I call df0.count()
or df10.count()
directly, it returns a series as expected.
So what I want to do is iterate over all the dataframes and record them in my df_describeAll
whenever a column has a count below 800,000.
I think I am misusing the global variables, and I would really appreciate any help. Thanks in advance!
Upvotes: 0
Views: 1094
Reputation: 13393
Just load them into a list, then access them by index:
dfs = [pd.read_parquet('C:\\Users\\helle\\Documents\\Jenny_Analytics\\train.parquet', columns=chunks[j]) for j in range(30)]

for j in range(30):  # range(0, 29) would skip the last frame
    cnt = dfs[j].count()
    ...
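Putting it together, the quality check can then append one row per under-filled column. A minimal sketch, assuming the list-of-dataframes approach above; the tiny in-memory frames and the `threshold` of 3 are stand-ins for the real parquet chunks and the 800,000 row count:

```python
import pandas as pd

# Hypothetical small frames standing in for the real parquet chunks.
dfs = [
    pd.DataFrame({"a": [1, None, 3], "b": [1, 2, 3]}),
    pd.DataFrame({"c": [None, None, 3]}),
]
threshold = 3  # stands in for 800,000

rows = []
for j, df in enumerate(dfs):
    cnt = df.count()  # non-null count per column, as a Series
    for col, c in cnt.items():
        if c < threshold:
            rows.append({"Dataframe": f"df{j}", "Column": col, "Count": c})

# Build the summary once at the end instead of assigning whole
# columns inside the loop (which overwrites previous results).
df_describeAll = pd.DataFrame(rows, columns=["Dataframe", "Column", "Count"])
print(df_describeAll)
```

Collecting plain dicts and constructing the frame once is also much faster than growing a DataFrame row by row.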
Upvotes: 1