Reputation: 37
I have a folder with 10000+ files, which contain the data of 10 variables (X1, X2, ..., X10).
The name of the files is just File1.json, File2.json etc.
I need to create a dataframe for each variable, i.e. 10 DataFrames.
INPUT
Desired output
I am doing the following
for i in range(0, len(VARIABLES)):
    %reset_selective -f "^DATA$"
    DATA=pd.DataFrame()
    Data_name=VARIABLES[i]
    print(Data_name)
    for ii in range(0, len(FILES)):
        file_name1='Directory/'
        file_name2= FILES[ii]
        file_name=file_name1+file_name2
        with open(file_name, 'r') as fer:
            data1 = json.load(fer)
        df = pd.DataFrame({'count': data1})
        Var_namei=df['count']['consistname']
        if Var_namei==Data_name:
            #create Dataframe
Opening every file is unavoidable for the first variable, since I do not know in advance which files contain the data of X1.
However, from the second iteration there is no point in reopening every single file in order to find the data of X2. Similarly, I should open only the files of X10 when I reach the last iteration.
I want to avoid opening/considering files whose data have been already used as an input to a DataFrame, e.g. File2 contains values of X1, therefore I don't want to open again File2 when I'm looking for the values of X2, X3, etc
I've tried to add
k.iloc[ii,i]= ii
after the if condition, where k is a DataFrame of zeros with shape (files, variables), in order to mark row ii of column i when file ii belongs to variable i. That way I could skip those files in later iterations. However, I am unable to access the values of k during the for loop.
Any suggestions? Thank you.
Upvotes: 2
Views: 84
Reputation: 6789
Welcome to SO. Your code can be much simpler with a little refactoring.
file_name1='Directory/'
FileDATA={}
for file_name2 in FILES:
    file_name=file_name1+file_name2
    with open(file_name, 'r') as fer:
        data1 = json.load(fer)
    if data1['consistname'] in VARIABLES:
        # Save the data1 object to FileDATA
        # Assuming that every element in VARIABLES is unique
        Data_name=data1['consistname']
        FileDATA[Data_name] = data1
for Data_name in VARIABLES:
    data1 = FileDATA[Data_name]
    df = pd.DataFrame({'count': data1})
    # create Dataframe
The first loop goes through all files once and saves the data that corresponds to a variable in [X1, X2, ..., X10] in a dictionary, FileDATA. Then you can loop over the variables to process the data.
After removing the unnecessary lines:
FileDATA={}
for file_name2 in FILES:
    with open('Directory/' + file_name2, 'r') as fer:
        data1 = json.load(fer)
    if data1['consistname'] in VARIABLES:
        FileDATA[data1['consistname']] = data1
for Data_name in VARIABLES:
    df = pd.DataFrame({'count': FileDATA[Data_name]})
    # create Dataframe
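One caveat: FileDATA[data1['consistname']] = data1 keeps only one file per variable (the last one read). If several files belong to the same variable, which seems likely with 10000+ files and 10 variables, one option is to collect the parsed files in a list per variable while still opening each file only once. The following is a minimal sketch, not a definitive implementation. It assumes every file holds a JSON object with a 'consistname' key; the function name group_files_by_variable is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_files_by_variable(directory, variables):
    """Read each JSON file in `directory` once and group the parsed
    contents by their 'consistname' value.

    Assumption: every file contains a JSON object (a dict) that has a
    'consistname' key naming the variable it belongs to.
    """
    wanted = set(variables)       # fast membership test
    grouped = defaultdict(list)   # variable name -> list of parsed files
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, "r") as fer:
            data1 = json.load(fer)
        name = data1.get("consistname")
        if name in wanted:
            # Append instead of overwrite, so no file is lost when
            # several files share the same variable.
            grouped[name].append(data1)
    return grouped
```

Each list can then be turned into one frame per variable, for example with pd.DataFrame(grouped['X1']), assuming the records are flat dictionaries. Variables that never appear in any file are simply absent from the result, so check for them before indexing.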
Upvotes: 1