For loop excluding some DataFrame rows based on a specified value

Question

I have a folder with 10000+ files, which contain the data of 10 variables (X1, X2, ..., X10).

The name of the files is just File1.json, File2.json etc.

I need to create a dataframe for each variable, i.e. 10 DataFrames.

INPUT

VARIABLES= [X1, X2, ..., X10]
FILES= [File1.json, File2.json, ... File14347.json]

Desired output

Dataframe of X1, X2, ..., X10

I am doing the following

for i in range(0, len(VARIABLES)):
    %reset_selective -f "^DATA$"
    DATA=pd.DataFrame()
    Data_name=VARIABLES[i]
    print(Data_name)
    for ii in range(0, len(FILES)):

        file_name1='Directory/'
        file_name2= FILES[ii]
        file_name=file_name1+file_name2
        with open(file_name, 'r') as fer:
             data1 = json.load(fer)
        df = pd.DataFrame({'count': data1})

        Var_namei=df['count']['consistname']
        if Var_namei==Data_name:
            #create Dataframe

The code works fine for the first variable, since I do not know in advance which files contain the data of X1.

However, from the second iteration there is no point in reopening every single file in order to find the data of X2. Similarly, I should open only the files of X10 when I reach the last iteration.

I want to avoid opening files whose data has already been used as input to a DataFrame. For example, File2 contains values of X1, so I don't want to open File2 again when I'm looking for the values of X2, X3, etc.

I've tried to add

k.iloc[ii,i]= ii

after the if condition, where k is a DataFrame of zeros of size (files, variables). The idea is to mark row ii, column i whenever file ii turns out to hold variable i, so that I can skip those files in the following iterations. However, I am unable to access the values of k during the for loop.

Any suggestions? Thank you.

gdlmx · Accepted Answer

Welcome to SO. Your code can be much simpler with a little refactoring.

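import json
import pandas as pd

# FILES and VARIABLES are the lists defined in the question
file_name1='Directory/'
FileDATA={}
for file_name2 in FILES:
    file_name=file_name1+file_name2
    with open(file_name, 'r') as fer:
        data1 = json.load(fer)
    if data1['consistname'] in VARIABLES:
        # Save the data1 object to FileDATA, keyed by its variable name.
        # Assuming that every element in VARIABLES is unique.
        # Note: if several files share a consistname, a later file overwrites an earlier one here.
        Data_name=data1['consistname']
        FileDATA[Data_name] = data1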

for Data_name in VARIABLES:
    data1 = FileDATA[Data_name]
    df = pd.DataFrame({'count': data1})
    # create Dataframe

The first loop goes through all files once and saves the data that corresponds to one of the variables [X1, X2, ..., X10] in a dictionary, FileDATA. Then you can loop over the variables to process the data.

After removing the unnecessary lines:

FileDATA={}
for file_name2 in FILES:
    with open('Directory/' + file_name2, 'r') as fer:
        data1 = json.load(fer)
    if data1['consistname'] in VARIABLES:
        FileDATA[data1['consistname']] = data1

for Data_name in VARIABLES:
    df = pd.DataFrame({'count': FileDATA[Data_name]})
    # create Dataframe
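
Since the question mentions 10000+ files for only 10 variables, several files probably share the same consistname, and a plain dictionary assignment keeps only one of them per variable. A minimal sketch of the same single-pass idea that keeps every matching file could look like the following. It assumes each JSON file parses to a dict with a 'consistname' key, as in the code above; the function name group_records_by_variable and the names wanted and grouped are made up for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_records_by_variable(directory, files, variables):
    """Open each file exactly once and group the parsed JSON by 'consistname'.

    Hypothetical helper: assumes every file holds a JSON object (dict).
    """
    wanted = set(variables)      # set lookup avoids scanning the list each time
    grouped = defaultdict(list)  # variable name -> list of parsed JSON objects
    for name in files:
        with open(Path(directory) / name, "r") as fh:
            record = json.load(fh)
        key = record.get("consistname")  # None when the key is absent
        if key in wanted:
            grouped[key].append(record)
    # Return a plain dict so a later lookup cannot silently create empty entries
    return dict(grouped)
```

Each file is read once, so no bookkeeping of already-used files is needed. Afterwards you can loop over VARIABLES and build one DataFrame per variable from grouped.get(Data_name, []), for instance with pd.DataFrame(...) on that list if the records are flat dicts. How exactly to build the frame depends on the JSON layout, which the question does not show.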
