astroluv

Reputation: 793

How to calculate the mean and standard deviation of multiple dataframes at one go?

I have several hundred pandas dataframes, and the number of rows is not the same in all of them: some have 600 rows, while others have only 540.

I have two samples, each containing exactly the same number of dataframes, and I want to read all of the dataframes (around 2000) from both samples. This is what the data looks like, followed by the code I use to read the files:

5113.440  1     0.25846     0.10166    27.96867     0.94852    -0.25846   268.29305     5113.434129
5074.760  3     0.68155     0.16566   120.18771     3.02654    -0.68155   101.02457     5074.745627
5083.340  2     0.74771     0.13267   105.59355     2.15700    -0.74771   157.52406     5083.337081
5088.150  1     0.28689     0.12986    39.65747     2.43339    -0.28689   164.40787     5088.141849
5090.780  1     0.61464     0.14479    94.72901     2.78712    -0.61464   132.25865     5090.773443

# First sample
import os
import pandas as pd

path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
   df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
   # For each duplicated 'wave', keep the row with the largest 'stlines'
   df = df.sort_values('stlines', ascending=False)
   df = df.drop_duplicates('wave')
   df = df.reset_index(drop=True)
   lst.append(df)


# Second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
   df1 = pd.read_table(path_to_files1+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
   df1 = df1.sort_values('stlines', ascending=False)
   df1 = df1.drop_duplicates('wave')
   df1 = df1.reset_index(drop=True)
   lst1.append(df1)

The data is now stored in two lists, but because the dataframes do not all have the same number of rows, I can't subtract them directly.

How can I subtract them correctly? After that, I want to take the mean of the residuals and build a dataframe from it.

Upvotes: 0

Views: 341

Answers (2)

PMende

Reputation: 5460

You shouldn't use apply. Just use Boolean masking:

# lower_outlier and upper_outlier are the bounds you choose
mask = df['wave'].between(lower_outlier, upper_outlier)
df[mask].plot(x='wave', y='stlines')
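
For reference, here is a minimal self-contained sketch of the masking step. The column names follow the question, but the values and the two bounds are made up for illustration, and the plotting call is left out so that it runs without matplotlib.

```python
import pandas as pd

# Toy data: column names follow the question, values are illustrative only.
df = pd.DataFrame({
    'wave': [5074.76, 5083.34, 5088.15, 5090.78, 5113.44],
    'stlines': [0.68155, 0.74771, 0.28689, 0.61464, 0.25846],
})

# Hypothetical bounds, chosen only for this example.
lower_outlier, upper_outlier = 5080.0, 5100.0

# Series.between is inclusive on both ends by default and returns a Boolean Series.
mask = df['wave'].between(lower_outlier, upper_outlier)

# Indexing with the mask keeps only the rows where it is True.
in_bounds = df[mask]
```

Because the mask is computed on the whole column at once, it avoids calling a Python function per row, which is the main reason to prefer it over apply.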

Upvotes: 1

An economist

Reputation: 1311

One solution that comes to mind is to write a function that checks each value against upper and lower bounds, and then slice the dataframes with the resulting Boolean index, e.g.:

df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})

df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})

def outlier(value, upper, lower):
    """
    Return True if value lies within [lower, upper], i.e. it is NOT an outlier
    """
    # Check if input value is within bounds (inclusive on both ends)
    in_bounds = lower <= value <= upper

    return in_bounds

# Boolean mask over the wave column of DF1: True where the value is within bounds
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))

# Return DF2 without the rows flagged as outliers
df2[outlier_index]

# Return DF1 without the rows flagged as outliers
df1[outlier_index]
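
The same idea of selecting matching rows can be carried over to the subtraction asked about in the question. The sketch below is only one possible reading: it assumes that rows from the two samples should be paired by identical 'wave' values and that 'EWs' is the column to subtract. Neither assumption is confirmed by the question, and the frames and numbers are made up.

```python
import pandas as pd

# Made-up frames of different lengths; df_a has one wavelength that df_b lacks.
df_a = pd.DataFrame({'wave': [5074.76, 5083.34, 5088.15, 5113.44],
                     'EWs': [120.0, 105.0, 40.0, 28.0]})
df_b = pd.DataFrame({'wave': [5074.76, 5083.34, 5113.44],
                     'EWs': [118.0, 106.0, 25.0]})

# An inner merge keeps only the wavelengths present in both frames,
# so the unequal row counts no longer matter.
merged = df_a.merge(df_b, on='wave', suffixes=('_a', '_b'))

# Residual per matched line, then its mean and sample standard deviation.
merged['residual'] = merged['EWs_a'] - merged['EWs_b']
mean_residual = merged['residual'].mean()
std_residual = merged['residual'].std()  # pandas uses ddof=1 by default
```

For the two lists in the question, this could be repeated for each pair from zip(lst, lst1), assuming the files in both directories are listed in a matching order. Note that merging on floating-point wavelengths only pairs values that are exactly equal.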

Upvotes: 1
