Reputation: 3084
I have a dataframe that is similar to the example below:
sample = {'col1': [50.6, 30.67, 40.5, 0, 0, 0],
'col2': [40.74, 30.33, 41.00, 0, 0, 0]}
df_sample = pd.DataFrame(sample)
Now, in both col2
and col3
however, the entries represent two different values. For example, for entry 50.6
, that represents val1 = 5
and val2 = 0.6
. Another example would be 41.00
. This value represents 4
and 1.0
.
Basically, what I want to get is a column which can be computed as follows:
df_sample['res'] = df_sample.apply(lambda x:
((x['col2']//10)*(x['col2']%10) + (x['col3']//10)*(x['col3']%10))
/ (x['col2']//10 + x['col3']//10), axis=1)
df_sample.fillna(0)
Basically, it gets the weighted average from the values obtained from each column. Now, what I want to do is scale this method to work with let's say twenty columns without hardcoding each column name in the DataFrame. Please advise.
Upvotes: 1
Views: 50
Reputation: 863531
You can omit apply
and rather use Series
(columns of Dataframes
):
sample = {'col2': [50.6, 30.67, 40.5, 0, 0, 0],
'col3': [40.74, 30.33, 41.00, 0, 0, 0],
'col4': [70.6, 80.67, 70.5, 0, 0, 0],
'col5': [10.74, 50.33, 51.00, 0, 0, 0]}
df_sample = pd.DataFrame(sample)
print (df_sample)
col2 col3 col4 col5
0 50.60 40.74 70.60 10.74
1 30.67 30.33 80.67 50.33
2 40.50 41.00 70.50 51.00
3 0.00 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00
I think you need:
print ((((df_sample['col2']//10 * df_sample['col2']%10) +
(df_sample['col3']//10 * df_sample['col3']%10) +
(df_sample['col4']//10 * df_sample['col4']%10) +
(df_sample['col5']//10 * df_sample['col5']%10))
/ (df_sample['col2']//10 + df_sample['col3']//10 +
df_sample['col4']//10 + df_sample['col5']//10)).fillna(0))
0 0.641176
1 0.526842
2 0.725000
3 0.000000
4 0.000000
5 0.000000
dtype: float64
print (((df_sample//10 * df_sample%10).sum(axis=1).div((df_sample//10).sum(axis=1)))
.fillna(0))
0 0.641176
1 0.526842
2 0.725000
3 0.000000
4 0.000000
5 0.000000
dtype: float64
Timings:
In [114]: %timeit ((((df_sample['col2']//10 * df_sample['col2']%10) + (df_sample['col3']//10 * df_sample['col3']%10) + (df_sample['col4']//10 * df_sample['col4']%10) + (df_sample['col5']//10 * df_sample['col5']%10)) / (df_sample['col2']//10 + df_sample['col3']//10 + df_sample['col4']//10 + df_sample['col5']//10)).fillna(0))
100 loops, best of 3: 2.03 ms per loop
In [115]: %timeit (((df_sample//10 * df_sample%10).sum(axis=1).div((df_sample//10).sum(axis=1))).fillna(0))
1000 loops, best of 3: 897 µs per loop
Upvotes: 0
Reputation: 29719
Just create a subset of the columns you want to use for computing and you can perform the operation on the subsetted df
itself and not calling functions on every series object:
np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 100, (100, 25))).add_prefix('col')
df.shape
(100, 25)
# Take first 20 columns (for eg)
df_sample = df.iloc[:, :20]
df['res'] = (df_sample // 10 * df_sample % 10).sum(1)/(df_sample // 10).sum(1)
Upvotes: 1