Reputation: 1815
I have a custom metric implemented as follows:
import pandas as pd
from datetime import timedelta

def metric(pred: pd.DataFrame, valid: pd.DataFrame):
    date_begin = valid.dt.min()
    date_end = valid.dt.max()
    # earliest date with a positive label
    x = valid[valid.label == 1].dt.min()
    # p
    p_n_tpp_df = valid[(valid.dt >= x) &
                       (valid.dt <= x + timedelta(days=30)) &
                       (valid.label == 1)]
    p_n_pp_df = valid[(valid.dt >= date_begin + timedelta(days=30)) &
                      (valid.dt <= date_end + timedelta(days=30)) &
                      (valid.label == 1)]
    p_n_tpp = len([sn for sn in pred.serial_number.values
                   if sn in p_n_tpp_df.serial_number.unique()])
    p_n_pp = len([sn for sn in pred.serial_number.values
                  if sn in p_n_pp_df.serial_number.unique()])
    p = p_n_tpp / p_n_pp
    print('p: ', p)
    # r
    p_n_tpr_df = valid[(valid.dt >= date_begin - timedelta(days=30)) &
                       (valid.dt <= date_end - timedelta(days=30)) &
                       (valid.label == 1)]
    p_n_pr_df = valid[(valid.dt >= date_begin) &
                      (valid.dt <= date_end) &
                      (valid.label == 1)]
    p_n_tpr = len([sn for sn in pred.serial_number.values
                   if sn in p_n_tpr_df.serial_number.unique()])
    p_n_pr = len([sn for sn in pred.serial_number.values
                  if sn in p_n_pr_df.serial_number.unique()])
    r = p_n_tpr / p_n_pr
    print('r: ', r)
    # harmonic mean of p and r
    m = 2 * p * r / (p + r)
    return m
The DataFrames pred and valid have the same columns, and their dt values do not overlap.
All values of serial_number in valid are a subset of the values of serial_number in pred.
The label column has only two values: 0 or 1.
Here is a sample of pred and valid:
print(pred.head(3))
   serial_number         dt  label
0            123 2011-03-21      1
1             52 2011-03-22      0
2             12 2011-03-01      1
..., ...
print(pred.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number    int32
dt               datetime64[ns]
label            int8
..., ...
print(valid.head(3))
   serial_number         dt  label
0            324 2011-04-22      1
1             52 2011-04-22      0
2             14 2011-04-01      1
..., ...
print(valid.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number    int32
dt               datetime64[ns]
label            int8
Each input DataFrame has about 10,000,000 rows and 3 columns.
Computing this metric is really slow: it takes more than 2 hours on an Intel 9600KF.
How can I optimize this code to reduce its running time?
Thanks in advance.
Upvotes: 5
Views: 322
Reputation: 2553
Here is the biggest performance win in the code that you have:
len([x for x in pred.serial_number.values\
if x in p_n_tpr_df.serial_number.unique()])
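Any line that looks like this computes the size of the set intersection of pred.serial_number and p_n_tpr_df.serial_number (the two counts agree as long as pred.serial_number has no repeated values). It is slow because unique() is recomputed, and its result scanned, for every element of pred. Using NumPy instead of the list comprehension and the unique call will save substantial compute time: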
import numpy as np

intersect_size = np.intersect1d(pred.serial_number.values,
                                p_n_tpr_df.serial_number.values).shape[0]
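As a rough illustration of this suggestion, the sketch below compares the counts on small made-up arrays. The helper names count_unique_matches and count_row_matches are hypothetical, and the data is invented for the example. np.intersect1d counts distinct serial numbers, whereas np.isin(...).sum() counts every matching row of pred, which is what the original list comprehension does if pred contains repeated serial numbers.

```python
import numpy as np


def count_unique_matches(pred_serials, valid_serials):
    """Number of distinct serial numbers present in both arrays."""
    return np.intersect1d(pred_serials, valid_serials).shape[0]


def count_row_matches(pred_serials, valid_serials):
    """Number of entries of pred_serials that occur in valid_serials.

    Mirrors the original list comprehension, duplicates included.
    """
    return int(np.isin(pred_serials, valid_serials).sum())


# Made-up data for illustration only (not taken from the question).
pred_serials = np.array([123, 52, 12, 52, 7], dtype=np.int32)
valid_serials = np.array([52, 14, 12, 12], dtype=np.int32)

unique_matches = count_unique_matches(pred_serials, valid_serials)
row_matches = count_row_matches(pred_serials, valid_serials)

# The slow pattern from the question, kept here only for comparison.
slow_matches = len([s for s in pred_serials
                    if s in np.unique(valid_serials)])

print(unique_matches, row_matches, slow_matches)
```

If serial numbers in pred can repeat and each row should count, the np.isin variant is probably the closer drop-in replacement; if only distinct serial numbers matter, np.intersect1d is enough.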
Upvotes: 6