Below is a code snippet of the task I have to perform. I want to have the best time complexity possible for this task. Is there a library or a better implementation?
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"v_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5, 5],
                    "date": ['01-01-2017', '02-01-2017', '03-01-2017',
                             '02-01-2017', '03-01-2017',
                             '01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017',
                             '05-01-2017', '04-01-2017'],
                    "value": [0.9, 1.5, 2.4, 7.1, 0.4, 1.8, 5.1, 6.4, 7.7, 3.9, 0.3]})
dim1, dim2 = df1.v_id.unique(), df1.date.unique()
dim1_dict = dict(zip(dim1, range(len(dim1))))
dim2_dict = dict(zip(dim2, range(len(dim2))))
value_result = np.empty((len(dim1), len(dim2)), dtype=object)
date_result = np.empty((len(dim1), len(dim2)), dtype=object)
for i in range(len(df1)):
    row = df1.loc[i]
    date_result[dim1_dict[row.v_id], dim2_dict[row.date]] = row.date
    value_result[dim1_dict[row.v_id], dim2_dict[row.date]] = row.value
Given the dataframe df1 constructed above, my goal is to get the date_result matrix (v_id x date):
array([['01-01-2017', '02-01-2017', '03-01-2017', None, None],
[None, '02-01-2017', '03-01-2017', None, None],
['01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017', None],
[None, None, None, '04-01-2017', '05-01-2017']], dtype=object)
and the value_result matrix (v_id x date, holding the values):
array([[0.9, 1.5, 2.4, None, None],
       [None, 7.1, 0.4, None, None],
       [1.8, 5.1, 6.4, 7.7, None],
       [None, None, None, 0.3, 3.9]], dtype=object)
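The dict lookups and the row-by-row `df1.loc[i]` access in the loop above can also be done fully vectorized with `pd.factorize` plus NumPy fancy indexing, which runs in O(n) without a Python-level loop. A sketch, assuming the same df1 (note that `factorize` assigns integer codes in order of first appearance, which matches how dim1_dict and dim2_dict are built):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"v_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5, 5],
                    "date": ['01-01-2017', '02-01-2017', '03-01-2017',
                             '02-01-2017', '03-01-2017',
                             '01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017',
                             '05-01-2017', '04-01-2017'],
                    "value": [0.9, 1.5, 2.4, 7.1, 0.4, 1.8, 5.1, 6.4, 7.7, 3.9, 0.3]})

# factorize maps each label to an integer code in first-appearance order
row, _ = pd.factorize(df1['v_id'])
col, _ = pd.factorize(df1['date'])

# None-filled object arrays, then one vectorized scatter per result
value_result = np.full((row.max() + 1, col.max() + 1), None, dtype=object)
date_result = value_result.copy()
value_result[row, col] = df1['value'].values
date_result[row, col] = df1['date'].values
```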
Upvotes: 0
Views: 98
You could perform this calculation using pivot_table:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"v_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5, 5],
                    "date": ['01-01-2017', '02-01-2017', '03-01-2017',
                             '02-01-2017', '03-01-2017',
                             '01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017',
                             '05-01-2017', '04-01-2017'],
                    "value": [0.9, 1.5, 2.4, 7.1, 0.4, 1.8, 5.1, 6.4, 7.7, 3.9, 0.3]})
date_result = (df1.assign(date2=df1['date'])
                  .pivot_table(columns='date', index='v_id',
                               values='date2', aggfunc='first')
                  .values)
value_result = df1.pivot_table(columns='date', index='v_id',
                               values='value', aggfunc='first').values
print(date_result)
print(value_result)
yields a date_result of:
array([['01-01-2017', '02-01-2017', '03-01-2017', None, None],
[None, '02-01-2017', '03-01-2017', None, None],
['01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017', None],
[None, None, None, '04-01-2017', '05-01-2017']], dtype=object)
and a value_result of:
array([[ 0.9, 1.5, 2.4, nan, nan],
[ nan, 7.1, 0.4, nan, nan],
[ 1.8, 5.1, 6.4, 7.7, nan],
[ nan, nan, nan, 0.3, 3.9]])
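Since each (v_id, date) pair occurs at most once in this df1, `pivot` is a possible alternative to `pivot_table`: it needs no aggfunc and raises a ValueError on duplicate pairs instead of silently aggregating them, which can catch data errors early. A sketch, assuming the same df1 as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"v_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5, 5],
                    "date": ['01-01-2017', '02-01-2017', '03-01-2017',
                             '02-01-2017', '03-01-2017',
                             '01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017',
                             '05-01-2017', '04-01-2017'],
                    "value": [0.9, 1.5, 2.4, 7.1, 0.4, 1.8, 5.1, 6.4, 7.7, 3.9, 0.3]})

# pivot requires unique (index, columns) pairs; duplicates raise ValueError
value_result = df1.pivot(index='v_id', columns='date', values='value').values
```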
Note that value_result is a NumPy array with floating-point dtype, and that the missing values are represented by NaNs, not None. You can convert it to a NumPy array with object dtype and missing values of None by using
value_result = np.where(pd.isnull(value_result), None, value_result)
which yields
array([[0.9, 1.5, 2.4, None, None],
[None, 7.1, 0.4, None, None],
[1.8, 5.1, 6.4, 7.7, None],
[None, None, None, 0.3, 3.9]], dtype=object)
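Alternatively, if you keep the pivoted DataFrame around rather than extracting `.values` immediately, the NaN-to-None conversion can be done on the DataFrame itself with `where`. A sketch, assuming the same df1:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"v_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5, 5],
                    "date": ['01-01-2017', '02-01-2017', '03-01-2017',
                             '02-01-2017', '03-01-2017',
                             '01-01-2017', '02-01-2017', '03-01-2017', '04-01-2017',
                             '05-01-2017', '04-01-2017'],
                    "value": [0.9, 1.5, 2.4, 7.1, 0.4, 1.8, 5.1, 6.4, 7.7, 3.9, 0.3]})

pivoted = df1.pivot_table(columns='date', index='v_id',
                          values='value', aggfunc='first')
# Cast to object dtype first so None survives, then replace NaN with None:
# where() keeps cells where the condition is True and fills the rest
value_result = pivoted.astype(object).where(pivoted.notnull(), None).values
```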
Upvotes: 2