Reputation: 903
I have a DataFrame that looks like the example below.
# define DataFrame for reproducibility
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2019-05-06', '2019-05-07', '2019-05-07', '2019-05-09', '2019-05-10', '2019-05-11'],
                   'Identifier': [1, 1, 1, 1, 1, 1],
                   'B': [2.4, 3.9, 3.9, 4.3, 2.5, 3.14],
                   'C': [0.214, 0.985, 0.985, 0.839, 0.555, 0.159],
                   'Name': [np.nan, "CD", "AD", np.nan, np.nan, np.nan]})
print(df)
date Identifier B C Name
0 2019-05-06 1 2.40 0.214 NaN
1 2019-05-07 1 3.90 0.985 CD
2 2019-05-07 1 3.90 0.985 AD
3 2019-05-09 1 4.30 0.839 NaN
4 2019-05-10 1 2.50 0.555 NaN
5 2019-05-11 1 3.14 0.159 NaN
As can be seen, a given identifier can have more than one name; however, each name appears in the DataFrame only once, at a single date. What I need is essentially to forward- and backward-fill every name across all dates. I currently have a solution that works, but it is extremely slow on the full DataFrame I am working with. The code is shown below:
final_df = pd.DataFrame()
for i in df.Identifier.unique():
    # select the rows for the current identifier
    identifier_df = df.loc[df.Identifier == i]
    # allow a given identifier to have different names
    for n in df.Name.unique():
        if pd.isna(n):
            continue
        intermediate = identifier_df.copy()
        intermediate.loc[:, "Name"] = np.repeat(n, len(intermediate))
        final_df = final_df.append(intermediate)
final_df = final_df.drop_duplicates()
Note that the loop over identifiers is required for my full DataFrame; in this small example it admittedly looks pointless. Nevertheless, this code produces the following DataFrame (which is exactly the output I would like):
print(final_df)
date Identifier B C Name
0 2019-05-06 1 2.40 0.214 CD
1 2019-05-07 1 3.90 0.985 CD
3 2019-05-09 1 4.30 0.839 CD
4 2019-05-10 1 2.50 0.555 CD
5 2019-05-11 1 3.14 0.159 CD
0 2019-05-06 1 2.40 0.214 AD
1 2019-05-07 1 3.90 0.985 AD
3 2019-05-09 1 4.30 0.839 AD
4 2019-05-10 1 2.50 0.555 AD
5 2019-05-11 1 3.14 0.159 AD
Is there any way to perform this operation with a groupby, or is there any other way to make it faster?
Thanks!
Upvotes: 2
Views: 1575
Reputation: 903
One way to speed up this code significantly is to append the intermediate DataFrames to a list first, and then concatenate the list of DataFrames in one final step using pd.concat().
This would make the code look as follows:
final_df = []
for i in df.Identifier.unique():
    # select the rows for the current identifier
    identifier_df = df.loc[df.Identifier == i]
    # allow a given identifier to have different names
    for n in df.Name.unique():
        if pd.isna(n):
            continue
        intermediate = identifier_df.copy()
        intermediate.loc[:, "Name"] = np.repeat(n, len(intermediate))
        final_df.append(intermediate)
final_df = pd.concat(final_df).drop_duplicates()
This simple change decreased execution time by a significant margin. Hopefully it helps someone else as well.
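As an aside, the loops can be avoided entirely with a merge: pair each identifier's rows with its distinct non-null names. This is a hedged sketch, not part of the original answer; it reuses the question's sample data and column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2019-05-06', '2019-05-07', '2019-05-07', '2019-05-09', '2019-05-10', '2019-05-11'],
                   'Identifier': [1, 1, 1, 1, 1, 1],
                   'B': [2.4, 3.9, 3.9, 4.3, 2.5, 3.14],
                   'C': [0.214, 0.985, 0.985, 0.839, 0.555, 0.159],
                   'Name': [np.nan, "CD", "AD", np.nan, np.nan, np.nan]})

# distinct (Identifier, Name) pairs, ignoring the NaN placeholders
names = df.dropna(subset=['Name'])[['Identifier', 'Name']].drop_duplicates()

# every deduplicated row of an identifier is paired with each of its names
final_df = (df.drop(columns='Name')
              .drop_duplicates()
              .merge(names, on='Identifier', how='left'))
print(final_df)  # 10 rows: each of the 5 dates appears once per name
```

The row order differs from the loop version (grouped by date rather than by name), but the content is the same.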
Upvotes: 0
Reputation: 863056
Use itertools.product for all combinations of the 3 columns:
from itertools import product
df1 = pd.DataFrame(list(product(df['date'].unique(),
                                df['Identifier'].unique(),
                                df['Name'].dropna().unique())),
                   columns=['date', 'Identifier', 'Name'])
print (df1)
date Identifier Name
0 2019-05-06 1 CD
1 2019-05-06 1 AD
2 2019-05-07 1 CD
3 2019-05-07 1 AD
4 2019-05-09 1 CD
5 2019-05-09 1 AD
6 2019-05-10 1 CD
7 2019-05-10 1 AD
8 2019-05-11 1 CD
9 2019-05-11 1 AD
Left join by DataFrame.merge and create a MultiIndex by DataFrame.set_index:
df2 = df1.merge(df, how='left').set_index(['date','Identifier'])
Use DataFrame.drop_duplicates to get one row per (date, Identifier), so that the missing values can then be replaced by DataFrame.combine_first:
df3 = df.drop_duplicates(['date','Identifier']).set_index(['date','Identifier'])
print (df3)
B C Name
date Identifier
2019-05-06 1 2.40 0.214 NaN
2019-05-07 1 3.90 0.985 CD
2019-05-09 1 4.30 0.839 NaN
2019-05-10 1 2.50 0.555 NaN
2019-05-11 1 3.14 0.159 NaN
df4 = df2.combine_first(df3).reset_index()
print (df4)
date Identifier B C Name
0 2019-05-06 1 2.40 0.214 CD
1 2019-05-06 1 2.40 0.214 AD
2 2019-05-07 1 3.90 0.985 CD
3 2019-05-07 1 3.90 0.985 AD
4 2019-05-09 1 4.30 0.839 CD
5 2019-05-09 1 4.30 0.839 AD
6 2019-05-10 1 2.50 0.555 CD
7 2019-05-10 1 2.50 0.555 AD
8 2019-05-11 1 3.14 0.159 CD
9 2019-05-11 1 3.14 0.159 AD
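For convenience, the steps above combined into a single runnable script (the sample data is copied from the question):

```python
from itertools import product

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2019-05-06', '2019-05-07', '2019-05-07', '2019-05-09', '2019-05-10', '2019-05-11'],
                   'Identifier': [1, 1, 1, 1, 1, 1],
                   'B': [2.4, 3.9, 3.9, 4.3, 2.5, 3.14],
                   'C': [0.214, 0.985, 0.985, 0.839, 0.555, 0.159],
                   'Name': [np.nan, "CD", "AD", np.nan, np.nan, np.nan]})

# all combinations of date, Identifier and non-null Name
df1 = pd.DataFrame(list(product(df['date'].unique(),
                                df['Identifier'].unique(),
                                df['Name'].dropna().unique())),
                   columns=['date', 'Identifier', 'Name'])

# left join, then index by (date, Identifier)
df2 = df1.merge(df, how='left').set_index(['date', 'Identifier'])

# one row per (date, Identifier) to take the missing B/C values from
df3 = df.drop_duplicates(['date', 'Identifier']).set_index(['date', 'Identifier'])

df4 = df2.combine_first(df3).reset_index()
print(df4)  # 10 rows: B and C filled for every date/name combination
```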
Upvotes: 1
Reputation: 75090
From what I understand, this works if the dates are sorted and each date appears the same number of times:
from itertools import islice, cycle

m = df.name.isna()               # mask of rows where name is NaN
l = df.loc[~m, 'name'].tolist()  # list of the non-null names
df.loc[m, 'name'] = list(islice(cycle(l), len(df[m])))  # repeat the names across all dates and assign to the NaN rows
print(df)
print(df)
date identifier B C name
0 2019-05-07 1 2.4 0.214 AB
1 2019-05-07 1 2.4 0.214 CD
2 2019-05-08 1 3.9 0.985 AB
3 2019-05-08 1 3.9 0.985 CD
4 2019-05-09 1 2.5 0.555 AB
5 2019-05-09 1 2.5 0.555 CD
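The islice/cycle combination works because cycle repeats a finite list endlessly, and islice then truncates that infinite iterator back to the required length. A minimal illustration:

```python
from itertools import cycle, islice

names = ['AB', 'CD']
# cycle(names) yields AB, CD, AB, CD, ... forever;
# islice takes exactly the first 6 elements.
repeated = list(islice(cycle(names), 6))
print(repeated)  # ['AB', 'CD', 'AB', 'CD', 'AB', 'CD']
```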
Upvotes: 2
Reputation: 71590
Try this one-liner using concat, replace, slicing, and ffill:
print(pd.concat([df[::2],df[::2].replace('AB','CD')]).ffill())
Output:
date identifier B C name
0 2019-05-07 1 2.4 0.214 AB
2 2019-05-08 1 3.9 0.985 AB
4 2019-05-09 1 2.5 0.555 AB
0 2019-05-07 1 2.4 0.214 CD
2 2019-05-08 1 3.9 0.985 CD
4 2019-05-09 1 2.5 0.555 CD
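Like the previous answer, this one appears to target the question's sample data before it was edited (lowercase columns, names AB/CD). A hedged reconstruction of that input, so the one-liner can be run end to end:

```python
import numpy as np
import pandas as pd

# assumed input matching this answer's output -- a reconstruction,
# not the original poster's data
df = pd.DataFrame({'date': ['2019-05-07', '2019-05-07', '2019-05-08',
                            '2019-05-08', '2019-05-09', '2019-05-09'],
                   'identifier': [1, 1, 1, 1, 1, 1],
                   'B': [2.4, 2.4, 3.9, 3.9, 2.5, 2.5],
                   'C': [0.214, 0.214, 0.985, 0.985, 0.555, 0.555],
                   'name': ['AB', 'CD', np.nan, np.nan, np.nan, np.nan]})

# df[::2] keeps rows 0, 2, 4 (one row per date: name AB, then NaNs);
# the replaced copy turns AB into CD, and ffill propagates both names down.
out = pd.concat([df[::2], df[::2].replace('AB', 'CD')]).ffill()
print(out)
```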
Upvotes: 0