Menno Van Dijk

Reputation: 903

Copy DataFrame with NaN values in Column

I have a DataFrame that looks like the example below.

# define DataFrame for reproducibility
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2019-05-06', '2019-05-07', '2019-05-07', '2019-05-09', '2019-05-10', '2019-05-11'],
                   'Identifier': [1, 1, 1, 1, 1, 1],
                   'B': [2.4, 3.9, 3.9, 4.3, 2.5, 3.14],
                   'C': [0.214, 0.985, 0.985, 0.839, 0.555, 0.159],
                   'Name': [np.nan, "CD", "AD", np.nan, np.nan, np.nan]})

print(df)

    date        Identifier  B       C       Name
0   2019-05-06  1           2.40    0.214   NaN
1   2019-05-07  1           3.90    0.985   CD
2   2019-05-07  1           3.90    0.985   AD
3   2019-05-09  1           4.30    0.839   NaN
4   2019-05-10  1           2.50    0.555   NaN
5   2019-05-11  1           3.14    0.159   NaN

As can be seen, a given identifier can have more than one name, but each name appears in the DataFrame only once, at a single date. What I need is essentially to forward- and backward-fill the names, so that every name is repeated at every date of its identifier. Currently I have a solution that works, but it is extremely slow on the full DataFrame I am working with. The code is shown below:

final_df = pd.DataFrame()

for i in df.Identifier.unique():
    # select the current identifier
    identifier_df = df.loc[df.Identifier == i]
    # allow a given identifier to have different names
    for n in df.Name.unique():
        if pd.isna(n):
            continue
        else:
            intermediate = identifier_df.copy()
            intermediate.loc[:,"Name"] = np.repeat(n, len(intermediate))
            final_df = final_df.append(intermediate)

final_df = final_df.drop_duplicates()

Note that the loop over identifiers is required for my full DataFrame, even though it seems rather pointless in this small example. Nevertheless, this code produces the following DataFrame, which is exactly the output I would like:

print(final_df)

    date        Identifier  B       C       Name
0   2019-05-06  1           2.40    0.214   CD
1   2019-05-07  1           3.90    0.985   CD
3   2019-05-09  1           4.30    0.839   CD
4   2019-05-10  1           2.50    0.555   CD
5   2019-05-11  1           3.14    0.159   CD
0   2019-05-06  1           2.40    0.214   AD
1   2019-05-07  1           3.90    0.985   AD
3   2019-05-09  1           4.30    0.839   AD
4   2019-05-10  1           2.50    0.555   AD
5   2019-05-11  1           3.14    0.159   AD

Is there any way to perform this operation with a groupby, or is there any other way to make it faster?

Thanks!

Upvotes: 2

Views: 1575

Answers (4)

Menno Van Dijk

Reputation: 903

One way to speed up this code by a significant amount is to append the intermediate DataFrames to a list first, and then concatenate the list of DataFrames in one final step using pd.concat().

This would make the code look as follows:

final_df = []

for i in df.Identifier.unique():
    # select the current identifier
    identifier_df = df.loc[df.Identifier == i]
    # allow a given identifier to have different names
    for n in df.Name.unique():
        if pd.isna(n):
            continue
        else:
            intermediate = identifier_df.copy()
            intermediate.loc[:,"Name"] = np.repeat(n, len(intermediate))
            final_df.append(intermediate)


final_df = pd.concat(final_df).drop_duplicates()

This simple change decreased my execution time by a significant margin: repeatedly calling DataFrame.append copies the growing result on every iteration, whereas appending to a plain Python list is cheap and pd.concat copies the data only once at the end. Hopefully it helps someone else as well.
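
To illustrate the difference, here is a minimal, self-contained timing sketch. The frame big, the helper function names and the row counts are made up purely for illustration and are not part of the question; the slow variant rebuilds the result with pd.concat inside the loop (which, like repeated DataFrame.append, copies the growing frame on every iteration), while the fast variant only appends to a Python list and concatenates once at the end:

import time

import numpy as np
import pandas as pd

# hypothetical larger frame: 2,000 identifiers, 5 rows each,
# one non-null name per identifier (illustrative data only)
big = pd.DataFrame({
    "Identifier": np.arange(2_000).repeat(5),
    "B": np.random.rand(10_000),
    "Name": np.where(np.arange(10_000) % 5 == 0, "CD", None),
})

def rebuild_every_iteration(frame):
    out = frame.iloc[0:0].copy()
    for i in frame.Identifier.unique():
        part = frame.loc[frame.Identifier == i]
        for n in part.Name.dropna().unique():
            tmp = part.copy()
            tmp["Name"] = n
            out = pd.concat([out, tmp])   # copies the growing result every time
    return out.drop_duplicates()

def concat_once(frame):
    parts = []
    for i in frame.Identifier.unique():
        part = frame.loc[frame.Identifier == i]
        for n in part.Name.dropna().unique():
            tmp = part.copy()
            tmp["Name"] = n
            parts.append(tmp)             # cheap Python list append
    return pd.concat(parts).drop_duplicates()

for func in (rebuild_every_iteration, concat_once):
    start = time.perf_counter()
    func(big)
    print(func.__name__, round(time.perf_counter() - start, 2), "s")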

Upvotes: 0

jezrael

Reputation: 863056

Use itertools.product to build all combinations of the 3 columns:

from itertools import product

df1 = pd.DataFrame(list(product(df['date'].unique(), 
                                df['Identifier'].unique(),
                                df['Name'].dropna().unique())), 
                   columns=['date','Identifier','Name'])
print (df1)
         date  Identifier Name
0  2019-05-06           1   CD
1  2019-05-06           1   AD
2  2019-05-07           1   CD
3  2019-05-07           1   AD
4  2019-05-09           1   CD
5  2019-05-09           1   AD
6  2019-05-10           1   CD
7  2019-05-10           1   AD
8  2019-05-11           1   CD
9  2019-05-11           1   AD

Left join by DataFrame.merge and create MultiIndex by DataFrame.set_index:

df2 = df1.merge(df, how='left').set_index(['date','Identifier'])
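
Because no on= argument is passed, merge joins on all three shared columns (date, Identifier, Name), so B and C are populated here only for the rows that already carried that name in df; the remaining rows keep NaN until the combine_first step below. Purely as an illustrative sketch, the same join with the keys written out explicitly would be:

# same join as above, with the key columns spelled out explicitly
df2 = df1.merge(df, on=['date', 'Identifier', 'Name'], how='left')
df2 = df2.set_index(['date', 'Identifier'])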

Use DataFrame.drop_duplicates so that the missing values can then be filled in by DataFrame.combine_first:

df3 = df.drop_duplicates(['date','Identifier']).set_index(['date','Identifier'])
print (df3)
                          B      C Name
date       Identifier                  
2019-05-06 1           2.40  0.214  NaN
2019-05-07 1           3.90  0.985   CD
2019-05-09 1           4.30  0.839  NaN
2019-05-10 1           2.50  0.555  NaN
2019-05-11 1           3.14  0.159  NaN

df4 = df2.combine_first(df3).reset_index()
print (df4)
         date  Identifier     B      C Name
0  2019-05-06           1  2.40  0.214   CD
1  2019-05-06           1  2.40  0.214   AD
2  2019-05-07           1  3.90  0.985   CD
3  2019-05-07           1  3.90  0.985   AD
4  2019-05-09           1  4.30  0.839   CD
5  2019-05-09           1  4.30  0.839   AD
6  2019-05-10           1  2.50  0.555   CD
7  2019-05-10           1  2.50  0.555   AD
8  2019-05-11           1  3.14  0.159   CD
9  2019-05-11           1  3.14  0.159   AD

Upvotes: 1

anky

Reputation: 75090

From what I understand, if the dates are sorted and each date has the same number of rows:

from itertools import islice, cycle
m = df.name.isna()  # boolean mask of rows where name is NaN
l = df.loc[~m, 'name'].tolist()  # list of the non-null names
df.loc[m, 'name'] = list(islice(cycle(l), len(df[m])))  # cycle the names across the NaN rows and assign
print(df)

         date  identifier    B      C name
0  2019-05-07           1  2.4  0.214   AB
1  2019-05-07           1  2.4  0.214   CD
2  2019-05-08           1  3.9  0.985   AB
3  2019-05-08           1  3.9  0.985   CD
4  2019-05-09           1  2.5  0.555   AB
5  2019-05-09           1  2.5  0.555   CD

Upvotes: 2

U13-Forward

Reputation: 71590

Try this one-liner using concat, replace, slicing, and ffill:

print(pd.concat([df[::2],df[::2].replace('AB','CD')]).ffill())

Output:

         date  identifier    B      C name
0  2019-05-07           1  2.4  0.214   AB
2  2019-05-08           1  3.9  0.985   AB
4  2019-05-09           1  2.5  0.555   AB
0  2019-05-07           1  2.4  0.214   CD
2  2019-05-08           1  3.9  0.985   CD
4  2019-05-09           1  2.5  0.555   CD

Upvotes: 0
