Reputation: 27
I have a large dataframe containing ∼ 100 million observations of stock returns for a given date and an identifier for each specific stock. My problem is that for some stocks at some dates there are missing values in the return column. To set up a small example the data frame looks something like this:
ID | Date | Return |
---|---|---|
10001 | 1970-01-01 | 0.043 |
10001 | 1970-01-02 | 0.023 |
10001 | 1970-01-03 | -0.010 |
10002 | 1970-01-01 | 0.010 |
10002 | 1970-01-02 | NaN |
10002 | 1970-01-03 | -0.034 |
10003 | 1970-01-01 | 0,040 |
10003 | 1970-01-02 | -0.041 |
10003 | 1970-01-03 | 0.009 |
I wish to remove all the stocks containing one or more missing values in the return column, which I do by:
df = df[df['RET'].notna().groupby('ID').transform('all')]
From the data frame example above, this will generate a grouped data frame without stocks having missing values in the return column:
ID | Date | Return |
---|---|---|
10001 | 1970-01-01 | 0.043 |
1970-01-02 | 0.023 | |
1970-01-03 | -0.010 | |
10003 | 1970-01-01 | 0,040 |
1970-01-02 | -0.041 | |
1970-01-03 | 0.009 |
However, I need to convert the data frame back to being ungrouped again. Do any of you have a solution on how to do this? To specify, I want the data frame to look like this:
ID | Date | Return |
---|---|---|
10001 | 1970-01-01 | 0.043 |
10001 | 1970-01-02 | 0.023 |
10001 | 1970-01-03 | -0.010 |
10003 | 1970-01-01 | 0,040 |
10003 | 1970-01-02 | -0.041 |
10003 | 1970-01-03 | 0.009 |
Thank you in advance for any possible solution.
I have already tried using .unstack() and .reset_index() without any luck
Upvotes: 1
Views: 378
Reputation: 27
If anyone should wonder, this could also be fixed by adding ', group_keys=False' to the groupby function:
df = df[df['RET'].notna().groupby('ID', group_keys=False).transform('all')]
Upvotes: 0
Reputation: 195438
You don't need grouping. Simple boolean indexing + .isin()
is enough:
x = df.loc[df['Return'].isna(), 'ID'].unique()
print(df[~df['ID'].isin(x)])
Prints:
ID Date Return
0 10001 1970-01-01 0.043
1 10001 1970-01-02 0.023
2 10001 1970-01-03 -0.010
6 10003 1970-01-01 0,040
7 10003 1970-01-02 -0.041
8 10003 1970-01-03 0.009
Upvotes: 1