Reputation: 105
I am working on python and new to it.I have a dataframe as
Date Emailable Lost_Fans New_Fans Country
12-10-2020 121134 JP
06-11-2020 120859 350 75 JP
18-12-2020 101857 19128 126 JP
29-01-2021 105760 5029 8932 JP
16-02-2021 115437 1279 10956 JP
01-03-2021 115512 517 592 JP
12-07-2021 137546 18074 40108 JP
11-08-2021 134319 23753 20526 JP
02-09-2021 130156 4355 192 JP
output should be:
Date Emailable Lost_Fans New_Fans Country
12-10-2020 121134 JP
06-11-2020 120859 350 75 JP
18-12-2020 101857 19128 126 JP
29-01-2021 105760 5029 8932 JP
16-02-2021 115437 1279 10956 JP
01-03-2021 115512 517 592 JP
30-04-2021 0 4518 10027 JP
31-05-2021 0 4518 10027 JP
30-06-2021 0 4518 10027 JP
12-07-2021 137546 4518 10027 JP
11-08-2021 134319 23753 20526 JP
02-09-2021 130156 4355 192 JP
Upvotes: 0
Views: 107
Reputation: 11232
The idea of the code is first to create rows of NaNs for the missing months. Then to fill in the rows by using the following value (bfill). And then divide the result either by 1 if there was no filling involved, or by the number of consecutive NaNs + 1:
# If Date is a string, convert to datetime:
df.Date = pd.to_datetime(df.Date, format="%d-%m-%Y")
# make the month the index
df.set_index(df.Date.dt.to_period("M"), inplace=True)
# add rows for missing months
df = df.reindex(pd.period_range(df.index[0], df.index[-1], freq='M'))
# group the consecutive NaNs
group_nans = df.Lost_Fans.isna().groupby(df.Lost_Fans.notna().cumsum())
# for each NaN group, broadcast their length & shift it one row lower
consecutive_nans = group_nans.transform("sum").shift(fill_value=0).astype(int)
divisor = consecutive_nans + 1
# fill NaNs backwards, and divide by number of consecutive nans + 1
df.Lost_Fans = df.Lost_Fans.fillna(method="bfill").astype(int) // divisor
df.New_Fans = df.New_Fans.fillna(method="bfill").astype(int) // divisor
df.Emailable = df.Emailable.fillna(0).astype(int)
df.Country = df.Country.fillna(method="bfill")
The resulting DataFrame:
Date Emailable Lost_Fans New_Fans Country
2020-10 2020-10-12 121134 0 0 JP
2020-11 2020-11-06 120859 350 75 JP
2020-12 2020-12-18 101857 19128 126 JP
2021-01 2021-01-29 105760 5029 8932 JP
2021-02 2021-02-16 115437 1279 10956 JP
2021-03 2021-03-01 115512 517 592 JP
2021-04 NaT 0 4518 10027 JP
2021-05 NaT 0 4518 10027 JP
2021-06 NaT 0 4518 10027 JP
2021-07 2021-07-12 137546 4518 10027 JP
2021-08 2021-08-11 134319 23753 20526 JP
2021-09 2021-09-02 130156 4355 192 JP
You can replace the missing (NaT) values for the Date
column for example like this:
import numpy as np
# replace NaTs with a timestamp from the month-index
# it will use the first day of the month
df.Date = np.where(df.Date.isnull(), df.index.to_timestamp(), df.Date)
Upvotes: 1
Reputation: 21
To find the missing months, use the following code:
months = []
miss_months = []
for i in range(len(df.Date)):
if df.Date[i].split('-')[1][0] != '0':
months.append(df.Date[i].split('-')[1])
else:
months.append(df.Date[i].split('-')[1][1])
for j in range(1,13):
if str(j) in months:
pass
else:
miss_months.append(j)
print(miss_months)
Here while extracting the month from the dataframe, I noticed that all the retrieved months were 2 character strings. For example: '07','08','11',etc. So I wrote some code to check if the first character of the retrieved is 0 or not.
To get the final output, use the following code:
def Insert_row(row_number, df, row_value):
start_upper = 0
end_upper = row_number
start_lower = row_number
end_lower = df.shape[0]
upper_half = [*range(start_upper, end_upper, 1)]
lower_half = [*range(start_lower, end_lower, 1)]
lower_half = [x.__add__(1) for x in lower_half]
index_ = upper_half + lower_half
df.index = index_
df.loc[row_number] = row_value
df = df.sort_index()
return df
row_number = [6,7,8]
the_date = ['30-04-2021','31-05-2021','30-06-2021']
for k in range(len(row_number)):
row_value = [the_date[k],0,18074//4,10027,'JP']
df = Insert_row(row_number[k], df, row_value)
print(df)
Here, I have created a function Insert_row() to make the row adding easier.
Upvotes: 2