nikki
nikki

Reputation: 105

python split the column values of a dataframe

I am working on python and new to it.I have a dataframe as

Date           Emailable    Lost_Fans   New_Fans    Country
12-10-2020      121134                                 JP
06-11-2020      120859         350         75          JP
18-12-2020      101857        19128       126          JP
29-01-2021      105760         5029      8932          JP
16-02-2021      115437         1279      10956         JP
01-03-2021      115512          517       592          JP    
12-07-2021      137546         18074     40108         JP
11-08-2021      134319         23753     20526         JP
02-09-2021      130156          4355      192          JP
  1. First,I want to find the missing month from the dataframe
  2. to fill the missing month data for lost and new split the last month data value of lost and new into these missing month lost and new column. For example: the april, may june is missing month from the above dataframe and the next data available is for july which contains 18074 in lost and 40108 in new. we need to spilt these values into the missing month(april, may, june) including july.so for lost 18074/4 =4518 to each of the missing month including july. so the final output will be like.

output should be:

Date           Emailable    Lost_Fans   New_Fans    Country

12-10-2020      121134                                 JP
06-11-2020      120859         350         75          JP
18-12-2020      101857        19128       126          JP
29-01-2021      105760         5029      8932          JP
16-02-2021      115437         1279      10956         JP
01-03-2021      115512          517       592          JP 
30-04-2021       0              4518     10027         JP
31-05-2021       0              4518     10027         JP
30-06-2021       0              4518     10027         JP
12-07-2021      137546         4518      10027         JP
11-08-2021      134319         23753     20526         JP
02-09-2021      130156          4355      192          JP

Upvotes: 0

Views: 107

Answers (2)

w-m
w-m

Reputation: 11232

The idea of the code is first to create rows of NaNs for the missing months. Then to fill in the rows by using the following value (bfill). And then divide the result either by 1 if there was no filling involved, or by the number of consecutive NaNs + 1:

# If Date is a string, convert to datetime:
df.Date = pd.to_datetime(df.Date, format="%d-%m-%Y")

# make the month the index 
df.set_index(df.Date.dt.to_period("M"), inplace=True)

# add rows for missing months
df = df.reindex(pd.period_range(df.index[0], df.index[-1], freq='M'))

# group the consecutive NaNs
group_nans = df.Lost_Fans.isna().groupby(df.Lost_Fans.notna().cumsum())

# for each NaN group, broadcast their length & shift it one row lower
consecutive_nans = group_nans.transform("sum").shift(fill_value=0).astype(int)
divisor = consecutive_nans + 1

# fill NaNs backwards, and divide by number of consecutive nans + 1    
df.Lost_Fans = df.Lost_Fans.fillna(method="bfill").astype(int) // divisor
df.New_Fans = df.New_Fans.fillna(method="bfill").astype(int) // divisor

df.Emailable = df.Emailable.fillna(0).astype(int)
df.Country = df.Country.fillna(method="bfill")

The resulting DataFrame:

              Date  Emailable  Lost_Fans  New_Fans Country
2020-10 2020-10-12     121134          0         0      JP
2020-11 2020-11-06     120859        350        75      JP
2020-12 2020-12-18     101857      19128       126      JP
2021-01 2021-01-29     105760       5029      8932      JP
2021-02 2021-02-16     115437       1279     10956      JP
2021-03 2021-03-01     115512        517       592      JP
2021-04        NaT          0       4518     10027      JP
2021-05        NaT          0       4518     10027      JP
2021-06        NaT          0       4518     10027      JP
2021-07 2021-07-12     137546       4518     10027      JP
2021-08 2021-08-11     134319      23753     20526      JP
2021-09 2021-09-02     130156       4355       192      JP

You can replace the missing (NaT) values for the Date column for example like this:

import numpy as np

# replace NaTs with a timestamp from the month-index
# it will use the first day of the month
df.Date = np.where(df.Date.isnull(), df.index.to_timestamp(), df.Date)

Upvotes: 1

Nimrat Atwal
Nimrat Atwal

Reputation: 21

  1. To find the missing months, use the following code:

    
         months = []
         miss_months = []
         for i in range(len(df.Date)):
             if df.Date[i].split('-')[1][0] != '0':
                 months.append(df.Date[i].split('-')[1])
             else:
                 months.append(df.Date[i].split('-')[1][1])
         for j in range(1,13):
             if str(j) in months:
                 pass
             else:
                 miss_months.append(j)
         print(miss_months)
     

Here while extracting the month from the dataframe, I noticed that all the retrieved months were 2 character strings. For example: '07','08','11',etc. So I wrote some code to check if the first character of the retrieved is 0 or not.

  1. To get the final output, use the following code:

    
         def Insert_row(row_number, df, row_value):
             start_upper = 0
             end_upper = row_number
             start_lower = row_number
             end_lower = df.shape[0]
             upper_half = [*range(start_upper, end_upper, 1)]
             lower_half = [*range(start_lower, end_lower, 1)]
             lower_half = [x.__add__(1) for x in lower_half]
             index_ = upper_half + lower_half
             df.index = index_
             df.loc[row_number] = row_value
             df = df.sort_index()
             return df
         row_number = [6,7,8]
         the_date = ['30-04-2021','31-05-2021','30-06-2021']
         for k in range(len(row_number)):
             row_value = [the_date[k],0,18074//4,10027,'JP']
             df = Insert_row(row_number[k], df, row_value)
         print(df)
     

Here, I have created a function Insert_row() to make the row adding easier.

Upvotes: 2

Related Questions