Stan
Stan

Reputation: 884

Playing with a Pandas Dataframe with Time Column

I am working on a big pandas dataframe which has a time column ( sorted) and what I want is to drop any dates which are duplicates and within a delta of 1 day (say). Apart from that what I am also looking for is to create another column as label (based on the time) . So to give a simple example ( for simplicity I just kept the time column here:

    df:
          
            Time Column

            2020-12-05
            2020-12-09
            2020-12-13
            2020-12-18
            2020-12-21
            2020-12-21
            2020-12-23
            2020-12-24
            2020-12-31
            2021-01-04

So what I am looking for in an output dataframe which looks like this ( after dropping duplicates and dates within a day and keeping the first occurence only - in the example I had 2020-12-23 & 2020-12-24 and kept date 23 only) and creating a label Column:

       df:
          
            Time Column        Label

            2020-12-05          L1
            2020-12-09          L2
            2020-12-13          L3
            2020-12-18          L4
            2020-12-21          L5
            2020-12-23          L6
            2020-12-31          L7
            2021-01-04          L8

Not getting any idea how to achieve this. Any help in this direction will be immensely appreciated.

Upvotes: 1

Views: 42

Answers (3)

Stan
Stan

Reputation: 884

I did something like this:

  df = pd.DataFrame(dt)
  df['time_column'] = pd.to_datetime(df['time_column'])
  mdf = df.copy()

  mdf = mdf.rename(columns={'time_column': 'date_column'})
  myf = df.diff(periods=1)
  finDF = myf.join(mdf,how='inner')

  mask = ['0 days', '1 days']
  finDF = finDF[~finDF['time_column'].isin(mask)]

  del finDF['time_column']

  finDF=finDF.assign(Label= 
      'L'+finDF['date_column'].notna().cumsum().astype(str))
 finDF

Upvotes: 0

wwnde
wwnde

Reputation: 26676

df['Time_Column']=pd.to_datetime(df['Time_Column'])#Coerce Time_Column to date
df=df[df['Time_Column'].diff(1).dt.days.ne(1)].drop_duplicates()#Remove consecutive dates and duplicates
df=df.assign(Label= 'L'+df['Time_Column'].notna().cumsum().astype(str))#Cumsum the Time_Column
print(df)



   Time_Column Label
0  2020-12-05    L1
1  2020-12-09    L2
2  2020-12-13    L3
3  2020-12-18    L4
4  2020-12-21    L5
6  2020-12-23    L6
8  2020-12-31    L7
9  2021-01-04    L8

Upvotes: 1

Ian Wright
Ian Wright

Reputation: 346

  • create a new column which extracts the date from the time column
  • drop duplicates on this new column
  • remove temporary column
(df.assign(dt_day=lambda x: x['time column'].dt.date)
   .drop_duplicates('dt_day', keep='first')
   .drop(columns=['dt_day']))

Upvotes: 0

Related Questions