How to extract the first and last value from a data sequence based on a column value?

Question

I have a time series dataset that can be created with the following code.

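import pandas as pd

idx = pd.date_range("2018-01-01", periods=100, freq="H")
dft = pd.DataFrame({"date": idx})
dft["data"] = ""

# Assign through .loc rather than chained indexing (dft["data"][0:5] = ...),
# which writes to a temporary copy under pandas copy-on-write.
dft.loc[dft.index[0:5], "data"] = "a"
dft.loc[dft.index[5:15], "data"] = "b"
dft.loc[dft.index[15:20], "data"] = "c"
dft.loc[dft.index[20:30], "data"] = "d"
dft.loc[dft.index[30:40], "data"] = "a"
dft.loc[dft.index[40:70], "data"] = "c"
dft.loc[dft.index[70:85], "data"] = "b"
dft.loc[dft.index[85:], "data"] = "c"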

In the data column, the unique values are a, b, c and d. They repeat in consecutive runs, each covering a different time window. For each window, I want to capture its first and last timestamp. How can I do that?

mozway · Accepted Answer

Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:

group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])

output:

                     min                 max
data                                        
1    2018-01-01 00:00:00 2018-01-01 04:00:00
2    2018-01-01 05:00:00 2018-01-01 14:00:00
3    2018-01-01 15:00:00 2018-01-01 19:00:00
4    2018-01-01 20:00:00 2018-01-02 05:00:00
5    2018-01-02 06:00:00 2018-01-02 15:00:00
6    2018-01-02 16:00:00 2018-01-03 21:00:00
7    2018-01-03 22:00:00 2018-01-04 12:00:00
8    2018-01-04 13:00:00 2018-01-05 03:00:00
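
To see why the grouper works, it can help to look at the intermediate steps on a tiny series. This is only a sketch on made-up data; the names s, changed, run_id and steps are illustrative and not part of the question:

```python
import pandas as pd

# Toy column with two separate runs of "a" (illustrative data, not from the question)
s = pd.Series(["a", "a", "b", "b", "b", "a", "c"])

# True wherever a value differs from the one before it; the first row is
# True because shift() leaves a missing value there.
changed = s.ne(s.shift())

# Running total of the change flags: each consecutive run gets its own id,
# so the second run of "a" is not merged with the first.
run_id = changed.cumsum()

steps = pd.DataFrame({"value": s, "changed": changed, "run_id": run_id})
print(steps)
```

Grouping on the values directly (groupby('data')) would instead merge every run of the same letter into one group, which is why the run id is needed.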

Edit: combining with the original data:

dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})

output:

      data                date                    
     first                 min                 max
data                                              
1        a 2018-01-01 00:00:00 2018-01-01 04:00:00
2        b 2018-01-01 05:00:00 2018-01-01 14:00:00
3        c 2018-01-01 15:00:00 2018-01-01 19:00:00
4        d 2018-01-01 20:00:00 2018-01-02 05:00:00
5        a 2018-01-02 06:00:00 2018-01-02 15:00:00
6        c 2018-01-02 16:00:00 2018-01-03 21:00:00
7        b 2018-01-03 22:00:00 2018-01-04 12:00:00
8        c 2018-01-04 13:00:00 2018-01-05 03:00:00
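
If the two-level column header is inconvenient, one possible variant is named aggregation, which gives flat column names. This is a sketch on a small stand-in frame; the names df, windows, value, start and end are assumptions chosen for illustration:

```python
import pandas as pd

# Small stand-in for dft (illustrative data; same column names as the question)
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=6, freq=pd.Timedelta(hours=1)),
    "data": ["a", "a", "b", "b", "b", "a"],
})

# Same run-id grouper as in the answer
group = df["data"].ne(df["data"].shift()).cumsum()

# Named aggregation: output_column=(input_column, function)
windows = (
    df.groupby(group)
      .agg(value=("data", "first"), start=("date", "min"), end=("date", "max"))
      .reset_index(drop=True)  # drop the run id, keep a plain 0..n-1 index
)
print(windows)
```

Note that min/max match the first and last timestamp of a window only when the dates are sorted, as they are here; for unsorted data, 'first' and 'last' would pick by row position instead.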
