Souvik Ray

Reputation: 3018

How to get a count of a certain entity per minute in a given dataframe in pandas?

I have a DataFrame built from data that looks like the following

import datetime

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), u'12.100.90.10', u'100.100.12.1', u'LT_DOWN'),
       (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
       (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
       (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), u'12.100.90.10', u'100.100.12.1', u'PL_DOWN'),
       (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), u'10.100.80.10', u'10.55.10.1', u'PL_LOW'),
       (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), u'10.10.80.10', u'10.55.10.1', u'MA_HIGH'),
       (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), u'10.10.80.10', u'10.55.10.1', u'PL_DOWN')]

As you can see, data gets logged every minute; I have presented only part of the complete data here.

This is how it looks after loading it into pandas

                        date         start           end     type
0 2021-02-10 07:49:07.118658  12.100.90.10  100.100.12.1  LT_DOWN
1 2021-02-10 07:49:14.312273  12.100.90.10  100.100.12.1    LT_UP
2 2021-02-10 07:49:21.535932  12.100.90.10  100.100.12.1    LT_UP
3 2021-02-10 07:50:28.725961  12.100.90.10  100.100.12.1  PL_DOWN
4 2021-02-10 07:50:32.450853  10.100.80.10    10.55.10.1   PL_LOW
5 2021-02-10 07:51:32.450853   10.10.80.10    10.55.10.1  MA_HIGH
6 2021-02-10 07:52:34.264042   10.10.80.10    10.55.10.1  PL_DOWN

Now, there are two things I want.

First, get the count of each value in type per minute, where only the part of each value before the underscore should be considered for the count. So my final output should be something like this

              date LT PL  MA
0 2021-02-10 07:49 3  0   0
1 2021-02-10 07:50 0  2   0
2 2021-02-10 07:51 0  0   1
3 2021-02-10 07:52 0  1   0

Second, convert any non-zero count to True and any zero to False. So the final output would look like this

              date LT    PL     MA
0 2021-02-10 07:49 True  False  False
1 2021-02-10 07:50 False True   False
2 2021-02-10 07:51 False False  True
3 2021-02-10 07:52 False True   False

Now, regarding the first part: I can find the count of a given value (matching on the part before the underscore) over the entire DataFrame by doing something like this

import pandas as pd

df = pd.DataFrame(data)
df.columns = ["date", "start", "end", "type"]
df = df.drop_duplicates()
issue = "LT"
df_check = df[df['type'].str.contains(issue)].groupby(['start', 'end']).size().to_frame('icount').reset_index()

But how do I do this on a per-minute basis and map the counts to True and False?

Upvotes: 1

Views: 172

Answers (1)

Sayandip Dutta

Reputation: 15872

In the data you provided, the type of the last item was MA_DOWN, but in the dataframe and the results it is PL_DOWN; I think that was an error on your part, so I have gone with the dataframe version here.

You can use pandas.DataFrame.resample with pandas.get_dummies:

>>> pd.get_dummies(df['type'].str.extract(r'(\w+)_', expand=False)
                   ).set_index(df['date']).resample('1T').sum()
                     LT  MA  PL
date                           
2021-02-10 07:49:00   3   0   0
2021-02-10 07:50:00   0   0   2
2021-02-10 07:51:00   0   1   0
2021-02-10 07:52:00   0   0   1

Then astype(bool):

>>> pd.get_dummies(df['type'].str.extract(r'(\w+)_', expand=False)
                   ).set_index(df['date']).resample('1T').sum().astype(bool)

                        LT     MA     PL
date                                    
2021-02-10 07:49:00   True  False  False
2021-02-10 07:50:00  False  False   True
2021-02-10 07:51:00  False   True  False
2021-02-10 07:52:00  False  False   True
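For completeness, the two steps above can be combined into one self-contained snippet (a sketch using the question's sample data; variable names `counts` and `flags` are illustrative):

```python
import datetime
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]

df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

# One-hot encode the prefix before '_', index by timestamp,
# then count occurrences per minute.
counts = (pd.get_dummies(df['type'].str.extract(r'(\w+)_', expand=False))
            .set_index(df['date'])
            .resample('1min')
            .sum())

# Non-zero counts become True, zeros become False.
flags = counts.astype(bool)
```

Note that `'1min'` is used instead of `'1T'` here; both denote a one-minute frequency, but the `'T'` alias is deprecated in recent pandas versions.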

According to your comments, you can use pd.crosstab:

>>> pd.crosstab(
       index=df['date'].dt.floor('1min'), 
       columns=[
           df['start'].add('-').add(df['end']).rename('start-end'), 
           df['type'].str.extract(r'(\w+)_', expand=False)
       ], 
       dropna=False
    )

start-end           10.10.80.10-10.55.10.1       10.100.80.10-10.55.10.1       12.100.90.10-100.100.12.1      
type                                    LT MA PL                      LT MA PL                        LT MA PL
date                                                                                                          
2021-02-10 07:49:00                      0  0  0                       0  0  0                         3  0  0
2021-02-10 07:50:00                      0  0  0                       0  0  1                         0  0  1
2021-02-10 07:51:00                      0  1  0                       0  0  0                         0  0  0
2021-02-10 07:52:00                      0  0  1                       0  0  0                         0  0  0

Or,

>>> pd.crosstab(
       index=df['date'].dt.floor('1min'), 
       columns=[
           df['start'], 
           df['end'], 
           df['type'].str.extract(r'(\w+)_', expand=False)
      ], 
      dropna=False
    )
 
start               10.10.80.10                          10.100.80.10                          12.100.90.10                         
end                  10.55.10.1       100.100.12.1         10.55.10.1       100.100.12.1         10.55.10.1       100.100.12.1      
type                         LT MA PL           LT MA PL           LT MA PL           LT MA PL           LT MA PL           LT MA PL
date                                                                                                                                
2021-02-10 07:49:00           0  0  0            0  0  0            0  0  0            0  0  0            0  0  0            3  0  0
2021-02-10 07:50:00           0  0  0            0  0  0            0  0  1            0  0  0            0  0  0            0  0  1
2021-02-10 07:51:00           0  1  0            0  0  0            0  0  0            0  0  0            0  0  0            0  0  0
2021-02-10 07:52:00           0  0  1            0  0  0            0  0  0            0  0  0            0  0  0            0  0  0
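If the per-pair breakdown also needs the True/False form from the second requirement, the crosstab counts can be thresholded with gt(0) in the same way (a sketch reusing the question's sample data; `ct` and `flags` are illustrative names):

```python
import datetime
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]

df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

# Counts per minute, broken down by (start, end, type-prefix).
ct = pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[df['start'], df['end'],
             df['type'].str.extract(r'(\w+)_', expand=False)],
    dropna=False)

# Threshold the counts: any non-zero count becomes True.
flags = ct.gt(0)
```

The columns form a three-level MultiIndex, so individual cells can be addressed with tuples such as `('12.100.90.10', '100.100.12.1', 'LT')`.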

Upvotes: 1
