Rabab Alkhalifa

Reputation: 33

time series sliding window with occurrence counts

I am trying to get a count between two timestamped values:

for example:

time    letter
  1     A
  4     B
  5     C
  9     C
  18    B
  30    A
  30    B

I am dividing the time range into windows of size (1 + 30) / 30, and then I want to know how many A, B and C occurrences fall into each time window of size 1:

timeseries  A  B  C
1           1  0  0
2           0  0  0
...
30          1  1  0

This should give me a table of 30 rows and 3 columns (A, B, C) of occurrence counts.

The problem is that breaking the data down takes too long, because the code iterates through the whole master table for every window to slice the data, even though the data is already sorted:

master = mytable

minimum = master.timestamp.min()
maximum = master.timestamp.max()

window = (minimum + maximum) / maximum

wstart = minimum
wend = minimum + window

concurrent_tasks = []

while wstart <= maximum:
    As = Bs = Cs = 0
    # scans the entire table for every window -- this is the bottleneck
    for _, row in master.iterrows():
        ttime = row.timestamp
        if wstart <= ttime < wend:
            if row.channel == 'A':
                As += 1
            elif row.channel == 'B':
                Bs += 1
            elif row.channel == 'C':
                Cs += 1

    concurrent_tasks.append([wstart, As, Bs, Cs])

    wstart += window
    wend += window
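For reference, the per-window counting above can be done in a single pass with vectorized pandas operations, e.g. `pd.cut` to bin the timestamps and `pd.crosstab` to count channels per bin. This is a sketch on made-up sample data shaped like `master` above, assuming unit-width windows as in the desired output table:

```python
import numpy as np
import pandas as pd

# hypothetical data with the same columns as `master` above
master = pd.DataFrame({'timestamp': [1, 4, 5, 9, 18, 30, 30],
                       'channel': list('ABCCBAB')})

minimum = master.timestamp.min()
maximum = master.timestamp.max()

# unit-width bin edges: [1, 2), [2, 3), ..., [30, 31)
edges = np.arange(minimum, maximum + 2)
bins = pd.cut(master.timestamp, edges, right=False)

# one row per window, one column per channel, counts in the cells;
# reindex keeps empty windows as all-zero rows
counts = pd.crosstab(bins, master.channel).reindex(bins.cat.categories,
                                                   fill_value=0)
```

This replaces the nested loops with a single bucketing pass over the table, so the cost no longer grows with the number of windows.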

Could you help me make this perform better? I want to use a map function, and I want to prevent Python from re-scanning the whole table on every iteration.

This is part of a big-data job and it is taking days to finish.

thank you

Upvotes: 1

Views: 997

Answers (1)

MaxU - stand with Ukraine

Reputation: 210852

There is a faster approach - pd.get_dummies():

In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
      A  B  C
time
1     1  0  0
4     0  1  0
5     0  0  1
9     0  0  1
18    0  1  0
30    1  0  0
30    0  1  0

If you want to "compress" (group) it by time:

In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
      A  B  C
time
1     1  0  0
4     0  1  0
5     0  0  1
9     0  0  1
18    0  1  0
30    1  1  0
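The same grouped table can also be produced in one call with `pd.crosstab`, which cross-tabulates `time` against `letter` directly (a sketch on the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 4, 5, 9, 18, 30, 30],
                   'letter': list('ABCCBAB')})

# counts of each letter at each time -- equivalent to get_dummies + groupby sum
ct = pd.crosstab(df['time'], df['letter'])
```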

or using sklearn.feature_extraction.text.CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)

r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)

Result:

In [143]: r
Out[143]:
    A  B  C
1   1  0  0
4   0  1  0
5   0  0  1
9   0  0  1
18  0  1  0
30  1  1  0

If we want to list all times from 1 to 30:

In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
    A  B  C
1   1  0  0
2   0  0  0
3   0  0  0
4   0  1  0
5   0  0  1
6   0  0  0
7   0  0  0
8   0  0  0
9   0  0  1
10  0  0  0
11  0  0  0
12  0  0  0
13  0  0  0
14  0  0  0
15  0  0  0
16  0  0  0
17  0  0  0
18  0  1  0
19  0  0  0
20  0  0  0
21  0  0  0
22  0  0  0
23  0  0  0
24  0  0  0
25  0  0  0
26  0  0  0
27  0  0  0
28  0  0  0
29  0  0  0
30  1  1  0

or using a pure-Pandas approach:

In [159]: pd.get_dummies(df.set_index('time')['letter']) \
     ...:   .groupby(level=0) \
     ...:   .sum() \
     ...:   .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
     ...:
Out[159]:
      A  B  C
time
1     1  0  0
2     0  0  0
3     0  0  0
4     0  1  0
5     0  0  1
6     0  0  0
7     0  0  0
8     0  0  0
9     0  0  1
10    0  0  0
...  .. .. ..
21    0  0  0
22    0  0  0
23    0  0  0
24    0  0  0
25    0  0  0
26    0  0  0
27    0  0  0
28    0  0  0
29    0  0  0
30    1  1  0

[30 rows x 3 columns]

UPDATE:

Timing:

In [163]: df = pd.concat([df] * 10**4, ignore_index=True)

In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop

In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop

Upvotes: 3
