Reputation: 35
So the point is to find the Person who entered on 3 consecutive dates. My frame looks like this:
DateEntry Person
1 2018-03-18 A
2 2018-03-19 A
3 2018-03-21 A
4 2018-09-25 B
5 2018-09-26 B
6 2018-09-27 B
The only way I know how to check this is by changing the dates into a list of strings and then checking. It works fine, but, this method is not allowed.
Is there a way to iterate through the rows on a dataframe via pandas to find the answer?
I'm only expected to show the following output. I do not need to save the results within the dataframe. Expected output:
Person A did not enter on 3 consecutive days.
Person B did enter on three consecutive days.
Consecutive days entered by person B:
2018-09-25
2018-09-26
2018-09-27
Upvotes: 1
Views: 214
Reputation: 862406
If datetimes are sorted and expected output is only check if 3 consecutive dates in days use strides in custom function in GroupBy.apply
:
df['DateEntry'] = pd.to_datetime(df['DateEntry'])
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def f(x):
vals = rolling_window(x.to_numpy(), 3)
dif = np.diff(vals, axis=1).astype("timedelta64[D]")==np.array([1], dtype='timedelta64[D]')
#print (dif)
return dif.all(axis=1).any()
s = df.groupby('Person')['DateEntry'].apply(f)
print (s)
Person
A False
B True
Name: DateEntry, dtype: bool
If need also datetimes:
print (df)
DateEntry Person
1 2018-03-18 A
2 2018-03-19 A
3 2018-03-21 A
4 2018-08-25 B
5 2018-08-26 B
6 2018-08-27 B
7 2018-09-25 B
8 2018-09-26 B
9 2018-09-27 B
10 2018-09-30 B
df['DateEntry'] = pd.to_datetime(df['DateEntry'])
df = df.sort_values(['Person','DateEntry'])
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def f(x):
vals = rolling_window(x.to_numpy(), 3)
dif = np.diff(vals, axis=1).astype("timedelta64[D]")==np.array([1], dtype='timedelta64[D]')
return pd.DataFrame(vals[dif.all(axis=1)])
df1 = df.groupby('Person')['DateEntry'].apply(f)
print (df1)
0 1 2
Person
B 0 2018-08-25 2018-08-26 2018-08-27
1 2018-09-25 2018-09-26 2018-09-27
Upvotes: 1
Reputation: 2710
Try the below code:
import pandas as pd
from datetime import datetime
data =[['2018-03-18', 'A'],
['2018-03-19', 'A'],
['2018-03-21', 'A'],
['2018-09-25', 'B'],
['2018-09-26', 'B'],
['2018-09-27', 'B']]
df=pd.DataFrame(data, columns = ['DateEntry', 'Person'])
Person = None
Date = None
count = 0
for index, row in df.iterrows():
if Person:
if Person == row['Person']:
count += 1
else:
Person = row['Person']
Date = None
count = 0
else:
Person = row['Person']
count += 1
if Date:
if (datetime.strptime(row['DateEntry'], "%Y-%m-%d") - datetime.strptime(Date, "%Y-%m-%d")).days == 1:
Date = row['DateEntry']
if count == 3:
print('The consecutive visitor is -- {}'.format(Person))
else:
Person = None
Date = None
count = 0
else:
Date = row['DateEntry']
Upvotes: 0