Reputation: 548
I've got a .csv file that looks a bit like this:
COL_A COL_B COL_C
1 2020-05-26T00:01:01 99999
2 2020-05-26T00:01:02 99999
3 2020-05-26T00:01:03 99999
4 2020-05-26T00:01:04 2.3
5 2020-05-26T00:01:05 2.3
6 2020-05-26T00:01:06 2.3
7 2020-05-26T00:01:07 99999
8 2020-05-26T00:01:08 99999
9 2020-05-26T00:01:09 3.4
10 2020-05-26T00:01:10 3.4
11 2020-05-26T00:01:11 99999
12 2020-05-26T00:01:12 99999
I'd like to be able to identify the longest continuous span of rows where COL_C is < 5, and return that list of rows. The desired output would be something like:
[
[4 2020-05-26T00:01:04 2.3,
5 2020-05-26T00:01:05 2.3,
6 2020-05-26T00:01:06 2.3]
], 3
The way I have approached this in theory is building a list of lists that meet the criteria, and then using max over the lists with len as the key. I've attempted this:
import pandas as pd

def max_c(csv_file):
    row_list = []
    df = pd.read_csv(csv_file)
    for i, row in df.iterrows():
        while row[2] < 5:
            span = [*row]
            row_list.append(span)
    return max(row_list, key=len)
I know enough to know that this isn't the correct syntax for what I'm trying to do (I can even explain why), but I don't know enough to get the desired output.
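For reference, the "list of lists, then max with len as the key" idea described above can be sketched in plain Python with itertools.groupby, which splits the rows into consecutive runs. This is a minimal sketch, not pandas-based; it assumes the file is whitespace-delimited like the sample shown above (a real .csv would use `delimiter=","` instead).

```python
# Plain-Python sketch of the "list of lists, then max with len" approach:
# group consecutive rows by whether COL_C < 5, keep the qualifying runs,
# and return the longest one together with its length.
import csv
from itertools import groupby

def max_c(csv_file):
    with open(csv_file, newline="") as f:
        # Delimiter assumed to match the space-separated sample above.
        rows = list(csv.reader(f, delimiter=" "))[1:]  # skip header row
    # Consecutive runs where COL_C < 5; keep only the True runs.
    runs = [list(g) for ok, g in groupby(rows, key=lambda r: float(r[2]) < 5) if ok]
    longest = max(runs, key=len, default=[])
    return longest, len(longest)
```

On the sample data this returns the three rows with COL_C == 2.3 and the count 3 (ties go to the earliest run, since max keeps the first maximum).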
Upvotes: 0
Views: 50
Reputation: 150785
I'll use cumsum() to identify the blocks and do a groupby:
s = df['COL_C'].lt(5)

# size of each True block (0 for rows where COL_C >= 5)
sizes = s.groupby([s, (~s).cumsum()]).transform('size') * s

# max_size == 0 means all values are >= 5
max_size = sizes.max()
df[sizes == max_size]
Output:
COL_A COL_B COL_C
3 4 2020-05-26T00:01:04 2.3
4 5 2020-05-26T00:01:05 2.3
5 6 2020-05-26T00:01:06 2.3
Details: s is:
0 False
1 False
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 False
Name: COL_C, dtype: bool
If we just do s.cumsum(), then consecutive True rows obviously end up in different groups. If instead we do (~s).cumsum(), we get:
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 5
8 5
9 5
10 6
11 7
Name: COL_C, dtype: int64
Almost there, but each group of True rows now shares its label with the preceding False row. That is why we group by both s and the negated cumsum: the extra s key separates each False row from the True run that follows it.
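Putting the pieces together as a runnable sketch (the DataFrame below just reproduces the question's sample data):

```python
# End-to-end sketch: build the boolean mask, label blocks with the
# negated cumsum, size each block, and keep the largest True block.
import pandas as pd

df = pd.DataFrame({
    "COL_A": range(1, 13),
    "COL_B": [f"2020-05-26T00:01:{i:02d}" for i in range(1, 13)],
    "COL_C": [99999, 99999, 99999, 2.3, 2.3, 2.3,
              99999, 99999, 3.4, 3.4, 99999, 99999],
})

s = df["COL_C"].lt(5)                                  # True where COL_C < 5
blocks = (~s).cumsum()                                 # one label per True run
sizes = s.groupby([s, blocks]).transform("size") * s   # 0 on False rows
longest = df[sizes == sizes.max()]                     # rows of the longest run
```

Here `longest` holds the three rows with COL_C == 2.3, matching the output shown above.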
Upvotes: 1
Reputation: 323326
Similar to Quang's answer: flag the values greater than or equal to 5 to create the sub-groups, then filter those values out, get each group's size with transform('count'), and pick the indices of the group with the max count.
s = df.COL_C.ge(5)
s = df.loc[~s, 'COL_A'].groupby(s.cumsum()).transform('count')
target = df.loc[s[s == s.max()].index]
Out[299]:
COL_A COL_B COL_C
3 4 2020-05-26T00:01:04 2.3
4 5 2020-05-26T00:01:05 2.3
5 6 2020-05-26T00:01:06 2.3
Upvotes: 2