Reputation: 117
I have a dataframe that looks something like this:
df = pd.DataFrame({'a1': [1,2,3,4],
'des_2': ['a','b','a', 'd'],
'des_4': ['a','c','c', 'd'],
'des_1': ['a','b','c', 'd'],
'des_3': ['a','b','c', 'a'],
'a2': [1,2,3,4],
'a3': [1,2,3,4]
})
And I want to create a new column that shows the number of adjacent repeats of the first value, taking into account only the columns with 'des_', ordered alphabetically.
In the first row, all occurences are equal to a, so the count adds up to 4. In the second row, des_1, des_2, and des_3 are equal to b, so the cound adds up to 3. And so on.
a1 des_2 des_4 des_1 des_3 a2 a3 count
0 1 a a a a 1 1 4
1 2 b c b b 2 2 3
2 3 a c c c 3 3 1
3 4 d d d a 4 4 2
I have a working code, but I feel it's not very pythonic:
cols_list = sorted(df.columns.tolist())
cols_list = list(filter(lambda x: 'des_' in x, cols_list))
new_df = df[cols_list]
lists = new_df.values.tolist()
occ_list = []
for lst in lists:
first_occurrence = lst[0]
counter = 0
for occurrence in lst:
if occurrence == first_occurrence:
counter += 1
else:
break
occ_list.append(counter)
counter = 0
df['count'] = occ_list
Any idea how to reduce it?
Thanks!
Upvotes: 2
Views: 161
Reputation: 153460
Here is a way to do it:
df = pd.DataFrame({'a1': [1,2,3,4],
'des_2': ['a','b','a', 'd'],
'des_4': ['a','c','c', 'd'],
'des_1': ['a','b','c', 'd'],
'des_3': ['a','b','c', 'a'],
'a2': [1,2,3,4],
'a3': [1,2,3,4]
})
df_des = df.filter(like='des_').sort_index(axis=1)
df['count'] = (
(df_des == df_des.shift(1, axis=1).bfill(axis=1))
.cumprod(axis=1)
.sum(axis=1)
)
df
Output:
a1 des_2 des_4 des_1 des_3 a2 a3 count
0 1 a a a a 1 1 4
1 2 b c b b 2 2 3
2 3 a c c c 3 3 1
3 4 d d d a 4 4 2
Explanation:
filter
with like
parameter to
get only columns with 'des_' and sort alphabetically with
sort_index
.shift
with axis=1
use bfill
to backwards fill
NaN.cumprod
to only count the first consecutive repeats, once a
different value is found cumprod will keep zeroes.sum
with axis=1
Upvotes: 2
Reputation: 23146
What is "pythonic" is very subjective but here is one way you can reduce your code:
#function to return the longest streak of the first element
def streak(srs):
groups = (srs!=srs.shift()).cumsum()
return srs.groupby(groups).size().iat[0]
#keep only the columns you need in the correct order
sample = df[[f"des_{i+1}" for i in range(4)]]
df["count"]= sample.apply(streak, axis=1)
>>> df
a1 des_2 des_4 des_1 des_3 a2 a3 count
0 1 a a a a 1 1 4
1 2 b c b b 2 2 3
2 3 a c c c 3 3 1
3 4 d d d a 4 4 2
Upvotes: 2