martifapa
martifapa

Reputation: 117

Create new column in DataFrame based on comparison against row values

I have a dataframe that looks something like this:

df = pd.DataFrame({'a1': [1,2,3,4],
               'des_2': ['a','b','a', 'd'],
               'des_4': ['a','c','c', 'd'],
               'des_1': ['a','b','c', 'd'],
               'des_3': ['a','b','c', 'a'],
               'a2': [1,2,3,4],
               'a3': [1,2,3,4]
               })

And I want to create a new column that shows the number of adjacent repeats of the first value, taking into account only the columns with 'des_', ordered alphabetically.

In the first row, all occurences are equal to a, so the count adds up to 4. In the second row, des_1, des_2, and des_3 are equal to b, so the cound adds up to 3. And so on.

   a1  des_2  des_4  des_1  des_3  a2  a3  count
0  1   a      a      a      a      1   1   4
1  2   b      c      b      b      2   2   3
2  3   a      c      c      c      3   3   1
3  4   d      d      d      a      4   4   2

I have a working code, but I feel it's not very pythonic:

cols_list = sorted(df.columns.tolist())
cols_list = list(filter(lambda x: 'des_' in x, cols_list))

new_df = df[cols_list]
lists = new_df.values.tolist()
occ_list = []
for lst in lists:
    first_occurrence = lst[0]
    counter = 0
    for occurrence in lst:
        if occurrence == first_occurrence:
            counter += 1
        else:
            break
    occ_list.append(counter)
    counter = 0

df['count'] = occ_list

Any idea how to reduce it?

Thanks!

Upvotes: 2

Views: 161

Answers (2)

Scott Boston
Scott Boston

Reputation: 153460

Here is a way to do it:

df = pd.DataFrame({'a1': [1,2,3,4],
               'des_2': ['a','b','a', 'd'],
               'des_4': ['a','c','c', 'd'],
               'des_1': ['a','b','c', 'd'],
               'des_3': ['a','b','c', 'a'],
               'a2': [1,2,3,4],
               'a3': [1,2,3,4]
               })

df_des = df.filter(like='des_').sort_index(axis=1)
df['count'] = (
               (df_des == df_des.shift(1, axis=1).bfill(axis=1))
                   .cumprod(axis=1)
                   .sum(axis=1)
              )
df

Output:

   a1 des_2 des_4 des_1 des_3  a2  a3  count
0   1     a     a     a     a   1   1      4
1   2     b     c     b     b   2   2      3
2   3     a     c     c     c   3   3      1
3   4     d     d     d     a   4   4      2

Explanation:

  • First, subset the dataframe using filter with like parameter to get only columns with 'des_' and sort alphabetically with sort_index.
  • Next, compare with shifted dataframe using shift with axis=1 use bfill to backwards fill NaN.
  • Use cumprod to only count the first consecutive repeats, once a different value is found cumprod will keep zeroes.
  • Lastly, sum with axis=1

Upvotes: 2

not_speshal
not_speshal

Reputation: 23146

What is "pythonic" is very subjective but here is one way you can reduce your code:

#function to return the longest streak of the first element
def streak(srs):
    groups = (srs!=srs.shift()).cumsum()
    return srs.groupby(groups).size().iat[0]

#keep only the columns you need in the correct order
sample = df[[f"des_{i+1}" for i in range(4)]]
df["count"]= sample.apply(streak, axis=1)

>>> df
   a1 des_2 des_4 des_1 des_3  a2  a3  count
0   1     a     a     a     a   1   1      4
1   2     b     c     b     b   2   2      3
2   3     a     c     c     c   3   3      1
3   4     d     d     d     a   4   4      2

Upvotes: 2

Related Questions