pi_pascal

Reputation: 202

Transpose Pandas dataframe but keep only nonzero values

DataFrame:

Let me clarify my question. My pandas.DataFrame looks like this

import pandas as pd

data = [
    ['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
    ['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
    ['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
    ['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0]
]

df = pd.DataFrame(data, columns=['word1', 'word2', 'word3', 'word4', 'C1', 'C2', 'C3', 'C4', 'C5'])

Output to be generated:

From this, I want to get a data frame which looks like

    word1   word2   word3   word4  C1   C2  C3  C4  C5
0  word11  word12  word13  word14   0  0.0   0   0   0
1  word21  word22  word23  word24   0 -3.0  34   0   0
2  word31  word32  word33  word34   0  1.6   0   0   0
3  word41  word42  word43  word44   0  0.0   0   0   0

My Program:

Here is what I did to get the above data frame

primary_columns = ['word1', 'word2', 'word3', 'word4']
transposing_columns = ['C1', 'C2', 'C3', 'C4', 'C5']
transposed_df = df.melt(id_vars=primary_columns, value_vars=transposing_columns)
compare_columns = primary_columns + ['value']

Then I divided the data frame into two based on the value of 'value' column and dropped duplicates.

nonzero_df = transposed_df[transposed_df['value'] != 0]
zero_df = transposed_df[transposed_df['value'] == 0]
zero_df = zero_df.drop_duplicates(subset=compare_columns, keep='first')
df = pd.concat([nonzero_df, zero_df])

Which gives me the following output

df = df.reset_index(drop=True)
df

    word1   word2   word3   word4 variable  value
0  word21  word22  word23  word24       C2   -3.0
1  word31  word32  word33  word34       C2    1.6
2  word21  word22  word23  word24       C3   34.0
3  word11  word12  word13  word14       C1    0.0
4  word21  word22  word23  word24       C1    0.0
5  word31  word32  word33  word34       C1    0.0
6  word41  word42  word43  word44       C1    0.0

Issue:

I don't want to see df.iloc[4] and df.iloc[5].

If rows share the same values of word1, word2, word3 and word4 and differ only in the value column, keep the rows with a non-zero value and drop the rows with a 0 value. I don't care about the value of the variable column.

How can I achieve this?

NOTE:

  1. My data frame is huge. It contains close to a million rows, more than 15 word* columns, and more than 115 C* columns (word* and C* are the column names I used for my example).
  2. I use Python 2.7 with Pandas 0.17.

Upvotes: 1

Views: 327

Answers (2)

Quang Hoang

Reputation: 150785

IIUC, you want to keep all the 1's in each row, and if a row contains only 0's, keep any one value:

d = (df.melt(['word1','word2','word3'])
   .sort_values('value', ascending=False)
)

# keep the first row of each key group, plus every non-zero row
d[~d.duplicated(['word1','word2','word3']) | d['value'].ne(0)]

Output:

     word1   word2   word3 variable  value
2   word31  word32  word33       C1      1
4   word21  word22  word23       C2      1
7   word21  word22  word23       C3      1
11  word31  word32  word33       C4      1
0   word11  word12  word13       C1      0
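
If the C* columns hold arbitrary numbers rather than 0/1 flags (the question's sample has -3 and 1.6), sorting by value no longer guarantees that a non-zero row comes first in each group. A minimal sketch of the same idea, assuming a helper column named is_zero (my own name, not from the question) and a stable sort:

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame(
    [
        ['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
        ['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
        ['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
        ['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0],
    ],
    columns=['word1', 'word2', 'word3', 'word4', 'C1', 'C2', 'C3', 'C4', 'C5'],
)
keys = ['word1', 'word2', 'word3', 'word4']

d = pd.melt(df, id_vars=keys)

# Helper flag (hypothetical name): True where the melted value is zero.
d['is_zero'] = d['value'] == 0

# Stable ascending sort: non-zero rows (False) come before zero rows (True),
# so the first row seen for each key group is non-zero whenever one exists.
d = d.sort_values('is_zero', kind='mergesort')

# Keep the first row per key group plus every non-zero row.
result = d[~d.duplicated(keys) | ~d['is_zero']].drop(columns='is_zero')
result = result.reset_index(drop=True)
```

On the question's sample this keeps the three non-zero rows plus a single zero row each for word11 and word41.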

Upvotes: 1

Valdi_Bo

Reputation: 31011

Actually, your task is not a transposition but something like a stack limited to non-zero values, with one addition: for rows containing all zeroes (in C1 thru C4), you want the output to contain the values of word1 thru word3 and:

  • variable == 'C1',
  • value == 0.

To do it, compute 2 intermediate variables:

  1. A Series containing the stack of C1 thru C4 columns, with word1 thru word3 moved to the index and the last level of the index renamed to variable:

    s = df.set_index(['word1', 'word2', 'word3']).stack().rename('value')
    s.index.rename('variable', level=3, inplace=True)
    

    For your input data, the result is:

    word1   word2   word3   variable
    word11  word12  word13  C1          0
                            C2          0
                            C3          0
                            C4          0
    word21  word22  word23  C1          0
                            C2          1
                            C3          1
                            C4          0
    word31  word32  word33  C1          1
                            C2          0
                            C3          0
                            C4          1
    Name: value, dtype: int64
    
  2. Output result for rows containing all zeroes (in C1 thru C4):

    dfZer = df[(df.loc[:, 'C1':'C4'] == 0).all(axis=1)].loc[:, 'word1':'word3']\
       .assign(variable='C1', value=0)
    

    For your data the result is:

        word1   word2   word3 variable  value
    0  word11  word12  word13       C1      0
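
A hedged aside on selecting the "all zeroes" rows: testing the row sum against 0 matches "every value is 0" only when no values are negative. The question's sample contains -3, so a row whose values cancel out would be misclassified by a sum test. A small sketch with made-up data (the frame demo is hypothetical):

```python
import pandas as pd

# Hypothetical values chosen so that the middle row sums to zero
# without being all zeroes.
demo = pd.DataFrame({'C1': [0, 3, 0], 'C2': [0, -3, 1]})

# Sum-based test: flags any row whose values add up to zero.
by_sum = demo.sum(axis=1) == 0

# Element-wise test: flags only rows where every value is zero.
by_all = (demo == 0).all(axis=1)
```

Here by_sum flags the first two rows, while by_all flags only the first.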
    

Then generate the final result as:

pd.concat([s[s != 0].reset_index(), dfZer], sort=False, ignore_index=True)

Note that:

  • s[s != 0] drops elements with a value of 0 (negative values are kept),
  • reset_index() converts it back to a DataFrame,
  • dfZer provides the result for "all zeroes" input rows,
  • ignore_index=True "regenerates" the index in the result.

To trace how this solution works, also print s[s != 0].reset_index() and everything should be clear.
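
For reference, the steps above can be put together into one runnable sketch on the question's own sample (four word* columns, C1 thru C5, float and negative values). The names keys, value_cols, nonzero and zero_rows are mine, and keeping every non-zero value (rather than only positive ones) is an assumption based on the question's wording:

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame(
    [
        ['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
        ['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
        ['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
        ['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0],
    ],
    columns=['word1', 'word2', 'word3', 'word4', 'C1', 'C2', 'C3', 'C4', 'C5'],
)
keys = ['word1', 'word2', 'word3', 'word4']
value_cols = ['C1', 'C2', 'C3', 'C4', 'C5']

# Step 1: long Series indexed by the key columns plus the C* column name.
s = df.set_index(keys)[value_cols].stack().rename('value')
s.index = s.index.set_names('variable', level=len(keys))

# Keep only the non-zero entries and turn them back into a DataFrame.
nonzero = s[s != 0].reset_index()

# Step 2: one placeholder row (first C* column, value 0) per all-zero input row.
all_zero = (df[value_cols] == 0).all(axis=1)
zero_rows = df.loc[all_zero, keys].assign(variable=value_cols[0], value=0.0)

# Final result: non-zero entries followed by the placeholder rows.
result = pd.concat([nonzero, zero_rows], ignore_index=True)
```

With a million rows and 115+ value columns, the stacked Series holds over a hundred million entries before filtering, so it may be worth processing the frame in row chunks if memory is tight.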

Upvotes: 1
