Transpose Pandas dataframe but keep only nonzero values

Question

DataFrame:

Let me clarify my question. My pandas.DataFrame looks like this

import pandas as pd

data = [
    ['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
    ['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
    ['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
    ['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0]
]

df = pd.DataFrame(data, columns=['word1', 'word2', 'word3', 'word4', 'C1', 'C2', 'C3', 'C4', 'C5'])

Output to be generated:

From this, I want to get a data frame which looks like

    word1   word2   word3   word4  C1   C2  C3  C4  C5
0  word11  word12  word13  word14   0  0.0   0   0   0
1  word21  word22  word23  word24   0 -3.0  34   0   0
2  word31  word32  word33  word34   0  1.6   0   0   0
3  word41  word42  word43  word44   0  0.0   0   0   0

My Program:

Here is what I did to get the above data frame

primary_columns = ['word1', 'word2', 'word3', 'word4']
transposing_columns = ['C1', 'C2', 'C3', 'C4', 'C5']
transposed_df = df.melt(id_vars=primary_columns, value_vars=transposing_columns)
compare_columns = primary_columns + ['value']

Then I split the data frame in two based on the 'value' column and dropped duplicates from the zero-valued part.

nonzero_df = transposed_df[transposed_df['value'] != 0]
zero_df = transposed_df[transposed_df['value'] == 0]
zero_df = zero_df.drop_duplicates(subset=compare_columns, keep='first')
df = pd.concat([nonzero_df, zero_df])

Which gives me the following output

df = df.reset_index(drop=True)
df

    word1   word2   word3   word4 variable  value
0  word21  word22  word23  word24       C2   -3.0
1  word31  word32  word33  word34       C2    1.6
2  word21  word22  word23  word24       C3   34.0
3  word11  word12  word13  word14       C1    0.0
4  word21  word22  word23  word24       C1    0.0
5  word31  word32  word33  word34       C1    0.0
6  word41  word42  word43  word44       C1    0.0

Issue:

I don't want to see df.iloc[4] and df.iloc[5].

If rows share the same values of word1, word2, word3 and word4 and differ only in the value column, keep the rows with a non-zero value and drop the rows with a 0 value. I don't care about the value of the variable column.

How can I achieve this?

NOTE:

My data frame is huge. It contains close to a million rows, more than 15 word*-type columns and more than 115 C*-type columns (word* and C* are the column names I used for my example).
I use Python 2.7 with Pandas 0.17.

Valdi_Bo · Accepted Answer

Actually, your task is not a transposition but something like a stack limited to non-zero values, with one addition: for rows containing all zeroes (in C1 thru C4), you want the output to contain the values of word1 thru word3 and:

variable == 'C1',
value == 0.

To do it, compute 2 intermediate variables:

A Series containing the stack of C1 thru C4 columns, with word1 thru word3 moved to the index and the last level of the index renamed to variable:

s = df.set_index(['word1', 'word2', 'word3']).stack().rename('value')
s.index.rename('variable', level=3, inplace=True)

For your input data, the result is:

word1   word2   word3   variable
word11  word12  word13  C1          0
                        C2          0
                        C3          0
                        C4          0
word21  word22  word23  C1          0
                        C2          1
                        C3          1
                        C4          0
word31  word32  word33  C1          1
                        C2          0
                        C3          0
                        C4          1
Name: value, dtype: int64

Output result for rows containing all zeroes (in C1 thru C4):

# Test each cell for zero, so values of opposite sign cannot cancel out as in a row sum
dfZer = df[(df.loc[:, 'C1':'C4'] == 0).all(axis=1)].loc[:, 'word1':'word3']\
   .assign(variable='C1', value=0)

For your data the result is:

    word1   word2   word3 variable  value
0  word11  word12  word13       C1      0
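
When the C* columns can hold negative numbers (the question's data contains -3), the way "all zeroes" rows are detected matters. The following is a minimal sketch on hypothetical data (the frame `demo` and its values are made up for illustration) comparing a row-sum test with an elementwise test:

```python
import pandas as pd

# Hypothetical frame: in row 1 the values -3 and 3 cancel out in a sum.
demo = pd.DataFrame({
    'word1': ['a', 'b', 'c'],
    'C1': [0, -3, 0],
    'C2': [0, 3, 5],
})
vals = demo[['C1', 'C2']]

# Row-sum test: flags row 1 as "all zeroes" although it has non-zero cells.
by_sum = vals.sum(axis=1) == 0

# Elementwise test: a row qualifies only if every cell equals 0.
by_mask = (vals == 0).all(axis=1)

print(pd.DataFrame({'by_sum': by_sum, 'by_mask': by_mask}))
```

The elementwise test flags only row 0 here, which is why it is the safer choice when signs are mixed.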

Then generate the final result as:

pd.concat([s[s != 0].reset_index(), dfZer], sort=False, ignore_index=True)

Note that:

s[s != 0] drops elements with value of 0 (and, unlike s > 0, keeps negative values such as -3),
reset_index() converts it back to a DataFrame,
dfZer provides the result for "all zeroes" input rows,
ignore_index=True "regenerates" the index in the result.

To trace how this solution works, also print s[s != 0].reset_index() and everything should be clear.
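
The steps above can be put together as a sketch on the exact data from the question (4 word columns, 5 C columns, one negative value). The helper name `stack_nonzero` and the variables `id_cols` and `val_cols` are hypothetical, introduced only for this illustration. The sketch assumes a reasonably recent pandas, not the 0.17 release mentioned in the question.

```python
import pandas as pd

data = [
    ['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
    ['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
    ['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
    ['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0],
]
id_cols = ['word1', 'word2', 'word3', 'word4']
val_cols = ['C1', 'C2', 'C3', 'C4', 'C5']
df = pd.DataFrame(data, columns=id_cols + val_cols)


def stack_nonzero(df, id_cols, val_cols):
    """Hypothetical helper: long format with non-zero values only,
    plus one placeholder row for each all-zero input row."""
    # Move the id columns to the index and stack the value columns.
    s = df.set_index(id_cols)[val_cols].stack().rename('value')
    s.index.names = id_cols + ['variable']

    # Keep every non-zero value, negative ones included.
    nonzero = s[s != 0].reset_index()

    # Rows whose value columns are all 0 get a single (first column, 0) entry.
    all_zero = (df[val_cols] == 0).all(axis=1)
    zero_rows = df.loc[all_zero, id_cols].assign(variable=val_cols[0], value=0)

    return pd.concat([nonzero, zero_rows], sort=False, ignore_index=True)


result = stack_nonzero(df, id_cols, val_cols)
print(result)
```

Note that stacking still materializes every (row, column) pair before filtering, so with about a million rows and more than 115 value columns the intermediate Series may be large. Whether that fits in memory depends on the machine.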
