Reputation: 202
DataFrame:
Let me clarify my question. My pandas.DataFrame looks like this:
import pandas as pd
data = [
['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0]
]
df = pd.DataFrame(data, columns=['word1', 'word2', 'word3', 'word4', 'C1', 'C2', 'C3', 'C4', 'C5'])
Output to be generated:
From this, I want to get a data frame which looks like
word1 word2 word3 word4 C1 C2 C3 C4 C5
0 word11 word12 word13 word14 0 0.0 0 0 0
1 word21 word22 word23 word24 0 -3.0 34 0 0
2 word31 word32 word33 word34 0 1.6 0 0 0
3 word41 word42 word43 word44 0 0.0 0 0 0
My Program:
Here is what I did to get the above data frame
primary_columns = ['word1', 'word2', 'word3', 'word4']
transposing_columns = ['C1', 'C2', 'C3', 'C4', 'C5']
transposed_df = df.melt(id_vars=primary_columns, value_vars=transposing_columns)
compare_columns = primary_columns + ['value']
Then I divided the data frame into two based on the value of 'value' column and dropped duplicates.
nonzero_df = transposed_df[transposed_df['value'] != 0]
zero_df = transposed_df[transposed_df['value'] == 0]
zero_df = zero_df.drop_duplicates(subset=compare_columns, keep='first')
df = nonzero_df.append(zero_df)
Which gives me the following output
df = df.reset_index(drop=True)
df
word1 word2 word3 word4 variable value
0 word21 word22 word23 word24 C2 -3.0
1 word31 word32 word33 word34 C2 1.6
2 word21 word22 word23 word24 C3 34.0
3 word11 word12 word13 word14 C1 0.0
4 word21 word22 word23 word24 C1 0.0
5 word31 word32 word33 word34 C1 0.0
6 word41 word42 word43 word44 C1 0.0
Issue:
I don't want to see df.iloc[4] and df.iloc[5].
If the values of word1, word2, word3 and word4 are the same and the rows differ only in the value column, keep the row with the non-zero value and drop the row with the 0 value. I don't care about the value of the variable column.
How can I achieve this?
NOTE:
Word* type columns and more than 115 C* type columns (word* and C* are the column names I used for my example).
Python 2.7 with Pandas 0.17.
Upvotes: 1
Views: 327
Reputation: 150785
IIUC, you want to keep all the 1's in each row. If a row contains only 0's, keep any one of its values:
d = (df.melt(['word1','word2','word3'])
       .sort_values('value', ascending=False)
    )
# keep every non-zero row, plus the first row of each word group
d[~d.duplicated(['word1','word2','word3']) | d['value'].ne(0)]
Output:
word1 word2 word3 variable value
2 word31 word32 word33 C1 1
4 word21 word22 word23 C2 1
7 word21 word22 word23 C3 1
11 word31 word32 word33 C4 1
0 word11 word12 word13 C1 0
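The snippet above was written against 0/1 sample values. For the data currently shown in the question (four word columns, floats and a negative value), sorting by value would no longer put every non-zero row ahead of the zeros. The following is a minimal sketch of one way to adapt the idea, not the answerer's code; the names word_cols, melted, nonzero, first_in_group and group_has_nonzero are illustrative, and it assumes a pandas version where DataFrame.melt is available.

```python
import pandas as pd

data = [
    ['word11', 'word12', 'word13', 'word14', 0, 0, 0, 0, 0],
    ['word21', 'word22', 'word23', 'word24', 0, -3, 34, 0, 0],
    ['word31', 'word32', 'word33', 'word34', 0, 1.6, 0, 0, 0],
    ['word41', 'word42', 'word43', 'word44', 0, 0, 0, 0, 0],
]
word_cols = ['word1', 'word2', 'word3', 'word4']
df = pd.DataFrame(data, columns=word_cols + ['C1', 'C2', 'C3', 'C4', 'C5'])

# Long format: one row per (word group, C* column)
melted = df.melt(id_vars=word_cols)

# True where the melted value is anything other than 0 (handles negatives and floats)
nonzero = melted['value'].ne(0)

# True for the first melted row of each word group
first_in_group = ~melted.duplicated(word_cols)

# True for every row whose word group has at least one non-zero value
group_has_nonzero = (
    melted.assign(nz=nonzero).groupby(word_cols)['nz'].transform('any')
)

# Keep all non-zero rows; keep one zero row only for groups that are all zero
keep = nonzero | (first_in_group & ~group_has_nonzero)
result = melted[keep].reset_index(drop=True)
print(result)
```

With the question's data this keeps five rows: the three non-zero rows and one zero row each for the word11 and word41 groups.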
Upvotes: 1
Reputation: 31011
Actually, your task is not a transposition but something like a stack limited to non-zero values, with one addition: for rows containing all zeroes (in C1 thru C4), you want the output to contain the values of word1 thru word3 together with variable set to C1 and value set to 0.
To do it, compute 2 intermediate variables:
A Series containing the stack of C1 thru C4 columns, with word1 thru word3 moved to the index and the last level of the index renamed to variable:
s = df.set_index(['word1', 'word2', 'word3']).stack().rename('value')
s.index.rename('variable', level=3, inplace=True)
For your input data, the result is:
word1   word2   word3   variable
word11  word12  word13  C1          0
                        C2          0
                        C3          0
                        C4          0
word21  word22  word23  C1          0
                        C2          1
                        C3          1
                        C4          0
word31  word32  word33  C1          1
                        C2          0
                        C3          0
                        C4          1
Name: value, dtype: int64
A DataFrame holding the output rows for input rows containing all zeroes (in C1 thru C4):
dfZer = df[df.loc[:, 'C1':'C4'].sum(axis=1) == 0].loc[:, 'word1':'word3']\
.assign(variable='C1', value=0)
For your data the result is:
word1 word2 word3 variable value
0 word11 word12 word13 C1 0
Then generate the final result as:
pd.concat([s[s > 0].reset_index(), dfZer], sort=False, ignore_index=True)
Note that:
s[s > 0] drops elements with a value of 0,
reset_index() converts it back to a DataFrame,
dfZer provides the result for "all zeroes" input rows,
ignore_index=True "regenerates" the index in the result.
To trace how this solution works, also print s[s > 0].reset_index() and everything should be clear.
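For convenience, the steps above can be run end to end as in the sketch below. The input frame is an assumption: it is reconstructed from the stacked Series printed earlier in this answer (three word columns, C1 thru C4 holding 0/1). The names nonzero_part, all_zero and result are illustrative, and set_names is used in place of the in-place rename.

```python
import pandas as pd

# Sample input reconstructed from the stacked Series shown in this answer (assumed)
df = pd.DataFrame({
    'word1': ['word11', 'word21', 'word31'],
    'word2': ['word12', 'word22', 'word32'],
    'word3': ['word13', 'word23', 'word33'],
    'C1': [0, 0, 1],
    'C2': [0, 1, 0],
    'C3': [0, 1, 0],
    'C4': [0, 0, 1],
})

# Step 1: stack C1..C4 into a Series indexed by the word columns plus 'variable'
s = df.set_index(['word1', 'word2', 'word3']).stack().rename('value')
s.index = s.index.set_names('variable', level=3)

# Step 2: rows whose C1..C4 are all zero get a single placeholder entry
all_zero = df.loc[:, 'C1':'C4'].sum(axis=1) == 0
dfZer = df.loc[all_zero, 'word1':'word3'].assign(variable='C1', value=0)

# Step 3: non-zero stacked entries followed by the placeholder rows
nonzero_part = s[s > 0].reset_index()
result = pd.concat([nonzero_part, dfZer], sort=False, ignore_index=True)
print(result)
```

Note that the all-zero test relies on the values being non-negative; with mixed signs, a row-wise check such as comparing every C* value to 0 would be safer.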
Upvotes: 1