Reputation: 12515
My data:
df
Out[79]:
INC Theme Theme_Hat TRAIN_TEST
0 123 A NaN TRAIN
1 124 A NaN TRAIN
2 125 A NaN TRAIN
3 126 A NaN TRAIN
4 127 A NaN TRAIN
5 128 A NaN TRAIN
6 129 A NaN TRAIN
7 130 A NaN TRAIN
8 131 B NaN TRAIN
9 132 B B TEST
10 133 B A TEST
11 134 B A TEST
12 135 B A TEST
I am trying to collapse the Theme_Hat
column into the Theme
column while maintaining the TRAIN_TEST
indicator. I have used a for
loop below, but my gut tells me there must be some more pandas
-esque solution out there. The attempt below doesn't reach my desired output, as TEST
is continuously duplicated throughout df
instead of the TRAIN
info being preserved. Here's my desired output:
Out[81]:
INC Theme TRAIN_TEST
0 123 A TRAIN
1 124 A TRAIN
2 125 A TRAIN
3 126 A TRAIN
4 127 A TRAIN
5 128 A TRAIN
6 129 A TRAIN
7 130 A TRAIN
8 131 B TRAIN
9 132 B TRAIN
10 132 B TEST
11 133 B TRAIN
12 133 A TEST
13 134 B TRAIN
14 134 A TEST
15 135 B TRAIN
16 135 A TEST
Here's what I've done so far:
# copy so we can reference the original dataframe as rows are inserted into df
df2 = df.copy(deep = True)
no_nulls = df2[df2['Theme_Hat'].notnull()]
# get rid of the Theme_Hat column for final dataframe (since we're migrating that info into Theme)
df.drop('Theme_Hat', inplace = True, axis = 1)
# I'm sure there's some pandas built-in functionality that
# can handle this better than a for loop
for idx in no_nulls.index:
# reference the unchanged df2 for INC, Theme_Hat, and TRAIN_TEST info
new_row = pd.DataFrame({"INC": df2.loc[idx, 'INC'],
"Theme": df2.loc[idx, 'Theme_Hat'],
"TRAIN_TEST": df2.loc[idx, 'TRAIN_TEST']}, index = [idx+1])
print(new_row, '\n\n')
# insert the new row right after the row at the current index
df = pd.concat([df.ix[:idx], new_row, df.ix[idx+1:]]).reset_index(drop = True)
Upvotes: 2
Views: 35
Reputation: 29711
Use pd.lreshape
which automatically drops NaNs
by default. Then, you can combine the two columns under consideration to take their union of values in a single column. Finally, sort these according to the INC
column values.
pd.lreshape(df, {'Theme': ['Theme','Theme_Hat']}).sort_values('INC').reset_index(drop=True)
Upvotes: 2
Reputation: 862721
You can use set_index
with stack
:
print (df.set_index(['INC','TRAIN_TEST'])
.stack()
.reset_index(level=2, drop=True)
.reset_index(name='Theme'))
INC TRAIN_TEST Theme
0 123 TRAIN A
1 124 TRAIN A
2 125 TRAIN A
3 126 TRAIN A
4 127 TRAIN A
5 128 TRAIN A
6 129 TRAIN A
7 130 TRAIN A
8 131 TRAIN B
9 132 TEST B
10 132 TEST B
11 133 TEST B
12 133 TEST A
13 134 TEST B
14 134 TEST A
15 135 TEST B
16 135 TEST A
Solution with melt
, remove column by drop
and NaN
by dropna
:
print (pd.melt(df, id_vars=['INC','TRAIN_TEST'], value_name='Theme')
.drop('variable', axis=1)
.dropna(subset=['Theme']))
INC TRAIN_TEST Theme
0 123 TRAIN A
1 124 TRAIN A
2 125 TRAIN A
3 126 TRAIN A
4 127 TRAIN A
5 128 TRAIN A
6 129 TRAIN A
7 130 TRAIN A
8 131 TRAIN B
9 132 TEST B
10 133 TEST B
11 134 TEST B
12 135 TEST B
22 132 TEST B
23 133 TEST A
24 134 TEST A
25 135 TEST A
Upvotes: 1