Reputation: 61
I am having an issue deleting nulls. My input DataFrame:
name  no  city  tr1_0  tr2_0  tr3_0  tr1_1  tr2_1  tr3_1  tr1_2  tr2_2  tr3_2
John  11  edi   boa    51     110    cof    52     220
Rick  12  new   cof    61     100    dcu    61     750
Mat   t1  nyc
My desired output:
   name  no  city  tr1  tr3  tr2
0  John  11  edi   boa  110   51
1  John  11  edi   cof  220   52
2  Rick  12  new   cof  100   61
3  Rick  12  new   dcu  750   61
4  Matt  13  wil   nan  nan  nan
I used the code below:
df1 = pd.read_fwf(inputFileName, widths=widths, names=names, dtype=str, index_col=False)
feature_models = [col for col in df1.columns if re.match("tr[0-9]_[0-9]", col) is not None]
features = list(set([re.sub("_[0-9]", "", feature_model) for feature_model in feature_models]))
df1 = pd.wide_to_long(df1, i=['name', 'no', 'city'], j='ModelID', stubnames=features, sep="_")
My current output is below. Row 2 doesn't make any sense in my use case, so I don't want to generate that row at all. If there is no trailer, I only want one row, which works (row 6). If there are two trailers, I only want two rows, but it gives me three (rows 2 and 5 are extra). I tried using dropna but it isn't working. Also, in my case it prints nan, not NaN.
   name  no  city  tr1  tr3  tr2
0  John  11  edi   boa  110   51
1  John  11  edi   cof  220   52
2  John  11  edi   nan  nan  nan
3  Rick  12  new   cof  100   61
4  Rick  12  new   dcu  750   61
5  Rick  12  new   nan  nan  nan
6  Matt  13  wil   nan  nan  nan
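To show why a plain dropna over-deletes here, a minimal sketch with made-up long-format data resembling the current output: dropping rows where every trailer column is NaN also removes the only row for a person with no trailers at all.

```python
import numpy as np
import pandas as pd

# Long-format frame resembling the current output (values illustrative).
long = pd.DataFrame({
    'name': ['John', 'John', 'John', 'Rick', 'Rick', 'Rick', 'Matt'],
    'tr1':  ['boa', 'cof', np.nan, 'cof', 'dcu', np.nan, np.nan],
    'tr2':  [51, 52, np.nan, 61, 61, np.nan, np.nan],
})

# Dropping every all-NaN row also deletes Matt's only row --
# which is why a blanket dropna "is not working" for this case.
out = long.dropna(subset=['tr1', 'tr2'], how='all')
print(out)
```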
Upvotes: 3
Views: 112
Reputation: 863166
You can use this alternative solution with split and stack:
df1 = df1.set_index(['name', 'no', 'city'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(1, dropna=False).reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
print (df1)
name no city tr1 tr2 tr3
0 John 11 edi boa 51.0 110.0
1 John 11 edi cof 52.0 220.0
2 Rick 12 new cof 61.0 100.0
3 Rick 12 new dcu 61.0 750.0
4 Mat t1 nyc NaN NaN NaN
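Assembled into a self-contained sketch on toy data (two trailer slots for brevity; the try/except hedges the stack signature, since newer pandas keeps NaN rows by default and rejects the dropna argument):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question's wide layout (values illustrative).
df1 = pd.DataFrame({
    'name': ['John', 'Rick', 'Mat'],
    'no':   ['11', '12', 't1'],
    'city': ['edi', 'new', 'nyc'],
    'tr1_0': ['boa', 'cof', np.nan], 'tr2_0': [51, 61, np.nan], 'tr3_0': [110, 100, np.nan],
    'tr1_1': ['cof', 'dcu', np.nan], 'tr2_1': [52, 61, np.nan], 'tr3_1': [220, 750, np.nan],
})

df1 = df1.set_index(['name', 'no', 'city'])
# Split 'tr1_0' into a (stub, slot) column MultiIndex, then stack the slot level.
df1.columns = df1.columns.str.split('_', expand=True)
try:
    df1 = df1.stack(1, dropna=False)   # pandas < 3.0
except (TypeError, ValueError):
    df1 = df1.stack(1)                 # pandas >= 3.0 keeps NaN rows by default

# Remove the slot level, then drop all-NaN rows that repeat an index entry.
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
print(df1)
```

The key design choice is keeping the first row per person even when it is all-NaN (Mat), while dropping only the duplicated all-NaN rows (the extra slots for John and Rick).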
With your solution:
df1 = pd.wide_to_long(df1,i=['name', 'no', 'city'],j='ModelID',stubnames=features,sep="_")
To remove NaNs with duplicated MultiIndex values, you can filter with boolean indexing:
#remove counting level
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
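The same filter can be run end-to-end after wide_to_long, as a self-contained sketch on toy data (column values illustrative; the stub-name extraction mirrors the question's regex approach):

```python
import re
import numpy as np
import pandas as pd

# Toy frame mirroring the question's wide layout (values illustrative).
df1 = pd.DataFrame({
    'name': ['John', 'Rick', 'Mat'],
    'no':   ['11', '12', 't1'],
    'city': ['edi', 'new', 'nyc'],
    'tr1_0': ['boa', 'cof', np.nan], 'tr2_0': [51, 61, np.nan], 'tr3_0': [110, 100, np.nan],
    'tr1_1': ['cof', 'dcu', np.nan], 'tr2_1': [52, 61, np.nan], 'tr3_1': [220, 750, np.nan],
})

# Recover the stub names (tr1, tr2, tr3) from the suffixed columns.
feature_models = [c for c in df1.columns if re.match(r'tr[0-9]_[0-9]', c)]
features = sorted({re.sub(r'_[0-9]$', '', c) for c in feature_models})

df1 = pd.wide_to_long(df1, i=['name', 'no', 'city'], j='ModelID',
                      stubnames=features, sep='_')

# Remove the counting level, then drop duplicated all-NaN rows.
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
print(df1)
```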
Details:

Check duplicates with Index.duplicated:
print (df1.index.duplicated())
[False True False True False True]
Then check missing values with DataFrame.isnull chained with DataFrame.all, which flags rows where every value is True:
print (df1.isnull().all(axis=1))
name  no  city
John  11  edi     False
          edi     False
Rick  12  new     False
          new     False
Mat   t1  nyc      True
          nyc      True
dtype: bool
Chain both masks with & for bitwise AND:
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
print (mask)
name  no  city
John  11  edi     False
          edi     False
Rick  12  new     False
          new     False
Mat   t1  nyc     False
          nyc      True
dtype: bool
Invert the boolean mask with ~:
print (~mask)
name  no  city
John  11  edi      True
          edi      True
Rick  12  new      True
          new      True
Mat   t1  nyc      True
          nyc     False
dtype: bool
Upvotes: 2