NDommeti

Reputation: 61

pandas wide_to_long method based on input rows

I am having an issue deleting nulls. My input dataframe:

name    no     city     tr1_0  tr2_0    tr3_0     tr1_1  tr2_1      tr3_1   tr1_2   tr2_2   tr3_2 
John    11     edi      boa    51        110      cof      52       220   
Rick    12     new      cof    61        100      dcu      61       750   
Mat     t1     nyc

My desired output:

     name    no city  tr1  tr3  tr2   
0    John    11  edi  boa  110   51  
1    John    11  edi  cof  220   52    
2    Rick    12  new  cof  100   61   
3    Rick    12  new  dcu  750   61  
4    Matt    13  wil  nan  nan  nan

I used the code below:

df1 = pd.read_fwf(inputFileName, widths=widths, names=names, dtype=str, index_col=False )

feature_models = [col for col in df1.columns if re.match(r"tr[0-9]_[0-9]", col) is not None]

features = list(set([re.sub(r"_[0-9]", "", feature_model) for feature_model in feature_models]))

df1 = pd.wide_to_long(df1, i=['name', 'no', 'city'], j='ModelID', stubnames=features, sep="_")
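For reference, the same pipeline can be reproduced without the fixed-width file on a small hand-built frame (hypothetical inline data mirroring the sample above; the `widths`/`names` used by `read_fwf` are skipped):

```python
import re
import pandas as pd

# Hypothetical inline frame standing in for the parsed fixed-width file
df1 = pd.DataFrame({
    'name': ['John', 'Rick', 'Mat'],
    'no': ['11', '12', 't1'],
    'city': ['edi', 'new', 'nyc'],
    'tr1_0': ['boa', 'cof', None], 'tr2_0': [51.0, 61.0, None], 'tr3_0': [110.0, 100.0, None],
    'tr1_1': ['cof', 'dcu', None], 'tr2_1': [52.0, 61.0, None], 'tr3_1': [220.0, 750.0, None],
    'tr1_2': [None] * 3, 'tr2_2': [None] * 3, 'tr3_2': [None] * 3,
})

# Collect the suffixed columns and derive the stub names (tr1, tr2, tr3)
feature_models = [c for c in df1.columns if re.match(r"tr[0-9]_[0-9]", c)]
features = sorted({re.sub(r"_[0-9]", "", c) for c in feature_models})

# wide_to_long emits one row per (person, suffix): 3 people x 3 suffixes
# = 9 rows, including the unwanted all-NaN rows described below
out = pd.wide_to_long(df1, i=['name', 'no', 'city'], j='ModelID',
                      stubnames=features, sep="_")
print(out)
```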

My current output is below. Row 2 doesn't make any sense in my use case, so I don't want to generate that row at all. If there is no trailer I only want one row, which is good (row 6). If there are two trailers I only want two rows, but it's giving me three (rows 2 and 5 are extra). I tried using dropna but it's not working. Also, in my case the missing values print as nan, not NaN.

     name    no city  tr1  tr3  tr2
0    John    11  edi  boa  110   51
1    John    11  edi  cof  220   52
2    John    11  edi  nan  nan  nan
3    Rick    12  new  cof  100   61
4    Rick    12  new  dcu  750   61
5    Rick    12  new  nan  nan  nan
6    Matt    13  wil  nan  nan  nan
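To illustrate why dropna alone does not help here (a minimal hypothetical frame with a single trailer column): dropping the all-NaN rows also removes Matt's only row, which I need to keep.

```python
import pandas as pd

# Minimal hypothetical long-format frame with one trailer column
long_df = pd.DataFrame({
    'name': ['John', 'John', 'John', 'Matt'],
    'tr1':  ['boa', 'cof', None, None],
})

# dropna removes every row with a missing tr1 value,
# including Matt's only row
print(long_df.dropna(subset=['tr1']))
```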

Upvotes: 3

Views: 112

Answers (1)

jezrael

Reputation: 863166

You can use this alternative solution with split and stack:

df1 = df1.set_index(['name', 'no', 'city'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(1, dropna=False).reset_index(level=3, drop=True)

mask = df1.index.duplicated() & df1.isnull().all(axis=1)

df1 = df1[~mask].reset_index()
print (df1)
   name  no city  tr1   tr2    tr3
0  John  11  edi  boa  51.0  110.0
1  John  11  edi  cof  52.0  220.0
2  Rick  12  new  cof  61.0  100.0
3  Rick  12  new  dcu  61.0  750.0
4   Mat  t1  nyc  NaN   NaN    NaN
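A self-contained sketch of this approach, using hypothetical inline data in place of the parsed file. Note that the `dropna=` argument of `stack` is deprecated in pandas 2.1+, so the sketch falls back between the new and old spellings:

```python
import pandas as pd

# Hypothetical sample mirroring the question's wide frame
df1 = pd.DataFrame({
    'name': ['John', 'Rick', 'Mat'],
    'no': ['11', '12', 't1'],
    'city': ['edi', 'new', 'nyc'],
    'tr1_0': ['boa', 'cof', None], 'tr2_0': [51.0, 61.0, None], 'tr3_0': [110.0, 100.0, None],
    'tr1_1': ['cof', 'dcu', None], 'tr2_1': [52.0, 61.0, None], 'tr3_1': [220.0, 750.0, None],
})

df1 = df1.set_index(['name', 'no', 'city'])
# Split 'tr1_0' into ('tr1', '0') so the suffix becomes its own column level
df1.columns = df1.columns.str.split('_', expand=True)
# Move the suffix level into the rows, keeping the all-NaN rows
try:
    df1 = df1.stack(1, future_stack=True)   # pandas >= 2.1
except TypeError:
    df1 = df1.stack(1, dropna=False)        # older pandas
df1 = df1.reset_index(level=3, drop=True)

# Keep the first row per (name, no, city); drop later all-NaN repeats
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
print(df1)
```

Only rows that are both all-NaN and a repeat of an index already seen are removed, so Mat keeps exactly one (empty) row.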

With your solution:

df1 = pd.wide_to_long(df1,i=['name', 'no', 'city'],j='ModelID',stubnames=features,sep="_")

To remove NaN rows with duplicated MultiIndex values, it is possible to use filtering by boolean indexing:

#remove counting level
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
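Put together, the `wide_to_long` variant might look like this (again with hypothetical inline data standing in for the parsed file):

```python
import re
import pandas as pd

# Hypothetical sample in the question's wide layout
df1 = pd.DataFrame({
    'name': ['John', 'Rick', 'Mat'],
    'no': ['11', '12', 't1'],
    'city': ['edi', 'new', 'nyc'],
    'tr1_0': ['boa', 'cof', None], 'tr2_0': [51.0, 61.0, None], 'tr3_0': [110.0, 100.0, None],
    'tr1_1': ['cof', 'dcu', None], 'tr2_1': [52.0, 61.0, None], 'tr3_1': [220.0, 750.0, None],
})

# Derive the stub names (tr1, tr2, tr3) from the suffixed columns
features = sorted({re.sub(r"_[0-9]", "", c)
                   for c in df1.columns if re.match(r"tr[0-9]_[0-9]", c)})

df1 = pd.wide_to_long(df1, i=['name', 'no', 'city'], j='ModelID',
                      stubnames=features, sep="_")

# Remove the counting level, then filter duplicated all-NaN rows
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
print(df1)
```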

Details:

Check duplicates by Index.duplicated:

print (df1.index.duplicated())
[False  True False  True False  True]

Then check missing values with DataFrame.isnull and DataFrame.all to test whether all values per row are True:

print (df1.isnull().all(axis=1))
name  no  city
John  11  edi     False
          edi     False
Rick  12  new     False
          new     False
Mat   t1  nyc      True
          nyc      True
dtype: bool

Chain them by & for bitwise AND:

mask = df1.index.duplicated() & df1.isnull().all(axis=1)
print (mask)
name  no  city
John  11  edi     False
          edi     False
Rick  12  new     False
          new     False
Mat   t1  nyc     False
          nyc      True
dtype: bool

Invert the boolean mask by ~:

print (~mask)
name  no  city
John  11  edi      True
          edi      True
Rick  12  new      True
          new      True
Mat   t1  nyc      True
          nyc     False
dtype: bool

Upvotes: 2
