Reputation: 309
I assume there is something I am missing with my understanding of string replacement. Your help is appreciated in advance.
I have a data frame similar to below:
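import numpy as np
import pandas as pd
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
combine_cols = pd.DataFrame({'totalannualmembers': [85494.0, np.nan, np.nan],
                             'totalannualmembershipssold': [np.nan, 90000.0, np.nan]})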
I want to merge or combine the 2 columns into a single column. To do so, I replace NaN with empty strings and convert the columns to str type. The code below combines the 2 columns.
join_cols = ['totalannualmembers', 'totalannualmembershipssold']
#fill NaN with empty string
combine_cols.fillna('', axis = 1, inplace = True)
#convert columns to str type
combine_cols = combine_cols[join_cols].astype(str)
#combine into a single column
combine_cols['combine_test'] = combine_cols['totalannualmembers'].str.cat(combine_cols['totalannualmembershipssold'],sep="")
My problem arises when I try to convert the combine_cols.combine_test column back to an integer / float type, because the empty strings cannot be converted. I have tried replacing the empty strings with the following:
combine_cols.combine_test = combine_cols.combine_test.str.replace('', '9999')
While this changes the empty strings, it also changes the values I want to keep. For example, combine_test at index 550 is altered as well, when it should remain 85494.0.
combine_cols.iloc[550]
---------------------------------------------------------------------------
totalannualmembers 85494.0
totalannualmembershipssold
combine_test 99998999959999499999999949999.999909999
Name: 550, dtype: object
Is there a better way to merge these 2 columns into one or to convert the empty strings so I can convert the column back to a float type for analysis?
The desired output for combine_cols.combine_test would be:
0 85494.0
1 90000.0
2 NaN
Upvotes: 0
Views: 72
Reputation: 150765
You can do this with fillna:
combine_cols['combine_test'] = (combine_cols['totalannualmembers']
.fillna(combine_cols.totalannualmembershipssold)
)
Output:
   totalannualmembers  totalannualmembershipssold  combine_test
0             85494.0                         NaN       85494.0
1                 NaN                     90000.0       90000.0
2                 NaN                         NaN           NaN
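For reference, here is a self-contained version of the snippet above, assuming the sample frame from the question. The variable `alt` and the use of `Series.combine_first` are additions for illustration and not part of the original answer; `combine_first` is shown only as another way to spell the same fill for two aligned columns.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
combine_cols = pd.DataFrame({
    'totalannualmembers': [85494.0, np.nan, np.nan],
    'totalannualmembershipssold': [np.nan, 90000.0, np.nan],
})

# Take the first column and fill its gaps from the second column
combine_cols['combine_test'] = (
    combine_cols['totalannualmembers']
    .fillna(combine_cols['totalannualmembershipssold'])
)

# Assumption: combine_first yields the same values for this sample
alt = combine_cols['totalannualmembers'].combine_first(
    combine_cols['totalannualmembershipssold']
)

print(combine_cols['combine_test'].tolist())
print(combine_cols['combine_test'].dtype)
```

Because nothing is converted to strings, the column stays float and rows where both inputs are missing remain NaN.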
Upvotes: 2
Reputation: 61
Try pandas.concat() and see if it works for you. I also removed all missing values with dropna() and reset the index with reset_index(). Note that this also drops the row where both columns are NaN.
combine_cols = pd.concat([combine_cols['totalannualmembers'],combine_cols['totalannualmembershipssold']], ignore_index=True)
combine_cols.dropna(inplace=True)
combine_cols.reset_index(drop=True, inplace=True)
print(combine_cols)
The output for this should be:
0 85494.0
1 90000.0
dtype: float64
Check these pages for more information:
[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
[2] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
[3] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
Upvotes: 0
Reputation: 16683
A more "fun" way to do it :) if you are looking for a one-liner after creating the dataframe.
combine_cols.melt(value_name='combine_test').drop('variable', axis=1).drop_duplicates().reset_index(drop=True)
1) Melt the columns into one combined column, moving the column headers into rows, and name the new column 'combine_test'.
2) Drop the unnecessary 'variable' column that melting created to hold the old column headers.
3) Drop the duplicate rows, then reset the index to 0, 1, 2, passing drop=True to avoid adding a new column after the reset.
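The three steps above can be sketched one at a time, assuming the sample frame from the question. The intermediate names `melted`, `values_only`, and `result` are made up here for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
combine_cols = pd.DataFrame({
    'totalannualmembers': [85494.0, np.nan, np.nan],
    'totalannualmembershipssold': [np.nan, 90000.0, np.nan],
})

# 1) Stack both columns into one long column named 'combine_test'
melted = combine_cols.melt(value_name='combine_test')

# 2) Drop the 'variable' column that holds the old column names
values_only = melted.drop('variable', axis=1)

# 3) Remove repeated rows (repeated NaN rows collapse into one) and renumber
result = values_only.drop_duplicates().reset_index(drop=True)

print(result)
```

Two things are worth keeping in mind with this approach: drop_duplicates would also collapse two rows that legitimately hold the same number, and for this sample the single remaining NaN lands at index 1, between the two values, rather than last.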
Upvotes: 1
Reputation: 3306
Your solution seems more complicated than it needs to be. You could simply do the following. Plus, since it does not deal with strings, it should be faster.
import numpy as np
import pandas as pd
combine_cols = pd.DataFrame({
    'totalannualmembers': [85494.0, np.nan, np.nan],
    'totalannualmembershipssold': [np.nan, 90000.0, np.nan]
})
combine_cols['test'] = combine_cols.loc[:, 'totalannualmembers']
mask = combine_cols['totalannualmembers'].isna()
combine_cols.loc[mask, 'test'] = combine_cols.loc[mask, 'totalannualmembershipssold']
print(combine_cols)
# 0 85494.0 NaN 85494.0
# 1 NaN 90000.0 90000.0
# 2 NaN NaN NaN
Upvotes: 1
Reputation: 1473
I believe the approach to follow is:
combine_cols.loc[combine_cols.combine_test == "", "combine_test"] = "9999"
With the boolean mask combine_cols.combine_test == "" we ask the data frame to select only the rows of the combine_test column that exactly match an empty string, and we replace those with the string "9999". Assigning through .loc avoids the chained-assignment pitfalls of indexing the column first.
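As a rough illustration of why the exact comparison behaves differently from replacing the empty pattern, here is a small sketch. The series `s` is a hypothetical stand-in for combine_test after the string concatenation, and the use of pd.to_numeric with errors='coerce' is an assumption that NaN (as in the desired output) is an acceptable placeholder instead of "9999".

```python
import pandas as pd

# Replacing the empty pattern matches at every position in the string
print('85494.0'.replace('', '-'))   # -8-5-4-9-4-.-0-

# Hypothetical column mirroring combine_test after the string concatenation
s = pd.Series(['85494.0', '90000.0', ''], name='combine_test')

# Exact comparison selects only the rows that are entirely empty
empty = s == ''
print(empty.tolist())               # [False, False, True]

# Assumption: NaN is acceptable for the empty rows;
# errors='coerce' turns anything unparseable into NaN
as_float = pd.to_numeric(s, errors='coerce')
print(as_float.tolist())
```

The first line shows the same effect seen in the question: an empty search string matches between every pair of characters, so the replacement text is inserted throughout each value.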
Please let me know if this helps :D!
Upvotes: 1