tlk27

Reputation: 309

pd.Series string replacement modifying values unexpectedly

I assume there is something I am missing in my understanding of string replacement. Your help is appreciated in advance.

I have a data frame similar to below:

combine_cols = pd.DataFrame({'totalannualmembers' : [85494.0, np.NaN, np.NaN], 
              'totalannualmembershipssold' : [np.NaN, 90000.0, np.NaN]})

I want to merge or combine the 2 columns into a single column. To do so, I replace NaN with empty strings and convert the columns to str type. The code below combines the 2 columns.

join_cols = ['totalannualmembers', 'totalannualmembershipssold']

#fill NaN with empty string
combine_cols.fillna('', axis = 1, inplace = True)

#convert columns to str type
combine_cols = combine_cols[join_cols].astype(str)

#combine into a single column
combine_cols['combine_test'] = combine_cols['totalannualmembers'].str.cat(combine_cols['totalannualmembershipssold'],sep="")

My problem arises when I try to convert the combine_cols.combine_test column back to an integer/float type, because of the empty strings. I have tried replacing the empty strings with the following:

combine_cols.combine_test = combine_cols.combine_test.str.replace('', '9999')

While this changes the empty strings, it also changes the values I want to keep. For example, combine_test at index 550 also changes, when the desired output is for it to remain 85494.0.

combine_cols.iloc[550]
---------------------------------------------------------------------------
totalannualmembers                                            85494.0
totalannualmembershipssold                                           
combine_test                  99998999959999499999999949999.999909999
Name: 550, dtype: object

Is there a better way to merge these 2 columns into one or to convert the empty strings so I can convert the column back to a float type for analysis?

The desired output for combine_cols.combine_test would be:

0       85494.0
1       90000.0
2       NaN

Upvotes: 0

Views: 72

Answers (5)

Quang Hoang

Reputation: 150765

You can do this with fillna:

combine_cols['combine_test'] = (combine_cols['totalannualmembers']
                                   .fillna(combine_cols.totalannualmembershipssold)
                               )

Output:

   totalannualmembers  totalannualmembershipssold  combine_test
0             85494.0                         NaN       85494.0
1                 NaN                     90000.0       90000.0
2                 NaN                         NaN           NaN
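
If you prefer a single method call with the same effect, Series.combine_first fills the NaNs of the left Series from the right one; a minimal sketch, assuming the same combine_cols frame as in the question:

import numpy as np
import pandas as pd

combine_cols = pd.DataFrame({'totalannualmembers': [85494.0, np.nan, np.nan],
                             'totalannualmembershipssold': [np.nan, 90000.0, np.nan]})

# combine_first keeps the left Series' values and fills its NaNs from the right Series
combine_cols['combine_test'] = (combine_cols['totalannualmembers']
                                .combine_first(combine_cols['totalannualmembershipssold']))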

Upvotes: 2

Rodrigo P. Dias

Reputation: 61

Try pandas.concat() and see if it works for you. I also removed all missing values using pandas.DataFrame.dropna() and reset the index using pandas.DataFrame.reset_index.

combine_cols = pd.concat([combine_cols['totalannualmembers'],combine_cols['totalannualmembershipssold']], ignore_index=True)
combine_cols.dropna(inplace=True)
combine_cols.reset_index(drop=True, inplace=True)
print(combine_cols)

The output for this should be:

0    85494.0
1    90000.0
dtype: float64

Check these websites for more information:

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

[2] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

[3] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

Upvotes: 0

David Erickson

Reputation: 16683

A more "fun" way to do it :) if you are looking for a one-liner after creating the dataframe.

combine_cols.melt(value_name='combine_test').drop('variable', axis=1).drop_duplicates().reset_index(drop=True)

1) Melt the columns into one combined column, moving the column headers to rows, and name the value column 'combine_test'.

2) Drop the unnecessary 'variable' column that was created where the column headers became rows from melting.

3) Drop the duplicate rows and reset the index to 0, 1, 2, passing drop=True to avoid adding a new column after the reset.
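
For readability, the same chain can also be written out step by step; a sketch, assuming the combine_cols frame from the question:

melted = combine_cols.melt(value_name='combine_test')  # stack both columns into one long column
melted = melted.drop('variable', axis=1)                # drop the column-name labels created by melt
melted = melted.drop_duplicates()                       # duplicate NaNs collapse to one row: 85494.0, NaN, 90000.0
result = melted.reset_index(drop=True)                  # renumber the index 0, 1, 2
print(result)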

Upvotes: 1

IMCoins

Reputation: 3306

Your solution seems too complicated to me for what you wish to do. You may simply want to do something like the following.

Plus, as it doesn't deal with strings at all, it should be faster.

combine_cols = pd.DataFrame({
    'totalannualmembers' : [85494.0, np.NaN, np.NaN], 
    'totalannualmembershipssold' : [np.NaN, 90000.0, np.NaN]
})

combine_cols['test'] = combine_cols.loc[:, 'totalannualmembers']

mask = combine_cols['totalannualmembers'].isna()
combine_cols.loc[mask, 'test'] = combine_cols.loc[mask, 'totalannualmembershipssold']
print(combine_cols)
#    totalannualmembers  totalannualmembershipssold     test
# 0             85494.0                         NaN  85494.0
# 1                 NaN                     90000.0  90000.0
# 2                 NaN                         NaN      NaN

Upvotes: 1

EnriqueBet

Reputation: 1473

I believe the approach you should follow is:

combine_cols.combine_test[combine_cols.combine_test == ""] = "9999"

With combine_cols.combine_test[combine_cols.combine_test == ""] we select only the rows in the combine_test column that match an empty string, and we replace them with the string "9999".
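
Note that chained indexing like this can trigger a SettingWithCopyWarning; a sketch of the same replacement written with .loc, assuming combine_test already holds strings:

combine_cols.loc[combine_cols.combine_test == "", "combine_test"] = "9999"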

Please let me know if this helps :D!

Upvotes: 1
