Reputation: 1780
I have a pandas dataframe with mixed int and str columns. I first want to concatenate columns within the dataframe, and to do that I have to convert an int column to str.
I've tried to do as follows:
mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])
or
mtrx['X.3'] = mtrx['X.3'].astype(str)
but neither works: in both cases I get an error saying "cannot concatenate 'str' and 'int' objects". Concatenating two str columns works perfectly fine.
Upvotes: 168
Views: 552052
Reputation: 91
I realise this is an old question, but since it is the first thing that comes up for df string conversion, IMHO it should be kept up to date.
If you want the actual dtype to be string (rather than object), and/or you need to handle datetime conversion in your df, and/or you have NaN/None in your df, none of the above will work.
You should use:
df.astype('string')
You can compare results on this df:
import pandas as pd
import numpy as np
from datetime import datetime
# Example dataframe
min_index = datetime(2050, 5, 2, 0, 0, 0)
max_index = datetime(2050, 5, 3, 23, 59, 0)
df = pd.DataFrame(data=pd.date_range(start=min_index, end=max_index, freq = "H"), columns=["datetime"])
df["hours"] = df["datetime"].dt.hour
df["day_name"] = df["datetime"].dt.strftime("%A")
df["numeric_cat"] = [np.random.choice([0,1,2]) for a in range(df.shape[0])]
# Add missing values:
df = df.mask(np.random.random(df.shape) < 0.1)
# str
df1 = df.astype(str)  # same problem with apply(str)
df1.isnull().sum().sum()  # returns 0, which is wrong
df1.info()  # reports dtype object
# string
df2 = df.astype('string')
df2.isnull().sum().sum()  # returns the correct number of missing values
df2.info()  # reports dtype string
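A smaller, deterministic version of the same check may be easier to verify by hand. This is only a minimal sketch, assuming a reasonably recent pandas in which the 'string' extension dtype and the nullable 'Int64' dtype are available; the column name `code` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a nullable integer column with one missing entry.
df = pd.DataFrame({"code": pd.array([1, 2, None], dtype="Int64")})

# Convert to the dedicated string dtype; the missing entry stays missing.
as_string = df["code"].astype("string")

n_missing = int(as_string.isna().sum())
non_missing_values = as_string.dropna().tolist()
print(n_missing, non_missing_values)
```

Because nothing is random here, the count of missing values can be compared directly with the input.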
Upvotes: 8
Reputation: 979
There are four ways to convert columns to string
1. astype(str)
df['column_name'] = df['column_name'].astype(str)
2. values.astype(str)
df['column_name'] = df['column_name'].values.astype(str)
3. map(str)
df['column_name'] = df['column_name'].map(str)
4. apply(str)
df['column_name'] = df['column_name'].apply(str)
Let's compare the performance of each approach:
#importing libraries
import numpy as np
import pandas as pd
import time
#creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
#applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time()-time1) + ' seconds')
#applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time()-time2) + ' seconds')
#applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time()-time3) + ' seconds')
#applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time()-time4) + ' seconds')
Output
time taken for astype(str): 5.472359895706177 seconds
time taken for values.astype(str): 6.5844292640686035 seconds
time taken for map(str): 2.3686647415161133 seconds
time taken for apply(str): 2.39758563041687 seconds
If you run this multiple times, the time for each technique may vary.
On average, map(str) and apply(str) take less time than the other two techniques.
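Since a single time.time() measurement is noisy, repeating each measurement and keeping the best run can give a steadier comparison. The following is a rough sketch using the standard-library timeit module on a smaller column; the helper name `time_conversions` and its parameters `n_rows` and `repeats` are hypothetical, and it times only the conversion itself, not the assignment back into a dataframe.

```python
import timeit

import numpy as np
import pandas as pd


def time_conversions(n_rows=100_000, repeats=3):
    """Return the best-of-`repeats` wall time in seconds for each technique."""
    col = pd.Series(np.random.randint(1, 1000, size=n_rows), name="A")
    techniques = {
        "astype(str)": lambda: col.astype(str),
        "values.astype(str)": lambda: col.values.astype(str),
        "map(str)": lambda: col.map(str),
        "apply(str)": lambda: col.apply(str),
    }
    # Take the minimum over several runs to reduce noise from other processes.
    return {
        name: min(timeit.repeat(func, number=1, repeat=repeats))
        for name, func in techniques.items()
    }


timings = time_conversions()
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} seconds")
```

The ranking may differ between pandas and NumPy versions, so it is worth re-running on your own setup before relying on it.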
Upvotes: 21
Reputation: 1239
Just as an additional reference.
All of the above answers work on a data frame column. But if you create or modify a column with a lambda (or a function passed to apply with axis=1), they won't work, because inside the function the value is a plain int rather than a pandas Series. You have to use str(target_attribute) to turn it into a string. See the example below.
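def add_zero_in_prefix(row):
    # row['Hour'] is a scalar int here, not a Series, so use the built-in str()
    if row['Hour'] < 10:
        return '0' + str(row['Hour'])
    # hours of 10 or more need no padding
    return str(row['Hour'])
data['str_hr'] = data.apply(add_zero_in_prefix, axis=1)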
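For this particular zero-padding case, a column-wise alternative avoids the row-by-row function entirely. This is a minimal sketch, assuming the 'Hour' column holds non-negative integers; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with an integer 'Hour' column, as in the row-wise example.
data = pd.DataFrame({"Hour": [3, 9, 10, 23]})

# Convert the whole column to strings, then left-pad each value to width 2.
data["str_hr"] = data["Hour"].astype(str).str.zfill(2)

padded = data["str_hr"].tolist()
print(padded)
```

Here astype(str) works because it is called on the Series, not on a scalar inside a function.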
Upvotes: 0
Reputation: 345
Use the following code:
df.column_name = df.column_name.astype('str')
Upvotes: 16
Reputation: 129048
In [16]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('AB'))
In [17]: df
Out[17]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
In [18]: df.dtypes
Out[18]:
A    int64
B    int64
dtype: object
Convert a series
In [19]: df['A'].apply(str)
Out[19]:
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object
In [20]: df['A'].apply(str)[0]
Out[20]: '0'
Don't forget to assign the result back:
df['A'] = df['A'].apply(str)
Convert the whole frame
In [21]: df.applymap(str)
Out[21]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
In [22]: df.applymap(str).iloc[0,0]
Out[22]: '0'
df = df.applymap(str)
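Note that in pandas 2.1 and later, applymap is deprecated in favour of DataFrame.map. A version-independent alternative for converting the whole frame is astype, which also accepts str directly. The sketch below rebuilds the same frame and assumes nothing beyond pandas and NumPy; the name `df_str` is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list("AB"))

# Cast every column of the frame to strings in one call.
df_str = df.astype(str)

first_cell = df_str.iloc[0, 0]
print(repr(first_cell))
```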
Upvotes: 191
Reputation: 4924
Warning: Both solutions given (astype() and apply()) do not preserve NULL values, in either the nan or the None form.
import pandas as pd
import numpy as np
df = pd.DataFrame([None,'string',np.nan,42], index=[0,1,2,3], columns=['A'])
df1 = df['A'].astype(str)
df2 = df['A'].apply(str)
print(df.isnull())
print(df1.isnull())
print(df2.isnull())
I believe this is fixed by the implementation of to_string()
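One possible workaround, offered as a sketch rather than as the fix this answer alludes to, is to convert first and then restore the missing marker at every position that was null in the original column. The name `preserved` is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([None, "string", np.nan, 42], index=[0, 1, 2, 3], columns=["A"])

# Convert everything to str, then put NaN back wherever the original was null.
preserved = df["A"].astype(str).where(df["A"].notna())

null_flags = preserved.isna().tolist()
print(null_flags)
```

The non-null entries become strings, while isnull() on the result still flags the rows that were None or nan in the input.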
Upvotes: 23
Reputation: 1321
Change data type of DataFrame column:
To int:
df.column_name = df.column_name.astype(np.int64)
To str:
df.column_name = df.column_name.astype(str)
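To tie this back to the concatenation in the question, here is a minimal sketch. The column 'X.3' is the int column named in the question; the str column 'X.1', the output column 'joined', and all values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'X.3' is an int column, 'X.1' is a made-up str column.
mtrx = pd.DataFrame({"X.1": ["a", "b", "c"], "X.3": [1, 2, 3]})

# Convert the int column to str, then concatenate element-wise with '+'.
mtrx["joined"] = mtrx["X.1"] + mtrx["X.3"].astype(str)

joined = mtrx["joined"].tolist()
print(joined)
```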
Upvotes: 132