Reputation: 1747
This question is similar to one posted earlier. I want to concatenate three columns instead of two.
Here is the code that combines two columns:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)
df
   bar foo     new combined
0    1   a   apple      a_1
1    2   b  banana      b_2
2    3   c    pear      c_3
I want to combine three columns with this command but it is not working, any idea?
df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
Upvotes: 134
Views: 412736
Reputation: 399
If you want to join many columns in a large Dataframe, the fastest option is to write out a tedious statement:
df['new_col'] = df['col1'] + df['col2'] + ... + df['coln']
Here is a function that writes the statement for you.
def create_eval_statement(df_variable_name, columns, separator="_"):
    # Build one "df['col']" term per column name
    columns_strings = [f"{df_variable_name}['{c}']" for c in columns]
    # Join the terms with " + '<separator>' + "
    return f" + '{separator}' + ".join(columns_strings)

stmt = create_eval_statement("df", ["col1", "col2"])
df["new_col"] = eval(stmt)
This code runs in 4.4 seconds for a DataFrame with 3 million rows and 17 columns.
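To make it clearer what gets passed to eval, here is a small sketch that only builds and prints the statement string. It assumes the column names are interpolated into the template (`['{c}']`) and uses three made-up column names; no DataFrame is needed to inspect the string.

```python
def create_eval_statement(df_variable_name, columns, separator="_"):
    # Build "df['col1'] + '_' + df['col2'] + ..." as a plain string.
    columns_strings = [f"{df_variable_name}['{c}']" for c in columns]
    return f" + '{separator}' + ".join(columns_strings)


# Hypothetical column names, used only for illustration.
stmt = create_eval_statement("df", ["col1", "col2", "col3"])
print(stmt)  # df['col1'] + '_' + df['col2'] + '_' + df['col3']
```

Since eval executes whatever string it is given, this approach is only reasonable when the column names come from a trusted source and contain no quote characters.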
Fast alternative with little code (6.0 seconds):
df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]
I listed and timed several options with the script below:
import timeit
import pandas as pd
# Create dataframe with 17 columns and 3 million rows, all strings
df = pd.DataFrame({chr(i + 65): [chr(i + 97)] * 3_000_000 for i in range(17)})
columns = list(df.columns)
sep = "_"
new_col = "new"
def create_exec_statement(
    df_variable_name="df",
    columns_variable_name="columns",
    new_column_name="new",
    separator="_",
):
    columns_strings = [
        f"{df_variable_name}[{columns_variable_name}[{i}]]"
        for i in range(len(eval(columns_variable_name)))
    ]
    separator = f" + '{separator}' + "
    statement = (
        f'{df_variable_name}["{new_column_name}"] = {separator.join(columns_strings)}'
    )
    return statement

def f1():
    exec(
        create_exec_statement(
            df_variable_name="df",
            columns_variable_name="columns",
            new_column_name=new_col,
            separator=sep,
        )
    )

def f2():
    df[new_col] = df[columns[0]].str.cat(df[columns[1:]], sep=sep)

def f3():
    df[new_col] = df[columns].T.add(sep).sum().str[:-1]

def f4():
    df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]

def f5():
    df[new_col] = df[columns].apply(lambda x: sep.join(x), axis=1)

def f6():
    df[new_col] = df[columns].agg(sep.join, axis=1)

def f7():
    df[new_col] = df[columns].T.agg(sep.join)

if __name__ == "__main__":
    for func in [f1, f2, f3, f4, f5, f6, f7]:
        print(f"{func.__name__}: {timeit.repeat(func, number=1, repeat=3)}")
# Results
# f1: [4.366812400025083, 4.43233589999727, 4.370704000000842]
# f2: [5.970817499997793, 5.898356199992122, 5.80382699999609]
# f3: [5.981191200000467, 5.959296400018502, 5.963758500001859]
# f4: [5.967713599995477, 6.032882600004086, 6.010665400011931]
# f5: [11.023198500013677, 10.792945499997586, 10.91107919998467]
# f6: [10.698224400024628, 10.668694899999537, 10.707435600023018]
# f7: [31.499697799998103, 31.31905089999782, 31.4950811000017]
Upvotes: 1
Reputation: 13622
Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two, '%s_%s'. The following will do the work:
df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)
[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear
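The mismatch between specifiers and values can be reproduced without pandas. The sketch below uses a plain tuple standing in for one row (the values are taken from the question's first row) and also shows one way to build the format string from the number of values, which is an illustration rather than part of the original answer.

```python
values = ('a', 1, 'apple')  # stand-in for one row of the question's DataFrame

# Two specifiers with three values raises a TypeError.
try:
    '%s_%s' % values
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Three specifiers for three values works.
combined = '%s_%s_%s' % values
print(combined)  # a_1_apple

# The format string can also be derived from the number of values.
fmt = '_'.join(['%s'] * len(values))
print(fmt % values)  # a_1_apple
```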
Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.
columns = ['foo', 'bar', 'new']
df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)
[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear
This last one is more convenient, as one can simply change or add column names in the list, so it requires fewer changes.
Upvotes: 2
Reputation: 957
Following up on @Allen's response: if you need to chain this operation with other DataFrame transformations, use assign:
df.assign(
    combined = lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
    )
)
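As a minimal sketch of what such a chain could look like, the example below filters the question's sample DataFrame first and then adds the combined column with assign. The `cols` list and the filter step are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
cols = ['foo', 'bar', 'new']  # assumed list of columns to combine

result = (
    df[df['bar'] > 1]  # some earlier transformation in the chain
    .assign(
        combined=lambda x: x[cols].apply(
            lambda row: "_".join(row.values.astype(str)), axis=1
        )
    )
    .reset_index(drop=True)
)
print(result)
```

Because assign returns a new DataFrame, the original df is left without the combined column.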
Upvotes: 0
Reputation: 7848
You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.
In [17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']
In [18]: df
Out[18]:
   bar foo     new    combined
0    1   a   apple   1_a_apple
1    2   b  banana  2_b_banana
2    3   c    pear    3_c_pear
Upvotes: 116
Reputation: 945
First convert the columns to str, then use .T.agg('_'.join) to concatenate them.
# Initialize columns
cols_concat = ['first_name', 'second_name']
# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')
# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)
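For a version that can be run as is, here is a small sketch with made-up data. The names and the numeric `year` column are hypothetical and only there to show why the conversion to str comes first; unlike the snippet above, the conversion is done on a temporary copy so the original columns keep their dtypes.

```python
import pandas as pd

# Hypothetical sample data; 'year' is numeric on purpose.
df = pd.DataFrame({'first_name': ['Ada', 'Alan'],
                   'year': [1815, 1912]})
cols_concat = ['first_name', 'year']

# Convert to str, transpose so each original row becomes a column,
# then join the values of each column with '_'.
df['new_col'] = df[cols_concat].astype(str).T.agg('_'.join)
print(df)
```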
Upvotes: 7
Reputation: 21
You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):
def concat_cols(df, cols_to_concat, new_col_name, separator):
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df
Sample usage:
test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')
Upvotes: 2
Reputation: 1363
@derchambers I found one more solution:
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def eval_join(df, columns):
    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)
    return eval(to_eval)

# profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms
Upvotes: 2
Reputation: 117
If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do
def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)
This should be faster than apply and takes an arbitrary number of columns to concatenate.
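A short usage sketch, applied to the sample DataFrame from the question (the data and the '_' separator are assumptions for illustration). Note that the function modifies df in place and returns nothing.

```python
import pandas as pd


def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    # Start from the first column, then append each remaining column.
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)


df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# The new column is added to df directly; there is no return value.
concat_columns(df, ['foo', 'bar', 'new'], 'combined', sep='_')
print(df['combined'].tolist())  # ['a_1_apple', 'b_2_banana', 'c_3_pear']
```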
Upvotes: 3
Reputation: 15449
Possibly the fastest solution is to operate in plain Python:
Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)
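Spelled out step by step on the question's sample data, the idea could look like the sketch below. It uses the astype(str) variant because the `bar` column is numeric; the intermediate `rows` variable is only there for readability.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# Convert every cell to str and turn the frame into a list of row lists.
rows = df.values.astype(str).tolist()
print(rows[0])  # ['a', '1', 'apple']

# Join each row in plain Python and wrap the result in a Series
# that keeps the original index.
combined = pd.Series(map('_'.join, rows), index=df.index)
print(combined.tolist())  # ['a_1_apple', 'b_2_banana', 'c_3_pear']
```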
Comparison against @MaxU's answer (using the big data frame, which has both numeric and string columns):
%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparison against @derchambers' answer (using their df data frame, where all columns are strings):
from functools import reduce
def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )
%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 12
Reputation: 954
The answer given by @allen is reasonably generic but can perform poorly on larger DataFrames.
Using reduce does a lot better:
from functools import reduce
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)
# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)
# profile
%timeit df1 = reduce_join(df, list('1234')) # 733 ms
%timeit df2 = apply_join(df, list('1234')) # 8.84 s
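To see what reduce does here on a small scale, the sketch below applies the same folding step to the three-row sample DataFrame from the question (an assumption for illustration; the timings above use the one-million-row frame). Each step is one vectorized Series addition, so the Python-level work grows with the number of columns rather than the number of rows.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# One string Series per column.
slist = [df[c].astype(str) for c in ['foo', 'bar', 'new']]

# reduce folds the list left to right:
# ((foo + '_' + bar) + '_' + new)
combined = reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
print(combined.tolist())  # ['a_1_apple', 'b_2_banana', 'c_3_pear']
```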
Upvotes: 9
Reputation: 201
df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']
Here 'X' stands for any delimiter (e.g. a space) that you want to place between the two merged columns.
Upvotes: 2
Reputation: 2455
Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:
cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
Upvotes: 197
Reputation: 1751
If you have even more columns you want to combine, using the Series method str.cat might be handy:
df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")
Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).
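As a sketch of the case mentioned in parentheses, the example below starts from the numeric `bar` column of the question's sample data, so the first column needs .astype(str) before .str.cat can be used. The column order is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# 'bar' is numeric, so convert it before using the .str accessor;
# the remaining columns are passed as a DataFrame of strings.
df['combined'] = df['bar'].astype(str).str.cat(
    df[['foo', 'new']].astype(str), sep='_'
)
print(df['combined'].tolist())  # ['1_a_apple', '2_b_banana', '3_c_pear']
```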
Upvotes: 29
Reputation: 1588
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)
If you concatenate with a string such as '_', first convert each column you want to combine to str; after that you can concatenate the columns.
Upvotes: 2
Reputation: 210972
Just wanted to make a time comparison for both solutions (for 30K rows DF):
In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
In [2]: big = pd.concat([df] * 10**4, ignore_index=True)
In [3]: big.shape
Out[3]: (30000, 3)
In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop
In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop
A few more options:
In [6]: %timeit big.iloc[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop
In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop
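Outside IPython, where %timeit is not available, a comparison like the first two timings can be sketched with the standard timeit module. The frame below is smaller than the 30K-row one above so that it runs quickly, and the function names are made up for this example; absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
big = pd.concat([df] * 1000, ignore_index=True)  # 3,000 rows


def with_apply():
    # Row-wise formatting through DataFrame.apply
    return big.apply(lambda x: '%s_%s_%s' % (x['bar'], x['foo'], x['new']), axis=1)


def with_concat():
    # Vectorized string concatenation
    return big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']


# Both approaches should produce the same strings.
same = with_apply().tolist() == with_concat().tolist()

t_apply = min(timeit.repeat(with_apply, number=1, repeat=3))
t_concat = min(timeit.repeat(with_concat, number=1, repeat=3))
print(f"apply:  {t_apply:.4f} s")
print(f"concat: {t_concat:.4f} s")
```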
Upvotes: 18
Reputation: 3930
I think you are missing one %s
df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
Upvotes: 8