NamAshena
NamAshena

Reputation: 1747

How to concatenate multiple column values into a single column in Pandas dataframe

This question is same to this posted earlier. I want to concatenate three columns instead of concatenating two columns:

Here is the combining two columns:

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)

df
    bar foo new combined
0   1   a   apple   a_1
1   2   b   banana  b_2
2   3   c   pear    c_3

I want to combine three columns with this command but it is not working, any idea?

df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

Upvotes: 134

Views: 412736

Answers (16)

3UqU57GnaX
3UqU57GnaX

Reputation: 399

If you want to join many columns in a large Dataframe, the fastest option is to write out a tedious statement:

df['new_col'] = df['col1'] + df['col2'] + ... + df['coln']

Here is a function that writes the statement for you.

def create_eval_statement(df_variable_name, columns, separator="_"):
    columns_strings = [f"{df_variable_name}['c']" for c in columns]
    return f" + '{separator}' + ".join(columns_strings)


stmt = create_eval_statement("df", ["col1", "col2"])
df["new_col"] = eval(stmt)

Code runs in 4.4 seconds for a Dataframe with 3 million rows and 17 columns.

Fast alternative with little code (6.0 seconds):

  • df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]

I listed and timed several options with the script below:

import timeit

import pandas as pd

# Create dataframe with 17 columns and 3 million rows, all strings
df = pd.DataFrame({chr(i + 65): [chr(i + 97)] * 3_000_000 for i in range(17)})

columns = list(df.columns)
sep = "_"
new_col = "new"


def create_exec_statement(
    df_variable_name="df",
    columns_variable_name="columns",
    new_column_name="new",
    separator="_",
):
    columns_strings = [
        f"{df_variable_name}[{columns_variable_name}[{i}]]"
        for i in range(len(eval(columns_variable_name)))
    ]
    separator = f" + '{separator}' + "
    statement = (
        f'{df_variable_name}["{new_column_name}"] = {separator.join(columns_strings)}'
    )
    return statement


def f1():
    exec(
        create_exec_statement(
            df_variable_name="df",
            columns_variable_name="columns",
            new_column_name=new_col,
            separator=sep,
        )
    )


def f2():
    df[new_col] = df[columns[0]].str.cat(df[columns[1:]], sep=sep)


def f3():
    df[new_col] = df[columns].T.add(sep).sum().str[:-1]


def f4():
    df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]


def f5():
    df[new_col] = df[columns].apply(lambda x: sep.join(x), axis=1)


def f6():
    df[new_col] = df[columns].agg(sep.join, axis=1)


def f7():
    df[new_col] = df[columns].T.agg(sep.join)


if __name__ == "__main__":
    for func in [f1, f2, f3, f4, f5, f6, f7]:
        print(f"{func.__name__}: {timeit.repeat(func, number=1, repeat=3)}")

# Results
# f1: [4.366812400025083, 4.43233589999727, 4.370704000000842]
# f2: [5.970817499997793, 5.898356199992122, 5.80382699999609]
# f3: [5.981191200000467, 5.959296400018502, 5.963758500001859]
# f4: [5.967713599995477, 6.032882600004086, 6.010665400011931]
# f5: [11.023198500013677, 10.792945499997586, 10.91107919998467]
# f6: [10.698224400024628, 10.668694899999537, 10.707435600023018]
# f7: [31.499697799998103, 31.31905089999782, 31.4950811000017]

Upvotes: 1

Gonçalo Peres
Gonçalo Peres

Reputation: 13622

Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two '%s_%s'. The following will do the work

df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.

columns = ['foo', 'bar', 'new']

df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

This last one is more convenient, as one can simply change or add the column names in the list - it will require less changes.

Upvotes: 2

Antiez
Antiez

Reputation: 957

following to @Allen response
If you need to chain such operation with other dataframe transformation, use assign:

df.assign(
    combined = lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
  )
)

Upvotes: 0

shivsn
shivsn

Reputation: 7848

You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.

In[17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']

In[17]:df
Out[18]: 
   bar foo     new    combined
0    1   a   apple   1_a_apple
1    2   b  banana  2_b_banana
2    3   c    pear    3_c_pear

Upvotes: 116

Jane Kathambi
Jane Kathambi

Reputation: 945

First convert the columns to str. Then use the .T.agg('_'.join) function to concatenate them. More info can be gotten here

# Initialize columns
cols_concat = ['first_name', 'second_name']

# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')

# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)

Upvotes: 7

StephenOK
StephenOK

Reputation: 21

You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):

def concat_cols(df, cols_to_concat, new_col_name, separator):  
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df

Sample usage:

test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')

Upvotes: 2

Grzegorz
Grzegorz

Reputation: 1363

@derchambers I found one more solution:

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'

def eval_join(df, columns):

    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)

    return eval(to_eval)


#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms

Upvotes: 2

Daniil Balabanov
Daniil Balabanov

Reputation: 117

If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do

def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)

This should be faster than apply and takes an arbitrary number of columns to concatenate.

Upvotes: 3

krassowski
krassowski

Reputation: 15449

Possibly the fastest solution is to operate in plain Python:

Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)

Comparison against @MaxU answer (using the big data frame which has both numeric and string columns):

%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparison against @derchambers answer (using their df data frame where all columns are strings):

from functools import reduce

def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )

%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 12

derchambers
derchambers

Reputation: 954

The answer given by @allen is reasonably generic but can lack in performance for larger dataframes:

Reduce does a lot better:

from functools import reduce

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'


def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])


def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)

# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)

# profile
%timeit df1 = reduce_join(df, list('1234'))  # 733 ms
%timeit df2 = apply_join(df, list('1234'))   # 8.84 s

Upvotes: 9

Nipun Kumar Goel
Nipun Kumar Goel

Reputation: 201

df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']

X= x is any delimiter (eg: space) by which you want to separate two merged column.

Upvotes: 2

Allen
Allen

Reputation: 2455

Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:

cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

Upvotes: 197

cbrnr
cbrnr

Reputation: 1751

If you have even more columns you want to combine, using the Series method str.cat might be handy:

df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")

Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).

Upvotes: 29

Manivannan Murugavel
Manivannan Murugavel

Reputation: 1588

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)

If you concatenate with string('_') please you convert the column to string which you want and after you can concatenate the dataframe.

Upvotes: 2

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210972

Just wanted to make a time comparison for both solutions (for 30K rows DF):

In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

In [2]: big = pd.concat([df] * 10**4, ignore_index=True)

In [3]: big.shape
Out[3]: (30000, 3)

In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop

In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop

a few more options:

In [6]: %timeit big.ix[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop

In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop

Upvotes: 18

milos.ai
milos.ai

Reputation: 3930

I think you are missing one %s

df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

Upvotes: 8

Related Questions