U13-Forward

Reputation: 71600

How to string-join one column with another column - pandas

I just came across this question: how do I use the values in one column as the separator for str.join on the values in another column? Here is my DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
>>> df
   a      b
0  a  hello
1  b   good
2  c  great
3  d   nice

I would like each value in the a column to join the characters of the value in the b column on the same row, so my desired output is:

   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

How would I go about that?

To make the relationship clear, here is the first row done in plain Python:

>>> 'a'.join('hello')
'haealalao'

Just like in the desired output.

I think it is useful to know how two columns can interact row by row. join might not be the best example, but the same idea applies to other string methods: for instance, you could split one column on the values of another, or replace the characters found in another column with something else.
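
As an illustration of that idea, here is a minimal sketch that pairs the two columns with zip and applies split and replace row by row. The column names sep, text, parts and dashed are made up for this example and are not part of the question's DataFrame.

```python
import pandas as pd

# Hypothetical frame: each row carries its own separator character.
df = pd.DataFrame({'sep': ['l', 'o', 'e', 'c'],
                   'text': ['hello', 'good', 'great', 'nice']})

# Split each text on the separator from the same row.
df['parts'] = [t.split(s) for s, t in zip(df['sep'], df['text'])]

# Replace every occurrence of the row's separator with a dash.
df['dashed'] = [t.replace(s, '-') for s, t in zip(df['sep'], df['text'])]

print(df)
```

Any str method that takes a second string argument should fit the same pattern, since zip simply hands over one value from each column per row.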

P.S. I have a self-answer below.

Upvotes: 1

Views: 1169

Answers (3)

Mayank Porwal

Reputation: 34086

Here's another solution using zip and a list comprehension. It should be faster than df.apply:

In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]

In [1578]: df
Out[1578]: 
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

Upvotes: 2

Akash Ranjan

Reputation: 1074

I achieved the output using df.apply:

>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0    haealalao
1      gbobobd
2    gcrcecact
3      ndidcde
dtype: object

Timing it for a performance comparison:

import pandas as pd
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))

Note that I reinitialize df before each timing call so that every function starts from the same DataFrame. The same effect can be achieved by passing df to each function as a parameter.
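
A minimal sketch of the parameter-passing variant, assuming helper names (make_df, join_zip, join_apply) that are invented here for illustration. The helpers return the joined strings instead of writing them back, so the frame is left unchanged between runs.

```python
from timeit import timeit

import pandas as pd


def make_df():
    # Build a fresh copy of the example frame.
    return pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                         'b': ['hello', 'good', 'great', 'nice']})


def join_zip(df):
    # Hypothetical helper: returns the result without modifying df.
    return [x.join(y) for x, y in zip(df['a'], df['b'])]


def join_apply(df):
    # Hypothetical helper: the same operation via a row-wise apply.
    return df.apply(lambda row: row['a'].join(row['b']), axis=1).tolist()


df = make_df()
# timeit accepts any zero-argument callable, so a lambda can pass df in.
t_zip = timeit(lambda: join_zip(df), number=5)
t_apply = timeit(lambda: join_apply(df), number=5)
print('zip:', t_zip)
print('apply:', t_apply)
```

Because neither helper assigns to df, every repetition inside timeit sees the same input strings.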

Upvotes: 2

U13-Forward

Reputation: 71600

TL;DR

The code below is the fastest answer I could come up with for this question:

it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]

The above code first creates an iterator over the a column. Each call to next then fetches the next value from a, and the list comprehension uses that value as the separator to join the characters of the matching string in b.
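
To make the mechanics concrete, here is a small sketch that steps the iterator by hand and then checks that the iter/next form gives the same list as pairing the columns with zip. The variable names first, second, via_next and via_zip are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                   'b': ['hello', 'good', 'great', 'nice']})

it = iter(df['a'])   # iterator over the values of column a
first = next(it)     # first value of a
second = next(it)    # second value of a

# Start over with a fresh iterator, because the one above is partly consumed.
it = iter(df['a'])
via_next = [next(it).join(s) for s in df['b']]

# zip advances both columns together, one pair per row.
via_zip = [sep.join(s) for sep, s in zip(df['a'], df['b'])]

print(via_next == via_zip)
```

Both forms rely on the two columns having the same length and row order, which always holds for columns of a single DataFrame.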

Long answer:

Here are my solutions:

Solution 1:

Use a list comprehension and an iterator:

it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)

Solution 2:

Group by the index, then apply str.join to the two column values in each single-row group:

df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)

Solution 3:

Use a list comprehension that iterates over the rows and calls str.join on each pair of values (this relies on the DataFrame having exactly two columns):

df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)

All of these snippets output:

   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

Timing:

Now let's time the solutions with the timeit module. Here is the code used for timing:

import pandas as pd
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]
    
def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
    
def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))

Output:

Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698

So in this measurement the first solution, which uses an iterator, is the quickest.
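
Four rows is a very small input, so it may be worth repeating the comparison on a larger frame. The sketch below is one way to do that, assuming 4,000 rows built by repeating the example data; the names small, big, with_iter and with_tolist are invented for this example, and no timings are claimed here.

```python
from timeit import timeit

import pandas as pd

small = pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                      'b': ['hello', 'good', 'great', 'nice']})
# Assumption: repeating the four rows 1000 times is a reasonable stand-in
# for a larger dataset.
big = pd.concat([small] * 1000, ignore_index=True)


def with_iter(frame):
    # Iterator version; returns the result instead of writing it back.
    it = iter(frame['a'])
    return [next(it).join(s) for s in frame['b']]


def with_tolist(frame):
    # Row-list version, restricted to the two columns it unpacks.
    return [x.join(y) for x, y in frame[['a', 'b']].values.tolist()]


for name, func in [('iter/next', with_iter), ('values.tolist', with_tolist)]:
    print(name, timeit(lambda: func(big), number=5))
```

Since neither function assigns to the frame, each repetition works on the same strings, which keeps the comparison between the two approaches even.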

Upvotes: 3
