Reputation: 71600
I just came across this question: how do I use str.join with the values of one column as the separator for the values of another column? Here is my DataFrame:
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
>>> df
a b
0 a hello
1 b good
2 c great
3 d nice
I would like each value in the a column to join the characters of the matching value in the b column, so my desired output is:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
How would I go about that?
Hopefully the pattern is clear. Here is the first row done in plain Python:
>>> 'a'.join('hello')
'haealalao'
>>>
Just like in the desired output.
I think it is useful to know how two columns can interact row by row. join might not be the best example, but the same idea applies to other string methods: for instance, using split to split one column on the values of the other, or using replace to swap characters found in the other column for something else.
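The same row-wise pairing carries over to those other string methods. Below is a minimal sketch, assuming the goal is to call str.replace and str.split on each value of b with the matching value of a as the argument. The values in a differ from the DataFrame above so that each one actually occurs in its row of b, and the column names removed and parts are made up for illustration:

```python
import pandas as pd

# Values in 'a' are chosen so that each one occurs in the matching row of 'b'.
df = pd.DataFrame({'a': ['l', 'o', 'e', 'i'],
                   'b': ['hello', 'good', 'great', 'nice']})

# Pair the columns row by row, then call the string method on each pair.
# 'removed' and 'parts' are illustrative column names, not from the question.
df['removed'] = [text.replace(char, '') for char, text in zip(df['a'], df['b'])]
df['parts'] = [text.split(char) for char, text in zip(df['a'], df['b'])]

print(df)
```

Only the method call inside the comprehension changes; the pairing of the two columns stays the same.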
P.S. I have a self-answer below.
Upvotes: 1
Views: 1169
Reputation: 34086
Here's another solution using zip and a list comprehension. It should perform better than df.apply:
In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]
In [1578]: df
Out[1578]:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
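A quick way to confirm that the comprehension gives the same values as a row-wise df.apply is to compare the two directly. This is only a sketch that checks results; it measures nothing, and the names via_zip and via_apply are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                   'b': ['hello', 'good', 'great', 'nice']})

# zip pairs each separator in 'a' with the matching string in 'b'.
via_zip = [sep.join(text) for sep, text in zip(df['a'], df['b'])]

# The row-wise apply computes the same join one row at a time.
via_apply = df.apply(lambda row: row['a'].join(row['b']), axis=1).tolist()

print(via_zip == via_apply)
```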
Upvotes: 2
Reputation: 1074
I tried achieving the output using df.apply:
>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0 haealalao
1 gbobobd
2 gcrcecact
3 ndidcde
dtype: object
Timing it for a performance comparison:
import pandas as pd
from timeit import timeit

def u11_1():
    # Solution 1: iterator over 'a', consumed inside a list comprehension
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    # Solution 2: group by the index and join within each one-row group
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    # Solution 3: list comprehension over the rows as plain lists
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    # Solution 4: row-wise df.apply (writes to a new column 'c')
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

# Rebuild df before each measurement so every function starts from the same data.
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))
Note that I reinitialize df before every timing call so that all the functions start from the same DataFrame. The same can be achieved by passing df to each function as a parameter.
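The parameter-passing variant could look roughly like the sketch below. The helper names make_df, join_apply and join_zip are hypothetical, and the functions return the joined values instead of writing them back, which is an assumption made here so that the input never changes between runs:

```python
from timeit import timeit

import pandas as pd


def make_df():
    # Build the same small frame used in the answers above.
    return pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                         'b': ['hello', 'good', 'great', 'nice']})


def join_apply(frame):
    # Return the joined values rather than assigning them to a column.
    return frame.apply(lambda row: row['a'].join(row['b']), axis=1)


def join_zip(frame):
    return [sep.join(text) for sep, text in zip(frame['a'], frame['b'])]


df = make_df()
for func in (join_apply, join_zip):
    # The frame is never modified, so every run sees identical input.
    seconds = timeit(lambda: func(df), number=5)
    print(func.__name__, seconds)
```

Because nothing is assigned back to df, there is no need to rebuild it between measurements.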
Upvotes: 2
Reputation: 71600
The code below is the fastest solution I could come up with for this question:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
The code above first creates an iterator over the a column. Inside the list comprehension, next fetches the next value of a on each pass, and that value joins the current string from b.
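To see the mechanism without pandas, here is a small sketch on plain lists (the names seps and words are illustrative). It also shows that consuming the iterator with next pairs the values in the same order as zip:

```python
seps = ['a', 'b', 'c', 'd']
words = ['hello', 'good', 'great', 'nice']

# iter() returns an iterator; each next() call hands out one separator.
it = iter(seps)
joined = [next(it).join(word) for word in words]

# zip pairs the two sequences explicitly and gives the same result.
zipped = [sep.join(word) for sep, word in zip(seps, words)]

print(joined)
print(joined == zipped)
```

One difference worth knowing: if seps is shorter than words, next raises StopIteration partway through, whereas zip silently stops at the shorter input.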
Here are my solutions:
Solution 1:
Use a list comprehension and an iterator:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)
Solution 2:
Group by the index, then apply str.join to the two columns' values in each group:
df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)
Solution 3:
Use a list comprehension that iterates over the rows of both columns and calls str.join on each pair:
df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)
All of these output:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
Now on to timing with the timeit module. Here is the code used for timing:
import pandas as pd
from timeit import timeit

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_1():
    # Solution 1: iterator over 'a', consumed inside a list comprehension
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    # Solution 2: group by the index and join within each one-row group
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    # Solution 3: list comprehension over the rows as plain lists
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))
Output:
Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698
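So the first solution, which uses an iterator, is the quickest in this run. Bear in mind that every function overwrites df['b'] and df is not reset between runs, so the later solutions operate on longer strings than the first.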
Upvotes: 3