Reputation: 5552
Given a dataframe df
, (real life is a +1000 row df). Elements of ColB
are lists of lists.
ColA ColB
0 'A' [['a','b','c'],['d','e','f']]
1 'B' [['f','g','h'],['i','j','k']]
2 'A' [['l','m','n'],['o','p','q']]
How can efficiently create a ColC
that is a string using the elements in the different columns, like this:
ColC
'A>+a b:c,+d e:f'
'B>+f g:h,+i j:k'
'A>+l m:n,+o p:q'
I tried with df.apply
along these lines, inspired by this:
df['ColC'] = df.apply(lambda x:'%s>' % (x['ColA']),axis=1)
This works for the first 2 elements of the string. Having a hard time with the rest.
Upvotes: 5
Views: 240
Reputation: 153560
Something like this?
df['ColC'] = df.ColA + '>+' + df.ColB.str[0].str[0] + \
' ' + df.ColB.str[0].str[1] + ':' + \
df.ColB.str[0].str[2] + ',+' + \
df.ColB.str[1].str[0] + ' ' + \
df.ColB.str[1].str[1] + ':' + \
df.ColB.str[1].str[2]
Output:
ColA ColB ColC
0 A [[a, b, c], [d, e, f]] A>+a b:c,+d e:f
1 B [[f, g, h], [i, j, k]] B>+f g:h,+i j:k
2 A [[l, m, n], [o, p, q]] A>+l m:n,+o p:q
df = pd.concat([df]*333)
Wen's Method
%%timeit df[['t1','t2']]=df['ColB'].apply(pd.Series).applymap(lambda x : ('{} {}:{}'.format(x[0],x[1],x[2]))) df.ColA+'>+'+df.t1+',+'+df.t2
1 loop, best of 3: 363 ms per loop
miradulo Method
%%timeit df.apply(lambda r:'{}>+{} {}:{},+{} {}:{}'.format(*flatten(r)), axis=1)
10 loops, best of 3: 74.9 ms per loop
ScottBoston Method
%%timeit df.ColA + '>+' + df.ColB.str[0].str[0] + \ ' ' + df.ColB.str[0].str[1] + ':' + \ df.ColB.str[0].str[2] + ',+' + \ df.ColB.str[1].str[0] + ' ' + \ df.ColB.str[1].str[1] + ':' + \ df.ColB.str[1].str[2]
100 loops, best of 3: 12.4 ms per loop
Upvotes: 3
Reputation: 29740
If we make use of a flatten
function as follows
def flatten(l):
for el in l:
if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
yield from flatten(el)
else:
yield el
as seen in this answer, then we can easily apply
a string formatting with the flattened elements.
>>> df.apply(lambda r:'{}>+{} {}:{},+{} {}:{}'.format(*flatten(r.values)), axis=1)
0 A>+a b:c,+d e:f
1 B>+f g:h,+i j:k
2 A>+l m:n,+o p:q
dtype: object
This would hopefully generalize somewhat well.
>>> row_formatter = lambda r: '{}>+{} {}:{},+{} {}:{}'.format(*flatten(r.values))
>>> df.apply(row_formatter, 1)
0 A>+a b:c,+d e:f
1 B>+f g:h,+i j:k
2 A>+l m:n,+o p:q
dtype: object
Upvotes: 2
Reputation: 2696
Yet another answer:
df['ColC'] = df.apply(lambda x: '%s>+%s %s:%s,+%s%s:%s'% tuple([x['ColA']]+x['ColB'][0]+x['ColB'][1]),axis=1)
Upvotes: 2
Reputation: 13725
Here's my 2 cents also using apply
Define a function that you can apply to the dataframe and use string formatting to parse your columns
def get_string(x):
col_a = x.ColA
col_b = (ch for ch in x.ColB if ch.isalnum())
string = '{0}>+{1} {2}:{3},+{4} {5}:{6}'.format(col_a.strip("\'"), *col_b)
return(string)
df['ColC'] = df.apply(get_string, axis=1)
df.ColC
0 A>+a b:c,+d e:f
1 B>+f g:h,+i j:k
2 A>+l m:n,+o p:q
I like this because it's easy to modify the format, though using apply in this way might be slow
Upvotes: 1
Reputation: 323396
You are right using apply
df[['t1','t2']]=df['colB'].apply(pd.Series).applymap(lambda x : ('{} {}:{}'.format(x[0],x[1],x[2])))
df.colA+'>+'+df.t1+',+'+df.t2
Out[648]:
0 A>+a b:c,+d e:f
1 B>+f g:h,+i j:k
2 C>+l m:n,+o p:q
Upvotes: 2