sbradbio
sbradbio

Reputation: 169

string operation on pandas df

pandas df with 11 columns need to modify first 3 columns using regex and add a new column with this modified column and us this for downstream concatenation, something like this I need to keep the element as is of these columns and make it a unique string

column1 column2 column3 column4 ...column 11

need to do this new_col = column1:column2-column3(column4)

and make this new column,

column1 column2 column3 newcol column4 ...column 11

I could do this using simple python one line, not sure what is the syntax for pandas

l = cols[0] + ":" + cols[1] + "-" + cols[2] + "(" + cols[5] + ")"

Upvotes: 0

Views: 885

Answers (3)

Matthias Fripp
Matthias Fripp

Reputation: 18625

Based on an answer that was recently deleted, this works fine:

df1 = pd.DataFrame({
    'chrom': ['a', 'b', 'c'], 
    'start': ['d', 'e', 'f'], 
    'end': ['g', 'h', 'i'], 
    'strand': ['j', 'k', 'l']}
)
df1['unique_col'] = df1.chrom + ':' + df1.start + '-' + df1.end + '(' + df1.strand + ')'

It sounds like your original dataframe may not contain strings. If it contains numbers, you need something like this:

df1 = pd.DataFrame({
    'chrom': [1.0, 2.0], 
    'start': [3.0, 4.0], 
    'end': [5.0, 6.0], 
    'strand': [7.0, 8.0]}
)
df1['unique_col'] = (
    df1.chrom.astype(str) + ':' 
    + df1.start.astype(str) + '-' + df1.end.astype(str)
    + '(' + df1.strand.astype(str) + ')'
)

Upvotes: 1

piRSquared
piRSquared

Reputation: 294488

consider the dataframe df

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(a, (5, 10))).add_prefix('col ')

print(df)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9
0     Q     L     C     K     P     X     N     L     N     T
1     I     X     A     W     Y     M     W     A     C     A
2     U     Z     H     T     N     S     M     E     D     T
3     N     W     H     X     N     U     F     D     X     F
4     Z     L     Y     H     M     G     E     H     W     S

Construct a custom format function

f = lambda row: '{col 1}:{col 2}-{col 3}({col 4})'.format(**row)

And apply to df

df.astype(str).apply(f, 1)

0    L:C-K(P)
1    W:A-C(A)
2    W:H-X(N)
3    E:H-W(S)
4    Y:E-P(N)
dtype: object

Add a new column with assign

df.assign(New=df.astype(str).apply(f, 1))
# assign in place with
# df['New'] = df.astype(str).apply(f, 1)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

Or you can wrap this up into another function that operates on pd.Series. This requires that you pass the columns in the correct order.

def u(a, b, c, d):
    return a + ':' + b + '-' + c + '(' + d + ')'

df.assign(New=u(df['col 1'], df['col 2'], df['col 3'], df['col 4']))
# assign in place with
# df['New'] = u(df['col 1'], df['col 2'], df['col 3'], df['col 4'])

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

Upvotes: 2

Grr
Grr

Reputation: 16099

You should just be able to do it with the same syntax you posted as long as all of the columns contain strings.

You can also use the Series.str.cat method.

df['new_col'] = cols[0].str.cat(':' + cols[1] + '-' + cols[2] + '(' + cols[5]+ ')')

Upvotes: 3

Related Questions