Reputation: 417
I would like to append a number to each value in the CellID column in order to classify them. The dataframe is this:
umap
CellID wnnUMAP_1 wnnUMAP_2
0 KO_d0_r1:AAACAGCCACCTGCTCx -8.127543 1.593849
1 KO_d0_r2:AAACAGCCACGTAATTx -7.246094 -4.566527
2 HT_d0_r1:AAACAGCCATAATGAGx 7.617473 2.449949
3 HT_d0_r2:AAACATGCACCTAATGx -7.944949 6.633856
And the result I want is this:
umap
CellID wnnUMAP_1 wnnUMAP_2
0 KO_d0_r1:AAACAGCCACCTGCTCx-0 -8.127543 1.593849
1 KO_d0_r2:AAACAGCCACGTAATTx-1 -7.246094 -4.566527
2 HT_d0_r1:AAACAGCCATAATGAGx-2 7.617473 2.449949
3 HT_d0_r2:AAACATGCACCTAATGx-3 -7.944949 6.633856
I would like to add -0 to KO_d0_r1, -1 to KO_d0_r2, -2 to HT_d0_r1 and -3 to HT_d0_r2.
This is just an example; I have a lot of strings with the prefix KO_d0_r1, etc., so I would like to distinguish them by the suffix.
My attempt was:
umap = umap.rename(columns = {'Unnamed: 0':'CellID'})
But it doesn't work
Upvotes: 2
Views: 72
Reputation: 527
Another, simpler approach that doesn't require a mapping, which is especially useful if you have a large number of unique CellID values.
If df['CellID'] has no duplicates:
df['CellID'] = df['CellID'] + '-' + (df.index + 1).astype(str)
If df['CellID'] contains duplicates:
df
CellID wnnUMAP_1 wnnUMAP_2
0 KO_d0_r1:AAACAGCCACCTGCTCx -8.127543 1.593849
1 KO_d0_r2:AAACAGCCACGTAATTx -7.246094 -4.566527
2 HT_d0_r1:AAACAGCCATAATGAGx 7.617473 2.449949
3 HT_d0_r2:AAACATGCACCTAATGx -7.944949 6.633856
4 HT_d0_r2:AAACATGCACCTAATGx -6.944949 2.633856
5 HT_d0_r2:AAACATGCACCTAATGx -5.944949 3.633856
df = df.merge(
    (df['CellID'].drop_duplicates() + '-' + (df['CellID'].drop_duplicates().index + 1).astype(str))
    .reset_index(name='CellID_classified')
    .eval('CellID= CellID_classified.str.split("-").str[0]')
    .drop('index', axis=1),
    on='CellID', how='left'
).drop('CellID', axis=1)
df
wnnUMAP_1 wnnUMAP_2 CellID_classified
0 -8.127543 1.593849 KO_d0_r1:AAACAGCCACCTGCTCx-1
1 -7.246094 -4.566527 KO_d0_r2:AAACAGCCACGTAATTx-2
2 7.617473 2.449949 HT_d0_r1:AAACAGCCATAATGAGx-3
3 -7.944949 6.633856 HT_d0_r2:AAACATGCACCTAATGx-4
4 -6.944949 2.633856 HT_d0_r2:AAACATGCACCTAATGx-4
5 -5.944949 3.633856 HT_d0_r2:AAACATGCACCTAATGx-4
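For the duplicate case, a shorter route may be pd.factorize, which gives every distinct CellID one integer code. The sketch below is an alternative to the merge, not the method used in this answer; the names codes and suffix are only illustrative, and unlike the merge it keeps the original CellID column.

```python
import pandas as pd

# Same example data as above, including the duplicated CellID rows.
df = pd.DataFrame({
    "CellID": [
        "KO_d0_r1:AAACAGCCACCTGCTCx",
        "KO_d0_r2:AAACAGCCACGTAATTx",
        "HT_d0_r1:AAACAGCCATAATGAGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
    ],
    "wnnUMAP_1": [-8.127543, -7.246094, 7.617473, -7.944949, -6.944949, -5.944949],
    "wnnUMAP_2": [1.593849, -4.566527, 2.449949, 6.633856, 2.633856, 3.633856],
})

# factorize numbers each distinct CellID 0, 1, 2, ... in order of first
# appearance, so repeated IDs share one code.
codes, _ = pd.factorize(df["CellID"])

# Shift to 1-based numbering to match the output shown above.
suffix = pd.Series(codes + 1, index=df.index).astype(str)
df["CellID_classified"] = df["CellID"] + "-" + suffix

print(df["CellID_classified"].tolist())
```

The codes here are always consecutive, so they can differ from the index-based numbers of the merge version when duplicates occur before the last distinct ID.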
Upvotes: 2
Reputation: 18406
Create a dictionary that maps each prefix to the suffix value of interest, then split CellID on : with n=1 (which splits at most once), take the first part, and call Series.map with the dictionary. Finally, concatenate the result with the CellID column.
mapping = {'KO_d0_r1':'0', 'KO_d0_r2':'1', 'HT_d0_r1': '2', 'HT_d0_r2':'3'}
umap['CellID']=umap['CellID']\
+'-'\
+umap['CellID'].str.split(':', n=1).str[0].map(mapping)
OUTPUT
CellID wnnUMAP_1 wnnUMAP_2
0 KO_d0_r1:AAACAGCCACCTGCTCx-0 -8.127543 1.593849
1 KO_d0_r2:AAACAGCCACGTAATTx-1 -7.246094 -4.566527
2 HT_d0_r1:AAACAGCCATAATGAGx-2 7.617473 2.449949
3 HT_d0_r2:AAACATGCACCTAATGx-3 -7.944949 6.633856
PS: map returns NaN for values that could not be mapped, which may throw a TypeError. For this data I just assumed the prefix always exists in the mapping; otherwise, you may want to handle it.
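One possible way to handle prefixes that are missing from the mapping is to compute the suffix separately, report the unmapped prefixes, and fill the gaps with a placeholder. This is only a sketch: the XX_d0_r1 row and the "NA" placeholder are made up for illustration, and raising an error instead may suit your data better.

```python
import pandas as pd

# Small example; the second row has a prefix that is deliberately not mapped.
umap = pd.DataFrame({
    "CellID": [
        "KO_d0_r1:AAACAGCCACCTGCTCx",
        "XX_d0_r1:AAACAGCCACGTAATTx",  # hypothetical prefix, for illustration
    ]
})
mapping = {"KO_d0_r1": "0"}

prefix = umap["CellID"].str.split(":", n=1).str[0]
suffix = prefix.map(mapping)  # NaN where the prefix is not a key of mapping

# Collect the prefixes that could not be mapped so they can be inspected.
unmapped = prefix[suffix.isna()].unique()
print("unmapped prefixes:", list(unmapped))

# Fill the gaps with a placeholder ("NA" is an arbitrary choice) before joining.
umap["CellID"] = umap["CellID"] + "-" + suffix.fillna("NA")
print(umap["CellID"].tolist())
```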
If you are not so concerned about the exact suffixes and just need a unique number assigned to each prefix, you can also use groupby and then call ngroup():
umap['CellID'] = umap['CellID'] \
+ '-' \
+ (umap
.groupby(umap['CellID'].str.split(':', n=1).str[0], sort=False)
.ngroup()
.astype('str')
)
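To see what the ngroup() version produces, here is a small self-contained run. The fifth row reuses the KO_d0_r1 prefix with a made-up barcode, purely to show that rows sharing a prefix get the same number; the variable names prefix and group_number are illustrative.

```python
import pandas as pd

umap = pd.DataFrame({
    "CellID": [
        "KO_d0_r1:AAACAGCCACCTGCTCx",
        "KO_d0_r2:AAACAGCCACGTAATTx",
        "HT_d0_r1:AAACAGCCATAATGAGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
        "KO_d0_r1:AAACCAACATTGCGACx",  # hypothetical extra barcode
    ]
})

# Text before the first ":" is the prefix used as the grouping key.
prefix = umap["CellID"].str.split(":", n=1).str[0]

# With sort=False, groups are numbered in order of first appearance.
group_number = umap.groupby(prefix, sort=False).ngroup()

umap["CellID"] = umap["CellID"] + "-" + group_number.astype(str)
print(umap["CellID"].tolist())
```

Without sort=False the groups would be numbered in sorted key order, so HT_d0_r1 would get 0 instead of KO_d0_r1.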
Upvotes: 0
Reputation: 3664
You can use Series.str.cat() to concatenate strings.
df["CellID"] = df["CellID"].str.cat([df.index.map(str)], sep="-")
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.cat.html
import pandas as pd
data = [["KO_d0_r1:AAACAGCCACCTGCTCx", -8.127543, 1.593849],
["KO_d0_r2:AAACAGCCACGTAATTx", -7.246094, -4.566527],
["HT_d0_r1:AAACAGCCATAATGAGx", 7.617473, 2.449949]]
df = pd.DataFrame(data, columns=["CellID", "wnnUMAP_1", "wnnUMAP_2"])
df["CellID"] = df["CellID"].str.cat([df.index.map(str)], sep="-")
df is now:
CellID wnnUMAP_1 wnnUMAP_2
0 KO_d0_r1:AAACAGCCACCTGCTCx-0 -8.127543 1.593849
1 KO_d0_r2:AAACAGCCACGTAATTx-1 -7.246094 -4.566527
2 HT_d0_r1:AAACAGCCATAATGAGx-2 7.617473 2.449949
Upvotes: 1