Reputation: 13
I would like to convert the string values in an entire dataframe to unique integer IDs. Below is a simplified version of what I want to do; the real dataframe has 20+ columns and 100,000+ rows. I need the conversion in order to run a Fisher test per row, which needs distinct integers to detect a difference between column groups.
X col1 col2 col3
1 0/0 1/1 0/0
2 0/2 0/0 1/1
3 1/2 0/2 1/1
4 0/0 0/0 0/0
to
X col1 col2 col3
1 1 2 1
2 3 1 2
3 4 3 2
4 1 1 1
I tried pd.factorize, but couldn't figure out how to apply it to an entire dataframe like this; I could only do it one column at a time with the following code: df = df.apply(lambda x: pd.factorize(x)[0]).
Doing it per row would also work, since the data is parsed row by row.
Upvotes: 1
Views: 71
Reputation: 5451
You can do it like this using the apply function:
import pandas as pd
df = pd.DataFrame([['0/0', '1/1', '0/0'], ['0/2', '0/0', '1/1'], ['1/2', '0/2', '1/1'], ['0/0', '0/0', '0/0']], columns=('col1', 'col2', 'col3'))
df2 = df.apply(lambda s: [sum(map(int,x.split("/"))) for x in s])
df2[df2==0] = 1
df2
Result
col1 col2 col3
0 1 2 1
1 2 1 2
2 3 2 2
3 1 1 1
Upvotes: 0
Reputation: 4301
Try this:
import pandas as pd
df = pd.DataFrame([['0/0', '1/1', '0/0'], ['0/2', '0/1', '1/1'], ['1/2', '0/2', '1/1'], ['0/0', '0/0', '0/0']])
d = {n:m for m, n in enumerate(list(set([j for i in df.values.tolist() for j in i])))}
df_new = df.replace(d)
Input:
0 1 2
0 0/0 1/1 0/0
1 0/2 0/1 1/1
2 1/2 0/2 1/1
3 0/0 0/0 0/0
Output:
0 1 2
0 2 4 2
1 1 3 4
2 0 1 4
3 2 2 2
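Because the mapping above is built from a set, the integer assigned to each string can differ between runs. A possible variant, sketched here on the question's sample data, flattens the frame and calls pd.factorize once, so codes follow order of first appearance and the same string gets the same ID in every column. The name df_ids and the +1 offset (to start IDs at 1, as in the question's desired output) are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Sample data from the question (without the X column)
df = pd.DataFrame(
    [["0/0", "1/1", "0/0"],
     ["0/2", "0/0", "1/1"],
     ["1/2", "0/2", "1/1"],
     ["0/0", "0/0", "0/0"]],
    columns=["col1", "col2", "col3"],
)

# Flatten row by row and factorize once, so each distinct string
# receives a single code that is shared across all columns.
codes, uniques = pd.factorize(df.to_numpy().ravel())

# Restore the original shape; add 1 so IDs start at 1 instead of 0.
df_ids = pd.DataFrame(
    codes.reshape(df.shape) + 1,
    index=df.index,
    columns=df.columns,
)
print(df_ids)
```

Here uniques keeps the original strings in code order, so uniques[id - 1] recovers the string for a given ID.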
Upvotes: 0
Reputation: 25239
Use df.rank with method='dense'. Each unique string within a column will be assigned a unique number/rank.
df_final = df.set_index('X').rank(method='dense').astype(int)
Out[244]:
col1 col2 col3
X
1 1 3 1
2 2 1 2
3 3 2 2
4 1 1 1
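Note that in the output above 1/1 is 3 in col2 but 2 in col3, because rank works on each column separately. If the numbers need to be consistent across the whole frame, one possible sketch is to compute the dense rank over all cells at once. The version below uses np.unique with return_inverse=True, which is a swapped-in technique rather than df.rank itself: it sorts the distinct strings and returns each cell's position in that sorted list, which equals the dense rank minus one. The name df_global is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, indexed by X as in the answer above
df = pd.DataFrame(
    {"X": [1, 2, 3, 4],
     "col1": ["0/0", "0/2", "1/2", "0/0"],
     "col2": ["1/1", "0/0", "0/2", "0/0"],
     "col3": ["0/0", "1/1", "1/1", "0/0"]}
).set_index("X")

# uniques: sorted distinct strings; inverse: index of each cell in uniques
uniques, inverse = np.unique(df.to_numpy().ravel(), return_inverse=True)

# Reshape back to the frame's layout; add 1 so ranks start at 1
df_global = pd.DataFrame(
    inverse.reshape(df.shape) + 1,
    index=df.index,
    columns=df.columns,
)
print(df_global)
```

With this, 0/0 maps to 1, 0/2 to 2, 1/1 to 3 and 1/2 to 4 in every column.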
Upvotes: 1