Alex Lenail
Alex Lenail

Reputation: 14450

How to replace items with their indices in a pandas series

Problem Statement

I have file containing a graph. The file is a csv with two columns, source name and target name. I'd like to generate a file/dataframe with source and target being numerical IDs instead of strings.

I know this can be done in python, with something like

node_names = list(set(source_node_names) | set(target_node_names))
names_to_ids = invert(dict(enumerate(node_names)))  
# followed by some sort of replace operation using this dictionary

But I'm trying to learn pandas, and felt this would be a good opportunity to do so.

Some questions:

My best idea so far:

I think I could take my series of node names, somehow explicitly make it a dataframe such that one column is the indices and another is the name, then do two merges with the graph dataframe, once on source and once on target, such that I have source IDs and target IDs, then just hold on to those two fields. It feels to me like there must be a better way than two merges.

Some sample data

As requested by @Cleb. Input:

# I have                And I want: 
    RNF14   VDR             0   1
    RNF14   SMAD            0   2
    RNF14   UBE2D4          0   3
    RNF14   EIF2B5          0   4
    RNF14   UBE2D2          0   5
    RNF14   SMAD            0   6
    RNF14   UBE2D1          0   7
    RNF14   UBE2D3          0   8
    RNF14   IST1            0   9
    RNF14   EXOSC3          0   10
    RNF14   EXOSC5          0   11
    RNF14   SMURF1          0   12
    RNF14   SMURF2          0   13

Obviously this is a trivial case. I have about a million edges in my graph, for maybe 100k nodes.

Update #1:

It seems like factorizing might be what I want, but I want to factorize two columns of a dataframe in the same index space, which seems non-obvious.

I've built an index from names to IDs, I just don't know how to replace my original dataframe with the IDs. This would be some kind of "merge" operation I'm not familiar with.

Upvotes: 1

Views: 178

Answers (1)

miradulo
miradulo

Reputation: 29690

I would opt for pd.factorize() across columns.

df.apply(lambda col: pd.factorize(col)[0]+1)

If you want unique ids in each column, you could just unstack first.

stacked = df.stack()
pd.DataFrame(stacked.factorize()[0], index=stacked.index).unstack()

Demo

>>> df = pd.DataFrame(dict(const=['things']*12, 
                           unqs=['foo']*4+['bar']*3+['baz']*5))

>>> df
     const unqs
0   things  foo
1   things  foo
2   things  foo
3   things  foo
4   things  bar
5   things  bar
6   things  bar
7   things  baz
8   things  baz
9   things  baz
10  things  baz
11  things  baz

>>> stacked = df.stack()

>>> pd.DataFrame(stacked.factorize()[0], index=stacked.index).unstack()
       0     
   const unqs
0      0    1
1      0    1
2      0    1
3      0    1
4      0    2
5      0    2
6      0    2
7      0    3
8      0    3
9      0    3
10     0    3
11     0    3

Upvotes: 1

Related Questions