Reputation: 14450
I have file containing a graph. The file is a csv with two columns, source name and target name. I'd like to generate a file/dataframe with source and target being numerical IDs instead of strings.
I know this can be done in python, with something like
node_names = list(set(source_node_names) | set(target_node_names))
names_to_ids = invert(dict(enumerate(node_names)))
# followed by some sort of replace operation using this dictionary
But I'm trying to learn pandas, and felt this would be a good opportunity to do so.
Does it make sense to use a Categorical for this problem? I didn't think so, since each of my nodes is not a category, but my googling was steering me in that direction.
Right now I have my names in a series. Series.factorize seemed promising, but I'm not entirely clear on what the return is, largely because I'm not clear on what a pandas Index is.
Meta-question #1: Is there a good explanation of pandas Index somewhere? I can't find a "pandas Index tutorial" probably in part because pandas has a tutorial called Indexing and Selecting Data which I think answers most people's questions, but assumes some knowledge I don't have:
The pandas Index class and its subclasses can be viewed as implementing an ordered multiset.
=0
Meta-question #2: Are there any excellent resources for learning pandas?So far I've been doing a lot of googling for docs and stackoverflowing, but I may want to learn pandas properly. Are there large tutorials or books people trust?
I think I could take my series of node names, somehow explicitly make it a dataframe such that one column is the indices and another is the name, then do two merges with the graph dataframe, once on source and once on target, such that I have source IDs and target IDs, then just hold on to those two fields. It feels to me like there must be a better way than two merges.
As requested by @Cleb. Input:
# I have And I want:
RNF14 VDR 0 1
RNF14 SMAD 0 2
RNF14 UBE2D4 0 3
RNF14 EIF2B5 0 4
RNF14 UBE2D2 0 5
RNF14 SMAD 0 6
RNF14 UBE2D1 0 7
RNF14 UBE2D3 0 8
RNF14 IST1 0 9
RNF14 EXOSC3 0 10
RNF14 EXOSC5 0 11
RNF14 SMURF1 0 12
RNF14 SMURF2 0 13
Obviously this is a trivial case. I have about a million edges in my graph, for maybe 100k nodes.
It seems like factorizing might be what I want, but I want to factorize two columns of a dataframe in the same index space, which seems non-obvious.
I've built an index from names to IDs, I just don't know how to replace my original dataframe with the IDs. This would be some kind of "merge" operation I'm not familiar with.
Upvotes: 1
Views: 178
Reputation: 29690
I would opt for pd.factorize()
across columns.
df.apply(lambda col: pd.factorize(col)[0]+1)
If you want unique ids in each column, you could just unstack first.
stacked = df.stack()
pd.DataFrame(stacked.factorize()[0], index=stacked.index).unstack()
Demo
>>> df = pd.DataFrame(dict(const=['things']*12,
unqs=['foo']*4+['bar']*3+['baz']*5))
>>> df
const unqs
0 things foo
1 things foo
2 things foo
3 things foo
4 things bar
5 things bar
6 things bar
7 things baz
8 things baz
9 things baz
10 things baz
11 things baz
>>> stacked = df.stack()
>>> pd.DataFrame(stacked.factorize()[0], index=stacked.index).unstack()
0
const unqs
0 0 1
1 0 1
2 0 1
3 0 1
4 0 2
5 0 2
6 0 2
7 0 3
8 0 3
9 0 3
10 0 3
11 0 3
Upvotes: 1