Reputation: 409
I have a dataframe containing a list of domains (vertices/nodes in my case), which I'm storing with the pandas library:
domain
0 airbnb.com
1 facebook.com
2 st.org
3 index.co
4 crunchbase.com
5 avc.com
6 techcrunch.com
7 google.com
I have another dataframe which contains the connections between these domains (aka edges):
source_domain destination_domain
0 airbnb.com google.com
1 facebook.com google.com
2 st.org facebook.com
3 st.org airbnb.com
4 st.org crunchbase.com
5 index.co techcrunch.com
6 crunchbase.com techcrunch.com
7 crunchbase.com airbnb.com
8 avc.com techcrunch.com
9 techcrunch.com st.org
10 techcrunch.com google.com
11 techcrunch.com facebook.com
Since this dataset will get much larger, I read that I can get faster performance if I represent the "edges" dataframe with integers only, instead of strings.
So, I'm wondering if there is a fast way to replace each cell in the edges dataframe with the corresponding id from the domains (aka vertices) dataframe? So row 1 in the edges dataframe might end up looking like:
###### Before: #####################
1 facebook.com google.com
###### After: #####################
1 1 7
How can I go about doing this? Thank you in advance.
Upvotes: 1
Views: 2663
Reputation: 862741
Another approach is to convert the columns to Categorical and then use cat.codes to get the integers:
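import pandas as pd

# if the domains in df1 are always unique, .unique() can be omitted
#cats = df1['domain'].unique()
cats = df1['domain']
# current pandas no longer accepts astype('category', categories=...); pass a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type)
df2['destination_domain'] = df2['destination_domain'].astype(cat_type)
# the codes follow the order of cats, i.e. the row order of df1
df2['source_code'] = df2['source_domain'].cat.codes
df2['dest_code'] = df2['destination_domain'].cat.codes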
print (df2)
source_domain destination_domain source_code dest_code
0 airbnb.com google.com 0 7
1 facebook.com google.com 1 7
2 st.org facebook.com 2 1
3 st.org airbnb.com 2 0
4 st.org crunchbase.com 2 4
5 index.co techcrunch.com 3 6
6 crunchbase.com techcrunch.com 4 6
7 crunchbase.com airbnb.com 4 0
8 avc.com techcrunch.com 5 6
9 techcrunch.com st.org 6 2
10 techcrunch.com google.com 6 7
11 techcrunch.com facebook.com 6 1
# or overwrite the original columns with the codes directly
df2['source_domain'] = df2['source_domain'].astype(pd.CategoricalDtype(categories=cats)).cat.codes
df2['destination_domain'] = df2['destination_domain'].astype(pd.CategoricalDtype(categories=cats)).cat.codes
print (df2)
source_domain destination_domain
0 0 7
1 1 7
2 2 1
3 2 0
4 2 4
5 3 6
6 4 6
7 4 0
8 5 6
9 6 2
10 6 7
11 6 1
If you want to replace the values using a dict, use map:
d = dict(zip(df1.domain.values, df1.index.values))
df2['source_code'] = df2['source_domain'].map(d)
df2['dest_code'] = df2['destination_domain'].map(d)
print (df2)
source_domain destination_domain source_code dest_code
0 airbnb.com google.com 0 7
1 facebook.com google.com 1 7
2 st.org facebook.com 2 1
3 st.org airbnb.com 2 0
4 st.org crunchbase.com 2 4
5 index.co techcrunch.com 3 6
6 crunchbase.com techcrunch.com 4 6
7 crunchbase.com airbnb.com 4 0
8 avc.com techcrunch.com 5 6
9 techcrunch.com st.org 6 2
10 techcrunch.com google.com 6 7
11 techcrunch.com facebook.com 6 1
Upvotes: 2
Reputation: 294318
The simplest way to do this is to generate a dictionary from the vertices dataframe (if we can be sure that it represents the definitive set of vertices that will show up in the edges) and use it with replace.
Since the index of the vertices dataframe already has the factor information...
m = dict(zip(vertices.domain, vertices.index))
edges.replace(m)
source_domain destination_domain
0 0 7
1 1 7
2 2 1
3 2 0
4 2 4
5 3 6
6 4 6
7 4 0
8 5 6
9 6 2
10 6 7
11 6 1
You can also use stack/map/unstack:
m = dict(zip(vertices.domain, vertices.index))
edges.stack().map(m).unstack()
source_domain destination_domain
0 0 7
1 1 7
2 2 1
3 2 0
4 2 4
5 3 6
6 4 6
7 4 0
8 5 6
9 6 2
10 6 7
11 6 1
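Both approaches above rely on the vertices dataframe covering every domain that appears in the edges. A minimal sketch of how that assumption could be checked before mapping; the small `vertices` and `edges` frames here, including the `unknown.io` entry, are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; "unknown.io" is deliberately absent from vertices.
vertices = pd.DataFrame({"domain": ["airbnb.com", "facebook.com", "st.org"]})
edges = pd.DataFrame({
    "source_domain": ["st.org", "st.org", "unknown.io"],
    "destination_domain": ["facebook.com", "airbnb.com", "st.org"],
})

# Collect every distinct domain mentioned in either edge column.
edge_domains = pd.unique(edges[["source_domain", "destination_domain"]].values.ravel())

# Domains referenced by an edge but missing from the vertices dataframe.
missing = sorted(set(edge_domains) - set(vertices["domain"]))
print(missing)
```

If `missing` is not empty, those domains have no id to map to, so they would need to be added to the vertices first.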
editorial
I wanted to comment on @JohnZwinck's answer in addition to providing information of my own.
First, categorical would provide faster performance. However, I'm unclear on how to ensure that two columns have coordinated categories. What I mean by coordinated is that each column gets a set of integers assigned to its categories behind the scenes, and we have no way (that I know of) to know or enforce that these integers are the same across columns. If we made it one big column and then converted that column to a categorical, that would work. However, I believe it would turn back to object once we split it into two columns again.
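One possible way to get coordinated codes, sketched under the assumption that a reasonably recent pandas with `pd.CategoricalDtype` is available: build a single dtype from the vertices and cast both edge columns to it. The sample frames and the names `domain_type`, `edges_cat` and `codes` are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data.
vertices = pd.DataFrame({"domain": ["airbnb.com", "facebook.com", "st.org", "google.com"]})
edges = pd.DataFrame({
    "source_domain": ["airbnb.com", "st.org"],
    "destination_domain": ["google.com", "facebook.com"],
})

# One dtype, built from the vertices, shared by both columns.
# The category order is the row order of the vertices dataframe.
domain_type = pd.CategoricalDtype(categories=list(vertices["domain"]))

# Casting the whole frame applies the same dtype to every column.
edges_cat = edges.astype(domain_type)

# Integer codes per column; the same domain gets the same code in both.
codes = edges_cat.apply(lambda col: col.cat.codes)
print(codes)
```

Because both columns share the same category list, a given code refers to the same domain in either column, and each column keeps its categorical dtype when selected on its own.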
Upvotes: 2
Reputation: 249223
This is a good use case for Categorical Data: http://pandas.pydata.org/pandas-docs/stable/categorical.html
In short, Categorical Series will internally represent each item as a number, but display it as a string. This is useful when you have a lot of repeated strings.
It's easier and less error-prone to use Categorical Series vs converting everything to integers manually.
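A small sketch of what that looks like in practice; the sample Series is made up for illustration:

```python
import pandas as pd

# Hypothetical sample with repeated strings.
s = pd.Series(["google.com", "st.org", "google.com", "google.com"])
c = s.astype("category")

print(c.tolist())              # values still read as strings
print(c.cat.codes.tolist())    # integer code stored for each row
print(list(c.cat.categories))  # lookup table from code to string
```

Each distinct string is stored once in the categories, and the rows hold only small integer codes.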
Upvotes: 2