elliot

Reputation: 1944

How to create a unique identifier ID across columns?

I'm trying to prep data for various network visualisation applications in R and in Gephi. These formats want numeric identifiers that link between two databases. I have figured out the linking part, but I can't find a succinct way to create a numeric ID variable across columns in a data frame. Here's some reproducible code that illustrates what I'm trying to do.

org.data <- data.frame(source=c('bob','sue','ann','john','sinbad'),
       target=c('sinbad','turtledove','Aerosmith','bob','john'))

desired.data <- data.frame(source=c('1','2','3','4','5'),
                       target=c('5','6','7','1','4'))


org.data

  source     target
1    bob     sinbad
2    sue turtledove
3    ann  Aerosmith
4   john        bob
5 sinbad       john

desired.data

  source target
1      1      5
2      2      6
3      3      7
4      4      1
5      5      4

Upvotes: 1

Views: 80

Answers (4)

lmo

Reputation: 38510

Here's a base R method using match on the unlisted unique names in the original data.frame.

To replace the current data.frame, use

org.data[] <- sapply(org.data, match, table=unique(unlist(org.data)))

Here, sapply loops over the columns of org.data and applies match to each. match returns the position of its first argument in the table argument; here, table is the vector of unique names across both columns, unique(unlist(org.data)). In this case, sapply returns a matrix. Assigning it to org.data[] (note the empty brackets) replaces the contents of the data.frame while preserving its structure, so the result is still a data.frame with the original column names.
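To make the role of match concrete, here is a minimal sketch using the question's data. The name lookup is only an illustrative variable, and stringsAsFactors = FALSE is an assumption so that the columns are plain character vectors regardless of R version.

```r
# Toy data from the question (assumption: character columns, not factors)
org.data <- data.frame(
  source = c("bob", "sue", "ann", "john", "sinbad"),
  target = c("sinbad", "turtledove", "Aerosmith", "bob", "john"),
  stringsAsFactors = FALSE
)

# Lookup table: every distinct name in order of first appearance
# (all of source first, then names that only occur in target)
lookup <- unique(unlist(org.data))

# match() gives the position of each name in the lookup table
match(c("bob", "john", "turtledove"), lookup)
# 1 4 6

# sapply() over the columns returns a matrix, one column per variable
ids <- sapply(org.data, match, table = lookup)
is.matrix(ids)
# TRUE
```

Because the IDs are just positions in lookup, lookup[id] recovers the original name for any ID.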

To construct a new data.frame, use

setNames(data.frame(sapply(org.data, match, table=unique(unlist(org.data)))),
         names(org.data))

Or better, as Henrik suggests, first create a copy of the data.frame and then use the first line of code to fill in the copy, rather than using setNames and data.frame.

desired.data <- org.data
desired.data[] <- sapply(org.data, match, table=unique(unlist(org.data)))

Both of these return

  source target
1      1      5
2      2      6
3      3      7
4      4      1
5      5      4
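Since the question mentions linking two tables, the same lookup vector can also serve as a node table alongside the recoded edge list. This is only a sketch: the names nodes, edges, id and label are hypothetical, and the column names a given tool expects on import should be checked against its own documentation.

```r
# Toy data from the question (assumption: character columns, not factors)
org.data <- data.frame(
  source = c("bob", "sue", "ann", "john", "sinbad"),
  target = c("sinbad", "turtledove", "Aerosmith", "bob", "john"),
  stringsAsFactors = FALSE
)

# One row per distinct name; the row position is the numeric ID
lookup <- unique(unlist(org.data))
nodes <- data.frame(id = seq_along(lookup), label = lookup,
                    stringsAsFactors = FALSE)

# Edge list with names replaced by IDs; lapply keeps each column an integer vector
edges <- org.data
edges[] <- lapply(org.data, match, table = lookup)

# Sanity check: mapping the IDs back through the node table recovers the names
stopifnot(identical(nodes$label[edges$source], org.data$source))
stopifnot(identical(nodes$label[edges$target], org.data$target))
```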

Upvotes: 4

DatamineR

Reputation: 9618

You could try this:

org.data[] <- as.numeric(factor(c(as.matrix(org.data)), levels = unique(c(as.matrix(org.data)))))
org.data
  source target
1      1      5
2      2      6
3      3      7
4      4      1
5      5      4
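The levels argument matters here. Without it, factor sorts the levels, so the IDs would follow sorted order rather than order of appearance. A small illustration, with x as a hypothetical example vector:

```r
x <- c("bob", "sue", "ann")

# Default levels are sorted (ann, bob, sue), so the codes follow that order
as.integer(factor(x))
# 2 3 1

# Explicit levels in order of first appearance give the IDs the question asks for
as.integer(factor(x, levels = unique(x)))
# 1 2 3
```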

Upvotes: 3

Kota Mori

Reputation: 6750

Convert to factors, then to integers.

org.data <- data.frame(source=c('bob','sue','ann','john','sinbad'),
                       target=c('sinbad','turtledove','Aerosmith','bob','john'))

# need to make sure that columns are characters, not factors
org.data$source <- as.character(org.data$source)
org.data$target <- as.character(org.data$target)

# define possible values that cover the two columns
levels <- unique(c(org.data$source, org.data$target))

# factorize, then cast to integer
org.data$source <- as.integer(factor(org.data$source, levels=levels))
org.data$target <- as.integer(factor(org.data$target, levels=levels))

org.data

Upvotes: 0

Roman

Reputation: 17648

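You can try the following. The idea is to create factors whose levels are the unique names across both columns. Note that source2 and target2 are factors labelled with the IDs, not numeric columns.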

library(tidyverse)
org.data %>% 
  mutate(source2 = factor(source, levels=unique(unlist(org.data)) ,  labels=1:length(unique(unlist(org.data))))) %>% 
  mutate(target2 = factor(target, levels=unique(unlist(org.data)) ,  labels=1:length(unique(unlist(org.data)))))
  source     target source2 target2
1    bob     sinbad       1       5
2    sue turtledove       2       6
3    ann  Aerosmith       3       7
4   john        bob       4       1
5 sinbad       john       5       4

Upvotes: 0
