SeánMcK

Reputation: 422

Create a binary based on three other variables

How can I programmatically calculate desired_output?

The basic structure of my data frame is as follows:

airline<-c(0,0,1,0,0,1)
city1<-c('a','a','a','b','b','c')
city2<-c('b','c','d','c','d','d')
desired_output<-c(0,1,1,0,0,1)

mktdf<-data.frame(airline, city1, city2, desired_output)

The airline dummy indicates whether an airline flies between city1 and city2. When it does not, I want to create a dummy that indicates whether the airline nevertheless flies from both city1 and city2 (just not between them).

For example, the airline does not fly BETWEEN a and b. It does, however, fly between a and d. On the other hand, it never flies from city b, so the first row of desired_output is 0.

In row 2 we observe 1 in desired_output. This is because we know the airline flies from city a, and later we see that it also flies from city c (but again, not between them).

I'm happy to share any code I have written in attempting to solve this, though I was completely unsuccessful and I think it would just be distracting. However, broadly speaking, I have tried using dplyr, looping and the transform function.

Upvotes: 1

Views: 99

Answers (2)

Onyambu

Reputation: 79238

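# Encode each route as a two-letter key (this relies on single-character city names)
a=paste0(city1,city2)

# All pairs that can be formed from the cities on routes the airline flies
b=combn(unlist(strsplit(a[!!(airline)],"")),2,paste0,collapse="")

# 1 if the route's key is among those pairs, 0 otherwise
a%in%b+0L
[1] 0 1 1 0 0 1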


mktdf$desired1=a%in%b+0L
> mktdf
  airline city1 city2 desired_output desired1
1       0     a     b              0        0
2       0     a     c              1        1
3       1     a     d              1        1
4       0     b     c              0        0
5       0     b     d              0        0
6       1     c     d              1        1
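
The paste0/strsplit approach above depends on single-character city names and on each pair appearing in the same order in both vectors. Here is a minimal sketch of the same idea that should also cope with longer names, assuming the rule is simply "the airline serves both endpoints": collect every city that appears on a route the airline flies, then flag rows whose two cities are both in that set. The names `served` and `desired2` are only illustrative.

```r
airline <- c(0, 0, 1, 0, 0, 1)
city1 <- c("a", "a", "a", "b", "b", "c")
city2 <- c("b", "c", "d", "c", "d", "d")

# Cities that appear on at least one route the airline flies
served <- unique(c(city1[airline == 1], city2[airline == 1]))

# 1 if the airline serves both endpoints of the row's route, 0 otherwise
desired2 <- as.integer(city1 %in% served & city2 %in% served)
desired2
```

On the example data this gives the same vector as desired_output. Rows where airline is already 1 are flagged automatically, because both of their cities are in `served`.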

Upvotes: 0

gfgm

Reputation: 3647

Here is a template for getting to your desired output using igraph:

library(igraph)

airline<-c(0,0,1,0,0,1)
city1<-c('a','a','a','b','b','c')
city2<-c('b','c','d','c','d','d')
desired_output<-c(0,1,1,0,0,1)

mktdf<-data.frame(airline, city1, city2, desired_output)

# Build the graph from your actual connections (rows where airline == 1).
# directed = FALSE: I am assuming that connections are flights back AND FORTH.
# vertices: you need to provide the list of vertices if some cities are unconnected.
g <- graph_from_data_frame(mktdf[mktdf$airline==1, 2:3],
                           directed = FALSE,
                           vertices = letters[1:4])
plot(g)

Now we get the components -- basically chop it into the connected bit and the unconnected node. I'll do this by decomposing it into two graphs, but depending on where you are going with your analysis you may want the components() function instead:

comps <- decompose(g, min.vertices = 1)
comps
#> [[1]]
#> IGRAPH 8dfe807 UN-- 3 2 -- 
#> + attr: name (v/c)
#> + edges from 8dfe807 (vertex names):
#> [1] a--d c--d
#> 
#> [[2]]
#> IGRAPH 5bb31f9 UN-- 1 0 -- 
#> + attr: name (v/c)
#> + edges from 5bb31f9 (vertex names):

We have two graphs now. You want an indicator that is equal to 1 if city1 and city2 in your df are in the same component and zero otherwise:

as.numeric(mktdf$city1 %in% names(V(comps[[1]])) & 
           mktdf$city2 %in% names(V(comps[[1]])))
#> [1] 0 1 1 0 0 1

Hooray, that's the desired output.
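
For comparison, here is a rough sketch of the components() route mentioned earlier. It looks up the component id of each city and checks whether the two cities of a row share one, so no component has to be picked by hand. The names `routes`, `cities`, `memb` and `same_comp` are my own, and the vertex list is passed as a one-column data frame, which is an assumption about how you would set it up rather than a requirement.

```r
library(igraph)

airline <- c(0, 0, 1, 0, 0, 1)
city1 <- c("a", "a", "a", "b", "b", "c")
city2 <- c("b", "c", "d", "c", "d", "d")

# Edges: only the routes the airline actually flies
routes <- data.frame(from = city1[airline == 1],
                     to = city2[airline == 1],
                     stringsAsFactors = FALSE)
# Vertices: every city, so unserved cities become isolated nodes
cities <- data.frame(name = sort(unique(c(city1, city2))),
                     stringsAsFactors = FALSE)
g <- graph_from_data_frame(routes, directed = FALSE, vertices = cities)

# Component id of every city, indexed by city name
memb <- setNames(as.integer(components(g)$membership), V(g)$name)

# 1 if both cities of a row sit in the same component, 0 otherwise
same_comp <- as.integer(memb[city1] == memb[city2])
same_comp
```

An isolated city forms a component of its own, so any route touching it gets 0. Note that if the airline ran two disconnected networks, this would flag pairs inside each network but not pairs across them, which may or may not be what you want.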

In this example we knew which component was the one we were looking for by roughly eyeballing it. If you wanted to find that component among a list of components, you could check which component has your original edges in it:

lapply(comps, function(x){all(E(g) %in% E(x))})
#> [[1]]
#> [1] TRUE
#> 
#> [[2]]
#> [1] FALSE

Here we see that the first subgraph we found is the one we wanted. This might matter if you have lots and lots of components. Another approach would be to take the largest component.
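
A small sketch of the "largest component" idea, assuming the airline's network is the component with the most vertices. The names `sizes`, `biggest` and `in_biggest` are illustrative only.

```r
library(igraph)

airline <- c(0, 0, 1, 0, 0, 1)
city1 <- c("a", "a", "a", "b", "b", "c")
city2 <- c("b", "c", "d", "c", "d", "d")

routes <- data.frame(from = city1[airline == 1],
                     to = city2[airline == 1],
                     stringsAsFactors = FALSE)
cities <- data.frame(name = sort(unique(c(city1, city2))),
                     stringsAsFactors = FALSE)
g <- graph_from_data_frame(routes, directed = FALSE, vertices = cities)

comps <- decompose(g, min.vertices = 1)

# Vertex count of each component; which.max() picks the first one on ties
sizes <- sapply(comps, vcount)
biggest <- comps[[which.max(sizes)]]

# 1 if both cities of a row belong to the largest component, 0 otherwise
in_biggest <- as.integer(city1 %in% V(biggest)$name &
                         city2 %in% V(biggest)$name)
in_biggest
```

If two components tie for largest, only the first is used, so check `sizes` before relying on this.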

Upvotes: 1
