Reputation: 51

Data Concurrency within variable

I have data named as my_data. The amount of data is > 100000. Sample output is like below:

    id                      source
    8166923397733625478 happimobiles
    8166923397733625478 Springfit
    7301100145962413274 Duroflex
    6703062895304712434 happimobiles
    6897156268457025524 themrphone
    37564799155342281   Sangeetha Mobiles
    1159098248970201145 Sangeetha Mobiles

Link to entire data:

https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0 for which I want the code to run.

I want a output where every id should be counted under source and also under which that is repeated Ex ID ending with 5478 falls under both happimobiles and springfit. So happimobiles has id 8166923397733625478 and 6703062895304712434 which makes it 2 and 1 is common with springfit.

Output:

                   happimobiles   Springfit  Duroflex themrphone   Sangeetha    
happimobiles         2                1        0          0            0
Springfit            1                1        0          0            0
Duroflex             0                0        1          0            0  
themrphone           0                0        0          1            0
Sangeetha            0                0        0          0            2

I have also tried:

Pivot <- dcast(my_data,source~source,value.var = "id",function(x) length((x)))

which is giving me only unique records in specific partner correctly but not overlaps.

I also tried:

crossprod(table(my_data))

But this does not give correct answer.

Any other solution which can hep me get this kind of output

Upvotes: 0

Answers (4)

Uwe

Reputation: 42564

This can be solved by a self-join and subsequent reshaping:

library(data.table)
setDT(my_data)[, source := forcats::fct_inorder(source)]
my_data[my_data, on = "id"][
  , dcast(.SD, source ~ i.source, value.var = "id")]

Aggregate function missing, defaulting to 'length'
              source happimobiles Springfit Duroflex themrphone Sangeetha Mobiles
1:      happimobiles            2         1        0          0                 0
2:         Springfit            1         1        0          0                 0
3:          Duroflex            0         0        1          0                 0
4:        themrphone            0         0        0          1                 0
5: Sangeetha Mobiles            0         0        0          0                 2

The call to forcats::fct_inorder() is only required to reproduce OP's expected result where the columns are arranged in order of appearance instead of alphabetical order.

OP's production dataset with 10000 rows would produce a result of 85 columns when reshaped to wide format. Therefore, it might be more convenient to leave the data in long format:

library(data.table)
library(magrittr) # piping used to improve readability
googlesheets4::read_sheet(
  "https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0",
  col_types = "c") %>% 
  setDT() %>% 
  unique() %>% 
  .[., on = "FVID", allow.cartesian = TRUE] %>% 
  .[, .N, keyby = .(Partner, i.Partner)]

               Partner          i.Partner    N
 1:       AOSmithIndia       AOSmithIndia   21
 2:       Adityavision       Adityavision    2
 3:   BajajElectronics   BajajElectronics  128
 4:   BajajElectronics   SangeethaMobiles    1
 5:   BajajElectronics       happimobiles    3
 6:               BigC               BigC   27
 7:               BigC        BnewMobiles    1
 8:               BigC             Celekt    2
 9:               BigC         LotMobiles    2
10:               BigC   SangeethaMobiles   11
11:               BigC          Springfit    1
12:               BigC       happimobiles    3
13:           Bluestar           Bluestar   56
14:        BnewMobiles               BigC    1
15:        BnewMobiles        BnewMobiles   57
16:        BnewMobiles         LotMobiles    1
17:        BnewMobiles   SangeethaMobiles    2
18:             Celekt               BigC    2
19:             Celekt             Celekt   10
20:             Celekt         LotMobiles    1
21:             Celekt   SangeethaMobiles    1
22:             Celekt       happimobiles    1
23:     ChennaiMobiles     ChennaiMobiles   51
24:     ChennaiMobiles         LotMobiles    1
25:     ChennaiMobiles   SangeethaMobiles   12
26:     ChennaiMobiles            iPlanet    1
27:         Creaticity         Creaticity    7
28:           Duroflex           Duroflex   91
29:          EdunGruru          EdunGruru    2
30:       FirefoxBikes       FirefoxBikes   38
31:       Greateastern       Greateastern   12
32:                ICA                ICA   24
33:                 LG                 LG    2
34:         LotMobiles               BigC    2
35:         LotMobiles        BnewMobiles    1
36:         LotMobiles             Celekt    1
37:         LotMobiles     ChennaiMobiles    1
38:         LotMobiles         LotMobiles   30
39:         LotMobiles   SangeethaMobiles   14
40:         LotMobiles       happimobiles    3
41:  NarayanNethralaya  NarayanNethralaya    7
42:        NipponPaint        NipponPaint    4
43:              Onida              Onida   27
44:     PoorvikaMobile     PoorvikaMobile    2
45:           Richfeel           Richfeel   55
46:           Richfeel       VLCCWellness    1
47:           SKMobile           SKMobile    5
48:           SKMobile mobilewalavadodara    1
49:   SS_Communication   SS_Communication   11
50:   SS_Communication         themrphone    1
51:       SabkaDentist       SabkaDentist   48
52:   SangeethaMobiles   BajajElectronics    1
53:   SangeethaMobiles               BigC   11
54:   SangeethaMobiles        BnewMobiles    2
55:   SangeethaMobiles             Celekt    1
56:   SangeethaMobiles     ChennaiMobiles   12
57:   SangeethaMobiles         LotMobiles   14
58:   SangeethaMobiles   SangeethaMobiles 1060
59:   SangeethaMobiles       happimobiles   12
60:   SangeethaMobiles            iPlanet    1
61:          Springfit               BigC    1
62:          Springfit          Springfit  123
63:          Springfit       happimobiles    1
64:               TecQ               TecQ    2
65:           TheMaark           TheMaark   31
66:        TronxThings        TronxThings   14
67:      VLCCInstitute      VLCCInstitute   36
68:       VLCCWellness           Richfeel    1
69:       VLCCWellness       VLCCWellness   27
70:             VutsTV             VutsTV   12
71:       happimobiles   BajajElectronics    3
72:       happimobiles               BigC    3
73:       happimobiles             Celekt    1
74:       happimobiles         LotMobiles    3
75:       happimobiles   SangeethaMobiles   12
76:       happimobiles          Springfit    1
77:       happimobiles       happimobiles  474
78:            iPlanet     ChennaiMobiles    1
79:            iPlanet   SangeethaMobiles    1
80:            iPlanet            iPlanet   15
81: mobilewalavadodara           SKMobile    1
82: mobilewalavadodara mobilewalavadodara    3
83:        poojaratele        poojaratele   24
84:         themrphone   SS_Communication    1
85:         themrphone         themrphone   85
               Partner          i.Partner    N

The steps are as follows:

Read the production dataset from Google Docs
Coerce to data.table
Remove duplicate entries (assuming that this is in line with OP's intention)
Perform self-join
Count number of overlapping (as well as non-overlapping) FVIDs. Usage of keyby = requests the result to be ordered alphabetically.

Upvotes: 0

ThomasIsCoding

Reputation: 102181

I think you are already very close to what you are expecting when using crossprod(). Here is a base R solution, and you can try it

r <- crossprod(table(df)[,match(unique(df$source),colnames(table(df)))])

which gives

> r
                  source
source             happimobiles Springfit Duroflex themrphone SangeethaMobiles
  happimobiles                2         1        0          0                0
  Springfit                   1         1        0          0                0
  Duroflex                    0         0        1          0                0
  themrphone                  0         0        0          1                0
  SangeethaMobiles            0         0        0          0                2

Upvotes: 1

YOLO

Reputation: 21729

Not sure about your logic, but hope this gives you some idea to move forward:

library(data.table)

# set as data.table
setDT(df)

# get counts by source
df = df[,.N,source]

# create a duplicate column (a trick to make dcast work)
df$source2 <- df$source

# convert shape using dcast
dcast(df, source ~ source2, value.var="N", fun.aggregate=sum)

Upvotes: 2

Ronak Shah

Reputation: 389095

In base R, you can get the unique source in the dataframe, use lapply find the id's present and count their frequency using table.

unique_source <- unique(df$source)

cbind.data.frame(unique_source, do.call(rbind, lapply(unique_source, function(x) 
                table(df$source[df$id %in% df$id[df$source == x]]))))


#     unique_source Duroflex happimobiles SangeethaMobiles Springfit themrphone
#1     happimobiles        0            2                0         1          0
#2        Springfit        0            1                0         1          0
#3         Duroflex        1            0                0         0          0
#4       themrphone        0            0                0         0          1
#5 SangeethaMobiles        0            0                2         0          0

data

df <- structure(list(id = c(8166923397733625856, 8166923397733625856, 
7301100145962413056, 6703062895304712192, 6897156268457025536, 
37564799155342280, 1159098248970201088), source = structure(c(2L, 
4L, 1L, 2L, 5L, 3L, 3L), .Label = c("Duroflex", "happimobiles", 
"SangeethaMobiles", "Springfit", "themrphone"), class = "factor")), 
class = "data.frame", row.names = c(NA, -7L))

Upvotes: 1

Data Concurrency within variable

Answers (4)

Related Questions