Reputation: 51
I have data named as my_data
. The amount of data is > 100000. Sample output is like below:
id source
8166923397733625478 happimobiles
8166923397733625478 Springfit
7301100145962413274 Duroflex
6703062895304712434 happimobiles
6897156268457025524 themrphone
37564799155342281 Sangeetha Mobiles
1159098248970201145 Sangeetha Mobiles
Link to entire data:
https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0 for which I want the code to run.
I want a output where every id should be counted under source and also under which that is repeated
Ex ID ending with 5478 falls under both happimobiles
and springfit
. So happimobiles
has id 8166923397733625478 and 6703062895304712434 which makes it 2 and 1 is common with springfit
.
Output:
happimobiles Springfit Duroflex themrphone Sangeetha
happimobiles 2 1 0 0 0
Springfit 1 1 0 0 0
Duroflex 0 0 1 0 0
themrphone 0 0 0 1 0
Sangeetha 0 0 0 0 2
I have also tried:
Pivot <- dcast(my_data,source~source,value.var = "id",function(x) length((x)))
which is giving me only unique records in specific partner correctly but not overlaps.
I also tried:
crossprod(table(my_data))
But this does not give correct answer.
Any other solution which can hep me get this kind of output
Upvotes: 0
Views: 78
Reputation: 42564
This can be solved by a self-join and subsequent reshaping:
library(data.table)
setDT(my_data)[, source := forcats::fct_inorder(source)]
my_data[my_data, on = "id"][
, dcast(.SD, source ~ i.source, value.var = "id")]
Aggregate function missing, defaulting to 'length' source happimobiles Springfit Duroflex themrphone Sangeetha Mobiles 1: happimobiles 2 1 0 0 0 2: Springfit 1 1 0 0 0 3: Duroflex 0 0 1 0 0 4: themrphone 0 0 0 1 0 5: Sangeetha Mobiles 0 0 0 0 2
The call to forcats::fct_inorder()
is only required to reproduce OP's expected result where the columns are arranged in order of appearance instead of alphabetical order.
OP's production dataset with 10000 rows would produce a result of 85 columns when reshaped to wide format. Therefore, it might be more convenient to leave the data in long format:
library(data.table)
library(magrittr) # piping used to improve readability
googlesheets4::read_sheet(
"https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0",
col_types = "c") %>%
setDT() %>%
unique() %>%
.[., on = "FVID", allow.cartesian = TRUE] %>%
.[, .N, keyby = .(Partner, i.Partner)]
Partner i.Partner N 1: AOSmithIndia AOSmithIndia 21 2: Adityavision Adityavision 2 3: BajajElectronics BajajElectronics 128 4: BajajElectronics SangeethaMobiles 1 5: BajajElectronics happimobiles 3 6: BigC BigC 27 7: BigC BnewMobiles 1 8: BigC Celekt 2 9: BigC LotMobiles 2 10: BigC SangeethaMobiles 11 11: BigC Springfit 1 12: BigC happimobiles 3 13: Bluestar Bluestar 56 14: BnewMobiles BigC 1 15: BnewMobiles BnewMobiles 57 16: BnewMobiles LotMobiles 1 17: BnewMobiles SangeethaMobiles 2 18: Celekt BigC 2 19: Celekt Celekt 10 20: Celekt LotMobiles 1 21: Celekt SangeethaMobiles 1 22: Celekt happimobiles 1 23: ChennaiMobiles ChennaiMobiles 51 24: ChennaiMobiles LotMobiles 1 25: ChennaiMobiles SangeethaMobiles 12 26: ChennaiMobiles iPlanet 1 27: Creaticity Creaticity 7 28: Duroflex Duroflex 91 29: EdunGruru EdunGruru 2 30: FirefoxBikes FirefoxBikes 38 31: Greateastern Greateastern 12 32: ICA ICA 24 33: LG LG 2 34: LotMobiles BigC 2 35: LotMobiles BnewMobiles 1 36: LotMobiles Celekt 1 37: LotMobiles ChennaiMobiles 1 38: LotMobiles LotMobiles 30 39: LotMobiles SangeethaMobiles 14 40: LotMobiles happimobiles 3 41: NarayanNethralaya NarayanNethralaya 7 42: NipponPaint NipponPaint 4 43: Onida Onida 27 44: PoorvikaMobile PoorvikaMobile 2 45: Richfeel Richfeel 55 46: Richfeel VLCCWellness 1 47: SKMobile SKMobile 5 48: SKMobile mobilewalavadodara 1 49: SS_Communication SS_Communication 11 50: SS_Communication themrphone 1 51: SabkaDentist SabkaDentist 48 52: SangeethaMobiles BajajElectronics 1 53: SangeethaMobiles BigC 11 54: SangeethaMobiles BnewMobiles 2 55: SangeethaMobiles Celekt 1 56: SangeethaMobiles ChennaiMobiles 12 57: SangeethaMobiles LotMobiles 14 58: SangeethaMobiles SangeethaMobiles 1060 59: SangeethaMobiles happimobiles 12 60: SangeethaMobiles iPlanet 1 61: Springfit BigC 1 62: Springfit Springfit 123 63: Springfit happimobiles 1 64: TecQ TecQ 2 65: TheMaark TheMaark 31 66: TronxThings TronxThings 14 67: VLCCInstitute VLCCInstitute 36 68: VLCCWellness Richfeel 1 69: VLCCWellness VLCCWellness 27 70: VutsTV VutsTV 12 71: happimobiles BajajElectronics 3 72: happimobiles BigC 3 73: happimobiles Celekt 1 74: happimobiles LotMobiles 3 75: happimobiles SangeethaMobiles 12 76: happimobiles Springfit 1 77: happimobiles happimobiles 474 78: iPlanet ChennaiMobiles 1 79: iPlanet SangeethaMobiles 1 80: iPlanet iPlanet 15 81: mobilewalavadodara SKMobile 1 82: mobilewalavadodara mobilewalavadodara 3 83: poojaratele poojaratele 24 84: themrphone SS_Communication 1 85: themrphone themrphone 85 Partner i.Partner N
The steps are as follows:
data.table
FVID
s. Usage of keyby =
requests the result to be ordered alphabetically.Upvotes: 0
Reputation: 102181
I think you are already very close to what you are expecting when using crossprod()
. Here is a base R solution, and you can try it
r <- crossprod(table(df)[,match(unique(df$source),colnames(table(df)))])
which gives
> r
source
source happimobiles Springfit Duroflex themrphone SangeethaMobiles
happimobiles 2 1 0 0 0
Springfit 1 1 0 0 0
Duroflex 0 0 1 0 0
themrphone 0 0 0 1 0
SangeethaMobiles 0 0 0 0 2
Upvotes: 1
Reputation: 21729
Not sure about your logic, but hope this gives you some idea to move forward:
library(data.table)
# set as data.table
setDT(df)
# get counts by source
df = df[,.N,source]
# create a duplicate column (a trick to make dcast work)
df$source2 <- df$source
# convert shape using dcast
dcast(df, source ~ source2, value.var="N", fun.aggregate=sum)
Upvotes: 2
Reputation: 389095
In base R, you can get the unique source in the dataframe, use lapply
find the id
's present and count their frequency using table
.
unique_source <- unique(df$source)
cbind.data.frame(unique_source, do.call(rbind, lapply(unique_source, function(x)
table(df$source[df$id %in% df$id[df$source == x]]))))
# unique_source Duroflex happimobiles SangeethaMobiles Springfit themrphone
#1 happimobiles 0 2 0 1 0
#2 Springfit 0 1 0 1 0
#3 Duroflex 1 0 0 0 0
#4 themrphone 0 0 0 0 1
#5 SangeethaMobiles 0 0 2 0 0
data
df <- structure(list(id = c(8166923397733625856, 8166923397733625856,
7301100145962413056, 6703062895304712192, 6897156268457025536,
37564799155342280, 1159098248970201088), source = structure(c(2L,
4L, 1L, 2L, 5L, 3L, 3L), .Label = c("Duroflex", "happimobiles",
"SangeethaMobiles", "Springfit", "themrphone"), class = "factor")),
class = "data.frame", row.names = c(NA, -7L))
Upvotes: 1