Reputation: 141
Folllow up on question Searching for unique values in dataframe and creating a table with them
Here is how my data looks like
UUID Source
1 Jane http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2 Mike http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3 John http//mywebsite.com44bb00?utm_source=Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4 Sarah http//mywebsite.com44bb00?utm_source=Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5 Michael http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6 Bob http//mywebsite.com44bb00?utm_source=ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7 Mark http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8 Anna http//mywebsite.com44bb00?utm_source=Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234
And here is the desired output of what I am trying to achieve
NAME UTM_SOURCE UTM_MEDIUM UTM_CAMPAIGN
1 Jane ADW banner Monk
2 Mike Google cpc DOG
3 John Yahoo banner DOG
4 Sarah Faceboo cpc CAT
5 Michael Twitter GDN CAT
6 Bob ADW GDN DOG
7 Mark Twitter banner MONK
8 Anna Facebook banner MONK
So in other words what I want is to obtain a specific piece of information based on a criteria. Example: search in the dataframe for the value "utmsource=" and when found, copy whatever information is found between the "=" and "&" signs. In the case of User no1 (Jame) if you look in the original file, her Source URL contains the value "utm_source=ADW". In the output file, the "ADW" bit is extracted and imputed in a new column named "utm_source". Same principle for all other users and other dimmensions (utm_medium & utm_campaign)
I understand that the function gsub
can help me. Here is what I have tried so far:
> file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv")
> file1 <- transform(file1, Source = as.character(Source))
> file2 <- gsub(".*\\?utm_source=", "", file1$Source)
And this is the result I got
UUID SOURCE
1 ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2 Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3 Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4 Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5 Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6 ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7 Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8 Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234
I have 2 questions about this:
1) In the output that I got, the function copied everything that followed the value "utm_source-" . How do I add another dimension to make the formula copy only what is between "=" and "&"
2) How do i keep the values that were initially in the first column (UUID) , Jane, Mike, John, etc?
Upvotes: 0
Views: 1680
Reputation: 141
Here is how I did it
> file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv")
> file1 <- transform(file1, Source = as.character(Source))
> z <- matrix(
unlist(strsplit(gsub(".*\\?", "", file1$Source), "\\&")),
ncol=4, byrow=TRUE)
> file2 <- cbind(file1$UUID, gsub(".*=", "", z))
This is the result that I got
V1 V2 V3 V4 V5
1 3 ADW banner Monk gclid1234
2 7 Google cpc DOG gclid1234
3 4 Yahoo banner DOG gclid1234
4 8 Facebookdw cpc CAT gclid1234
5 6 Twitter GDNr CAT gclid1234
6 2 ADW GDN DOG gclid1234
7 5 Twitter banner MONK gclid1234
8 1 Facebook banner MONK gclid1234
I need to point out that my real data will have 500.000 rows and in the first column there will be a unique designation.
How do I fix this to have the names show up in V1? What was my mistake?
Upvotes: 1
Reputation: 179438
You need to do two things:
gsub
to strip the website name from your Sourcestrsplit
to separate the remaining string at each occurrence of ?
Read in the data:
x <- read.table(text="
UUID Source
1 Jane http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2 Mike http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3 John http//mywebsite.com44bb00?utm_source=Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4 Sarah http//mywebsite.com44bb00?utm_source=Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5 Michael http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6 Bob http//mywebsite.com44bb00?utm_source=ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7 Mark http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8 Anna http//mywebsite.com44bb00?utm_source=Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234", header=TRUE, stringsAsFactors=FALSE)
Use strsplit
to separate the Source string at each ?
:
z <- matrix(
unlist(strsplit(gsub(".*\\?", "", x$Source), "\\&")),
ncol=4, byrow=TRUE)
cbind(x$UUID, gsub(".*=", "", z))
[,1] [,2] [,3] [,4] [,5]
[1,] "Jane" "ADW" "banner" "Monk" "gclid1234"
[2,] "Mike" "Google" "cpc" "DOG" "gclid1234"
[3,] "John" "Yahoo" "banner" "DOG" "gclid1234"
[4,] "Sarah" "Facebookdw" "cpc" "CAT" "gclid1234"
[5,] "Michael" "Twitter" "GDNr" "CAT" "gclid1234"
[6,] "Bob" "ADW" "GDN" "DOG" "gclid1234"
[7,] "Mark" "Twitter" "banner" "MONK" "gclid1234"
[8,] "Anna" "Facebook" "banner" "MONK" "gclid1234"
And then convert to a data frame and add names:
z <- matrix(
unlist(strsplit(gsub(".*\\?", "", x$Source), "\\&")),
ncol=4, byrow=TRUE)
z <- cbind(x$UUID, gsub(".*=", "", z))
z <- as.data.frame(z[, -5])
names(z) <- c("UUID", "UTM_SOURCE", "UTM_MEDIUM", "UTM_CAMPAIGN")
z
UUID UTM_SOURCE UTM_MEDIUM UTM_CAMPAIGN
1 Jane ADW banner Monk
2 Mike Google cpc DOG
3 John Yahoo banner DOG
4 Sarah Facebookdw cpc CAT
5 Michael Twitter GDNr CAT
6 Bob ADW GDN DOG
7 Mark Twitter banner MONK
8 Anna Facebook banner MONK
Upvotes: 1