Reputation: 63
Let's assume that I have the following column in my dataframe df$characteristics, which has the following format:
sex,male,country,GER,age,25,date_of_birth,1994-04-04
sex,female,age,23,date_of_birth,1996-04-04
country,GBR,age,45,date_of_birth,1974-04-04
country,RUS,date_of_birth,1978-10-04
Essentially, every odd word is the column name and every even word is the value of that column. What I would like to do is to create 4 different columns: sex, country, age and date of birth and fill those columns where possible with respective values (some values will be empty). How can I do it in R?
Upvotes: 0
Views: 56
Reputation: 269526
Using characteristics
shown reproducibly in the Note at the end, replace comma string comma or comma string end of line with colon string newline. The data is now in dcf format so use read.dcf
to read it in. Note that we replace each empty element of the input with Z,0
as dcf format requires at least one element. At the end we remove the Z
column.
dcf <- gsub(",(.*?)(,|$)", ":\\1\n",
replace(characteristics, characteristics == "", "Z,0"))
m <- read.dcf(textConnection(dcf))
m <- m[, colnames(m) != "Z"]
m
giving this character matrix:
sex country age date_of_birth
[1,] "male" "GER" "25" "1994-04-04"
[2,] "female" NA "23" "1996-04-04"
[3,] NA "GBR" "45" "1974-04-04"
[4,] NA NA NA NA
[5,] NA "RUS" NA "1978-10-04"
Note that we have added an empty row to illustrate that that is handled.
characteristics <- c("sex,male,country,GER,age,25,date_of_birth,1994-04-04",
"sex,female,age,23,date_of_birth,1996-04-04",
"country,GBR,age,45,date_of_birth,1974-04-04",
"",
"country,RUS,date_of_birth,1978-10-04")
Upvotes: 1