Banker
Banker

Reputation: 63

Parse a column that contains both column names and values for those columns

Let's assume that I have the following column in my dataframe df$characteristics, which has the following format:

sex,male,country,GER,age,25,date_of_birth,1994-04-04

sex,female,age,23,date_of_birth,1996-04-04

country,GBR,age,45,date_of_birth,1974-04-04

country,RUS,date_of_birth,1978-10-04

Essentially, every odd word is the column name and every even word is the value of that column. What I would like to do is to create 4 different columns: sex, country, age and date of birth and fill those columns where possible with respective values (some values will be empty). How can I do it in R?

Upvotes: 0

Views: 56

Answers (1)

G. Grothendieck
G. Grothendieck

Reputation: 269526

Using characteristics shown reproducibly in the Note at the end, replace comma string comma or comma string end of line with colon string newline. The data is now in dcf format so use read.dcf to read it in. Note that we replace each empty element of the input with Z,0 as dcf format requires at least one element. At the end we remove the Z column.

dcf <- gsub(",(.*?)(,|$)", ":\\1\n", 
  replace(characteristics, characteristics == "", "Z,0"))
m <- read.dcf(textConnection(dcf))
m <- m[, colnames(m) != "Z"]
m

giving this character matrix:

     sex      country age  date_of_birth
[1,] "male"   "GER"   "25" "1994-04-04" 
[2,] "female" NA      "23" "1996-04-04" 
[3,] NA       "GBR"   "45" "1974-04-04" 
[4,] NA       NA      NA   NA           
[5,] NA       "RUS"   NA   "1978-10-04" 

Note

Note that we have added an empty row to illustrate that that is handled.

characteristics <- c("sex,male,country,GER,age,25,date_of_birth,1994-04-04", 
    "sex,female,age,23,date_of_birth,1996-04-04", 
    "country,GBR,age,45,date_of_birth,1974-04-04", 
    "",
    "country,RUS,date_of_birth,1978-10-04")

Upvotes: 1

Related Questions