webNash
webNash

Reputation: 301

Help needed in Data cleaning using R

   "id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,

I need to reformat as follows.

id gender age category            rank
1 Male    22  movies               1
1 Male    22  music                2
1 Male    22  travel               3
1 Male    22  cloths               4
1 Male    22  grocery              5
1 Male    22  books                NA
1 Male    22  rent                 NA
1 Male    22  fuel                 NA
1 Male    22  utility              NA
1 Male    22  online-shopping      NA

So far my efforts are as follows.

mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(df_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by  "V1"')

Now I want to know what is the best way to fill all missing categories for each user. Any help in this regard is appreciated.

Upvotes: 1

Views: 134

Answers (2)

Parfait
Parfait

Reputation: 107767

Consider using the base function reshape as this is the regular example of wide to long dataset reshaping/pivoting:

reshapedf <- reshape(df, varying = c(4:13), 
                     v.names = c("category"),
                     timevar=c("rank"), 
                     times = c(1:10),
                     idvar = c("id", "gender", "age"), 
                     new.row.names = 1:1000,
                     direction = "long")

# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)

OUTPUT

        id      gender      age     rank    category
1       1       Male        22      1       movies
2       1       Male        22      2       music
3       1       Male        22      3       travel
4       1       Male        22      4       cloths
5       1       Male        22      5       grocery
6       1       Male        22      6       NA
7       1       Male        22      7       NA
8       1       Male        22      8       NA
9       1       Male        22      9       NA
10      1       Male        22      10      NA
...

Upvotes: 1

Pierre L
Pierre L

Reputation: 28461

library(reshape2)
library(tidyr)

mdf <- melt(df, c("id","gender","age"))
complete(na.omit(mdf), c(id, gender, age), value)
# Source: local data frame [50 x 5]
# 
# id gender   age           value  variable
# (int) (fctr) (int)           (chr)    (fctr)
# 1      1   Male    22           books        NA
# 2      1   Male    22          cloths category4
# 3      1   Male    22            fuel        NA
# 4      1   Male    22         grocery category5
# 5      1   Male    22          movies category1
# 6      1   Male    22           music category2
# 7      1   Male    22 online-shopping        NA
# 8      1   Male    22            rent        NA
# 9      1   Male    22          travel category3
# 10     1   Male    22          utiliy        NA
# ..   ...    ...   ...             ...       ...

Explanation

We can first melt the data.frame specifying the id columns. Next, the new release of tidyr has a helper function complete to expand columns as your output describes.

Data

df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df== ""

Upvotes: 2

Related Questions