DeltaIV
DeltaIV

Reputation: 5656

Create multiple columns from a single column and clean up results

I have a data frame like this:

foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
               Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))

I want to add three columns, by splitting the first column with separator _, and retain only the numeric values obtained after the split. For those rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but that's not enough:

> library(tidyr) 
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
                            Point.Type    rpm_nom GVF_nom p0in_nom Point.Value
1                           Zero Start Zero Start    <NA>     <NA>          NA
2                           Zero Start Zero Start    <NA>     <NA>          NA
3                           Zero Start Zero Start    <NA>     <NA>          NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG   -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG    1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG    0.270126
7                            Zero Stop  Zero Stop    <NA>     <NA>          NA
8                           Zero Start Zero Start    <NA>     <NA>          NA

I'm not sure why my data frame contains now two apparently different kinds of NA, but is.na seems to like them both, so I can live with that. However, I have two kind of problems:

  1. the new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, barG. How do I get rid of those?
  2. when Point.Type can't be split, rpm_nom should be NA, instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which one of the new columns get the Zero Start/Zero Stop. Instead I want all three of them to be NA. How can I do that?

NOTE: I'm using tidyr, but of course you don't need to, if you think there's a better way to do this.

Upvotes: 2

Views: 115

Answers (2)

Jaap
Jaap

Reputation: 83275

You can post-process the columns with dplyr:

library(dplyr)
foo <- foo %>%
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), 
           sep="_", remove = FALSE, extra="drop", fill="right") %>%
  mutate_each(funs(as.numeric(gsub("[^0-9]","",.))), rpm_nom, GVF_nom, p0in_nom)

The gsub("[^0-9]","",.)-part removes all non-numeric characters. If you want to prevent the removal of decimal points, you can use [^0-9.] instead of [^0-9] (like @PierreLafortune used in his answer), but be aware that this will also include points that are not meant to be decimal points. By wrapping it in as.numeric, you convert them to numeric values while at the same time transforming the empty cells to NA. This gives the following result:

> foo
                            Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1                           Zero Start      NA      NA       NA          NA
2                           Zero Start      NA      NA       NA          NA
3                           Zero Start      NA      NA       NA          NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13  -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13  -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13   0.9685555
7                            Zero Stop      NA      NA       NA          NA
8                           Zero Start      NA      NA       NA          NA

Or using data.table (as contributed by @DavidArenburg in the comments):

library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") := 
             lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
                    function(x) as.numeric(gsub("[^0-9]","",x)))
           ]

will give a similar result:

> foo
                             Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1:                           Zero Start          NA      NA      NA       NA
2:                           Zero Start          NA      NA      NA       NA
3:                           Zero Start          NA      NA      NA       NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445    3000      10       13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww  1.18581340    3000      10       13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww  2.14475950    3000      10       13
7:                            Zero Stop          NA      NA      NA       NA
8:                           Zero Start          NA      NA      NA       NA

The advantage of this is that foo is updated by reference. As this is faster and more memory efficient, this is especially valuable for using with large datasets.

Upvotes: 2

Pierre L
Pierre L

Reputation: 28461

With base R we can first coerce NA values where necessary and coerce class numeric:

bar[-1] <- lapply(bar[-1], function(x) {
  is.na(x) <- grepl("Zero", x)
  as.numeric(gsub("[^0-9.]", "", x))})
#                             Point.Type rpm_nom GVF_nom p0in_nom Point.Value
# 1                           Zero Start      NA      NA       NA          NA
# 2                           Zero Start      NA      NA       NA          NA
# 3                           Zero Start      NA      NA       NA          NA
# 4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13   0.3558397
# 5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13   1.1454829
# 6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13   0.2958815
# 7                            Zero Stop      NA      NA       NA          NA
# 8                           Zero Start      NA      NA       NA          NA

To reduce to one line (@Jaap):

bar[-1] <- lapply(bar[-1], function(x) as.numeric(gsub("[^0-9.]", "", x)))

Upvotes: 1

Related Questions