Reputation: 5656
I have a data frame like this:
foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))
I want to add three columns, by splitting the first column with separator _
, and retain only the numeric values obtained after the split. For those rows where the first column doesn't contain any _
, the three new columns should be NA
. I got somewhat close using separate
, but that's not enough:
> library(tidyr)
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start Zero Start <NA> <NA> NA
2 Zero Start Zero Start <NA> <NA> NA
3 Zero Start Zero Start <NA> <NA> NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 0.270126
7 Zero Stop Zero Stop <NA> <NA> NA
8 Zero Start Zero Start <NA> <NA> NA
I'm not sure why my data frame contains now two apparently different kinds of NA
, but is.na
seems to like them both, so I can live with that. However, I have two kind of problems:
numeric
, and possibly integer
. Instead they're character
, because of the trailing rpm
, %
, barG
. How do I get rid of those?Point.Type
can't be split, rpm_nom
should be NA
, instead it becomes Zero Start
or Zero Stop
. Changing the fill=
option only changes which one of the new columns get the Zero Start
/Zero Stop
. Instead I want all three of them to be NA
. How can I do that?NOTE: I'm using tidyr
, but of course you don't need to, if you think there's a better way to do this.
Upvotes: 2
Views: 115
Reputation: 83275
You can post-process the columns with dplyr:
library(dplyr)
foo <- foo %>%
separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
sep="_", remove = FALSE, extra="drop", fill="right") %>%
mutate_each(funs(as.numeric(gsub("[^0-9]","",.))), rpm_nom, GVF_nom, p0in_nom)
The gsub("[^0-9]","",.)
-part removes all non-numeric characters. If you want to prevent the removal of decimal points, you can use [^0-9.]
instead of [^0-9]
(like @PierreLafortune used in his answer), but be aware that this will also include points that are not meant to be decimal points. By wrapping it in as.numeric
, you convert them to numeric values while at the same time transforming the empty cells to NA
. This gives the following result:
> foo
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start NA NA NA NA
2 Zero Start NA NA NA NA
3 Zero Start NA NA NA NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.9685555
7 Zero Stop NA NA NA NA
8 Zero Start NA NA NA NA
Or using data.table (as contributed by @DavidArenburg in the comments):
library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") :=
lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
function(x) as.numeric(gsub("[^0-9]","",x)))
]
will give a similar result:
> foo
Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1: Zero Start NA NA NA NA
2: Zero Start NA NA NA NA
3: Zero Start NA NA NA NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445 3000 10 13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 1.18581340 3000 10 13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 2.14475950 3000 10 13
7: Zero Stop NA NA NA NA
8: Zero Start NA NA NA NA
The advantage of this is that foo
is updated by reference. As this is faster and more memory efficient, this is especially valuable for using with large datasets.
Upvotes: 2
Reputation: 28461
With base R
we can first coerce NA
values where necessary and coerce class numeric
:
bar[-1] <- lapply(bar[-1], function(x) {
is.na(x) <- grepl("Zero", x)
as.numeric(gsub("[^0-9.]", "", x))})
# Point.Type rpm_nom GVF_nom p0in_nom Point.Value
# 1 Zero Start NA NA NA NA
# 2 Zero Start NA NA NA NA
# 3 Zero Start NA NA NA NA
# 4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.3558397
# 5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 1.1454829
# 6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.2958815
# 7 Zero Stop NA NA NA NA
# 8 Zero Start NA NA NA NA
To reduce to one line (@Jaap):
bar[-1] <- lapply(bar[-1], function(x) as.numeric(gsub("[^0-9.]", "", x)))
Upvotes: 1