Reputation: 100
First off, I am learning to use dplyr after having used base-r for most of my career (not really a data analyst, but trying to learn). I don't know if dplyr is the best option for this, or if I should use something else.
I have a data file generated by a piece of equipment that is very messy. There are header/tombstone data embedded within the data (time/date/location/sensor data for a specific location between rows of data for that location). The files are relatively large (150,000 observations x 14 variables), and I have successfully used dplyr to separate the actual data from the tombstone data (tombstone data has 6 rows of information spread over the 14 columns).
I am trying to create a single row of the tombstone information to append to the actual measurements so that it can be easily readable in R for analysis without relying on a "blackbox" solution from the manufacturer.
a sample of the data file and my script is provided below:
# Read csv file of data into R
data <- read_csv("data.csv", col_names = FALSE)
data
# A tibble: 155,538 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 80.00 19.00 0.00 37.0 1.0 0.0 3.00 NA NA NA NA NA NA
2 1.4e+01 8.00 6.00 13.00 43.0 9.0 33.0 50.00 1.00 -1.60 -2.00 50.10 14.88 NA
3 5.9e-01 5.15 2.02 -0.57 0.0 0.0 0.0 0.00 24.58 28.02 25.64 25.37 NA NA
4 0.0e+00 0.00 0.00 0.00 0.0 NA NA NA NA NA NA NA NA NA
5 3.0e+04 30000.00 -32768.00 -32768.00 0.0 NA NA NA NA NA NA NA NA NA
6 0.0e+00 0.00 0.00 0.00 0.0 0.0 0.0 0.25 20.30 NA NA NA NA NA
7 3.7e+01 cm BT counts 1.0 0.1 NA NA NA NA NA NA NA NA
8 NA 0.25 13.30 145.46 7.5 -11.0 2.1 0.80 157.00 149.00 158.00 143.00 100.00 2147483647
9 NA 0.35 13.37 144.54 7.8 -10.9 2.4 -0.40 153.00 150.00 148.00 146.00 100.00 2147483647
10 NA 0.45 14.49 144.65 8.4 -11.8 1.8 -0.90 139.00 156.00 151.00 152.00 100.00 2147483647
# ... with 155,528 more rows
# Get header information from file and create index(ens) of header information to later append header data to each line of measured data
header <- data %>%
filter(!is.na(data[,1])) %>%
mutate_all(as.character) %>%
mutate(ens = rep(1:(nrow(header)/6), each = 6)) %>%
group_by(ens)
n.head <- bind_cols(header[header$ens == 1,][1,], header[header$ens == 1,][2,], header[header$ens == 1,][3,], header[header$ens == 1,][4,], header[header$ens == 1,][5,], header[header$ens == 1,][6,])
Rows 2:7 have the information I am trying to work with, I know that creating a row of 90+ variables is not ideal, but this is a first step in cleaning this data up so that I can then work with it.
the last row with n.head
is what I am hoping to end up with, without needing to write a loop to run that ~20,000 times... Any help would be appreciated, thank you in advance for input!
Upvotes: 1
Views: 568
Reputation: 3092
The trick here is to use tidy::spread()
and tibble::enframe
to get the header columns spread out into a single row data frame.
library(tidyverse)
header <- data[2:7] %>%
# convert the data frame to a vector
t %>%
as.vector %>%
# then change it back into a single row data frame that's in long format
enframe %>%
# then push that back into a wide format, ie. 1 row and a bajillion columns
spread(name, value)
# replicate the row as many times as you have data
header[2:nrow(actualdata,] <- header
#use bind_cols() to glue your header rows onto each row of the actual data
actualdata <- data[7:nrow(data),] %>%
bind_cols(foo)
Upvotes: 2