Reputation: 129
I have the following tab-separated data. I want to read it as a tibble
or data.frame`. How do I read this type of data?
#
# Data-value qualification codes included in this output:
#
#
agency_cd site_no datetime 3423_00060_00003 3423_00060_00003_cd
5s 15s 20d 14n 10s
USGS 07055875 2021-09-02 2.42 P
USGS 07055875 2021-09-03 2.19 P
USGS 07055875 2021-09-04 2.18 P
USGS 07055875 2021-09-05
USGS 07055875 2021-09-06
USGS 07055875 2021-09-07 5.50 P
#
# Data-value qualification codes included in this output:
#
agency_cd site_no datetime 3423_00060_00003 3423_00060_00003_cd
5s 15s 20d 14n 10s
USGS 07056000 2020-09-08 277 A
USGS 07056000 2020-09-09 231 A
Expected output:
agency_cd site_no datetime `3557_00065_00003` `3557_00065_00003_cd`
<chr> <chr> <date> <dbl> <chr>
1 USGS 07074850 1938-01-01 23.0 A
2 USGS 07074850 1938-01-02 22.2 A
Upvotes: 2
Views: 326
Reputation: 887941
It may be better to read with readLines
i.e. after reading the file with readLines
, use grep
to detect lines that are not having #
or digits (\\d+
) at the start (^
) of the lines, then use split
to split the lines into a list
, based on the 'agency_id' substring, read with read.table
while looping over the list
with map
, and do the conversion of Date
column
library(purrr)
library(dplyr)
map_dfr(split(lines2, cumsum(grepl('agency_cd', lines2))),
~ read.table(text = .x, header = TRUE, fill = TRUE,
check.names = FALSE)) %>%
type.convert(as.is = TRUE) %>%
mutate(datetime = as.Date(datetime)) %>%
as_tibble
-output
# A tibble: 8 x 5
agency_cd site_no datetime `3423_00060_00003` `3423_00060_00003_cd`
<chr> <int> <date> <dbl> <chr>
1 USGS 7055875 2021-09-02 2.42 "P"
2 USGS 7055875 2021-09-03 2.19 "P"
3 USGS 7055875 2021-09-04 2.18 "P"
4 USGS 7055875 2021-09-05 NA ""
5 USGS 7055875 2021-09-06 NA ""
6 USGS 7055875 2021-09-07 5.5 "P"
7 USGS 7056000 2020-09-08 277 "A"
8 USGS 7056000 2020-09-09 231 "A"
lines <- readLines('file.txt')
# or directly read it
lines <- readLines(textConnection("#
# Data-value qualification codes included in this output:
#
#
agency_cd site_no datetime 3423_00060_00003 3423_00060_00003_cd
5s 15s 20d 14n 10s
USGS 07055875 2021-09-02 2.42 P
USGS 07055875 2021-09-03 2.19 P
USGS 07055875 2021-09-04 2.18 P
USGS 07055875 2021-09-05
USGS 07055875 2021-09-06
USGS 07055875 2021-09-07 5.50 P
#
# Data-value qualification codes included in this output:
#
agency_cd site_no datetime 3423_00060_00003 3423_00060_00003_cd
5s 15s 20d 14n 10s
USGS 07056000 2020-09-08 277 A
USGS 07056000 2020-09-09 231 A"))
lines2 <- lines[!grepl("^(#|\\d+)", lines)]
Upvotes: 2
Reputation: 749
The read_table()
function from the readr package is built to work with whitespace separated data. Here's a very simple solution that stays completely within the tidyverse suite of packages:
require(tidyverse)
fpath <- "/Your/Filepath/Here"
df_in <- read_table(file = fpath,
col_types = "ccDdc",
skip = 4) %>%
filter(!grepl("^(\\#|\\d+)", agency_cd),
!agency_cd == "agency_cd")
Upvotes: 0
Reputation: 79276
If your file is named test
then we could use read.csv
, and then data wrangling with dplyr
and for date lubridate
:
read.csv("~/Data exploration/test", sep="") %>%
slice(-1) %>%
as_tibble() %>%
type.convert(as.is =TRUE) %>%
mutate(datetime = lubridate::ymd(datetime))
output:
agency_cd site_no datetime X3423_00060_00003 X3423_00060_00003_cd
<chr> <int> <date> <dbl> <chr>
1 USGS 7055875 2021-09-02 2.42 P
2 USGS 7055875 2021-09-03 2.19 P
3 USGS 7055875 2021-09-04 2.18 P
4 USGS 7055875 2021-09-05 7.16 P
5 USGS 7055875 2021-09-06 9.28 P
6 USGS 7055875 2021-09-07 5.5 P
Upvotes: 2