Reputation: 129

read tab-separated data into to R

I have the following tab-separated data. I want to read it as a tibble or data.frame`. How do I read this type of data?

#
# Data-value qualification codes included in this output:
# 
# 
agency_cd   site_no datetime    3423_00060_00003    3423_00060_00003_cd
5s  15s 20d 14n 10s
USGS    07055875    2021-09-02  2.42    P
USGS    07055875    2021-09-03  2.19    P
USGS    07055875    2021-09-04  2.18    P
USGS    07055875    2021-09-05      
USGS    07055875    2021-09-06      
USGS    07055875    2021-09-07  5.50    P
#
# Data-value qualification codes included in this output:
# 
agency_cd   site_no datetime    3423_00060_00003    3423_00060_00003_cd
5s  15s 20d 14n 10s
USGS    07056000    2020-09-08  277 A
USGS    07056000    2020-09-09  231 A

Expected output:

   agency_cd site_no  datetime   `3557_00065_00003` `3557_00065_00003_cd`
   <chr>     <chr>    <date>                  <dbl> <chr>                
 1 USGS      07074850 1938-01-01               23.0 A                    
 2 USGS      07074850 1938-01-02               22.2 A

Upvotes: 2

Answers (3)

akrun

Reputation: 887941

It may be better to read with readLines i.e. after reading the file with readLines, use grep to detect lines that are not having # or digits (\\d+) at the start (^) of the lines, then use split to split the lines into a list, based on the 'agency_id' substring, read with read.table while looping over the list with map, and do the conversion of Date column

library(purrr)
library(dplyr)
map_dfr(split(lines2, cumsum(grepl('agency_cd', lines2))), 
     ~ read.table(text = .x, header = TRUE, fill = TRUE, 
       check.names = FALSE)) %>% 
    type.convert(as.is = TRUE) %>%
    mutate(datetime = as.Date(datetime)) %>%
    as_tibble

-output

# A tibble: 8 x 5
  agency_cd site_no datetime   `3423_00060_00003` `3423_00060_00003_cd`
  <chr>       <int> <date>                  <dbl> <chr>                
1 USGS      7055875 2021-09-02               2.42 "P"                  
2 USGS      7055875 2021-09-03               2.19 "P"                  
3 USGS      7055875 2021-09-04               2.18 "P"                  
4 USGS      7055875 2021-09-05              NA    ""                   
5 USGS      7055875 2021-09-06              NA    ""                   
6 USGS      7055875 2021-09-07               5.5  "P"                  
7 USGS      7056000 2020-09-08             277    "A"                  
8 USGS      7056000 2020-09-09             231    "A"

data

lines <- readLines('file.txt')
# or directly read it
lines <- readLines(textConnection("#
# Data-value qualification codes included in this output:
# 
# 
agency_cd   site_no datetime    3423_00060_00003    3423_00060_00003_cd
5s  15s 20d 14n 10s
USGS    07055875    2021-09-02  2.42    P
USGS    07055875    2021-09-03  2.19    P
USGS    07055875    2021-09-04  2.18    P
USGS    07055875    2021-09-05      
USGS    07055875    2021-09-06      
USGS    07055875    2021-09-07  5.50    P
#
# Data-value qualification codes included in this output:
# 
agency_cd   site_no datetime    3423_00060_00003    3423_00060_00003_cd
5s  15s 20d 14n 10s
USGS    07056000    2020-09-08  277 A
USGS    07056000    2020-09-09  231 A"))


lines2 <- lines[!grepl("^(#|\\d+)", lines)]

Upvotes: 2

icj

Reputation: 749

The read_table() function from the readr package is built to work with whitespace separated data. Here's a very simple solution that stays completely within the tidyverse suite of packages:

require(tidyverse)
fpath <- "/Your/Filepath/Here"

df_in <- read_table(file = fpath,
                    col_types = "ccDdc",
                    skip = 4) %>%
  filter(!grepl("^(\\#|\\d+)", agency_cd),
         !agency_cd == "agency_cd")

Upvotes: 0

TarJae

Reputation: 79276

If your file is named test then we could use read.csv, and then data wrangling with dplyr and for date lubridate:

read.csv("~/Data exploration/test", sep="") %>% 
    slice(-1) %>% 
    as_tibble() %>% 
    type.convert(as.is =TRUE) %>% 
    mutate(datetime = lubridate::ymd(datetime))

output:

  agency_cd site_no datetime   X3423_00060_00003 X3423_00060_00003_cd
  <chr>       <int> <date>                 <dbl> <chr>               
1 USGS      7055875 2021-09-02              2.42 P                   
2 USGS      7055875 2021-09-03              2.19 P                   
3 USGS      7055875 2021-09-04              2.18 P                   
4 USGS      7055875 2021-09-05              7.16 P                   
5 USGS      7055875 2021-09-06              9.28 P                   
6 USGS      7055875 2021-09-07              5.5  P

Upvotes: 2

read tab-separated data into to R

Answers (3)

data

Related Questions