Pascoe

Reputation: 177

R system call to awk fails

I have a log file, let's call it mylogfile.txt

Format is date-timestamp, then semicolon delimiter, then some other stuff that I am, for the purposes of this exercise, unconcerned with.

eg (this is all one line in the log file - not sure how to present as such in SO so apologies)

20170710-23:59:43.158;[email protected]@1000000.0@20170710-21:15:53.23@@2017071023:59:43.158@@T@20170710-23:59:43.156#[email protected]@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#[email protected]@1000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#[email protected]@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#

What I am currently attempting is simply a proof of concept example. I wish to parse the file, reverse the row order, and return two columns in the output -

1) Just the timestamp parsed from column one (which is a date-time format so I need to discard the date portion)

2) That timestamp expressed in seconds since midnight, to millisecond precision (in line with the granularity of the timestamps themselves).

So from the single-line example above, the output would be, e.g.

23:59:43.158,86383.158
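As a quick check of that expected value, the arithmetic is 23 × 3600 + 59 × 60 + 43.158 = 86383.158. A minimal shell sketch, assuming only that some awk is on the PATH (the variable names `ts` and `secs` are illustrative, not from the question):

```shell
# Worked check of the example: 23 h, 59 min and 43.158 s since midnight.
ts="23:59:43.158"

# Split on ":" and format the sum explicitly to three decimals.
secs=$(echo "$ts" | awk -F: '{ printf "%.3f", 3600*$1 + 60*$2 + $3 }')

echo "$ts,$secs"
```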

I can get halfway there. I can construct a call to awk using syntax which works perfectly well within cygwin (stripped of the R wrapper naturally). But it doesn't work within R

testawk<-paste0("tac ", mylogfile.txt, " | awk 'BEGIN {FS=\"-|;|:\"} {OMFT=\"%.3f\"} {print $2 \":\" $3 \":\" $4 \",\" (3600*$2)+(60*$3)+$4}' ")

getawk<-as.data.frame(system(testawk, intern=TRUE, show.output.on.console = FALSE))

However what ends up in the data frame getawk is simply the raw log file churning through as it's being read. Plus I get the warning message that running command had status 1.

HOWEVER

if I strip out the 'tac' piece and just use straight awk, thus:

testawk<-paste0("awk 'BEGIN {FS=\"-|;|:\"} {OMFT=\"%.3f\"} {print $2 \":\" $3 \":\" $4 \",\" (3600*$2)+(60*$3)+$4}' ", mylogfile.txt)

getawk<-as.data.frame(system(testawk, intern=TRUE, show.output.on.console = FALSE))

I get the error message

Error in system(testawk, intern = TRUE, show.output.on.console = FALSE) : 'awk' not found

I don't think the problem is in my awk construction, as it works fine if I simply do it within cygwin. So there's clearly some facet of the R / system / awk interaction that I am not quite fully grasping.
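For reference, here is a sketch of the same pipeline with the R wrapper stripped away, run against a hypothetical two-line sample file (the payload after the semicolon is replaced by the placeholder `rest`). One hedged observation: `OMFT` looks like a typo for awk's `OFMT`, and `OFMT` only affects numbers printed directly; a number concatenated onto a string is converted with `CONVFMT` (default `%.6g`), so the `print` form would likely show `86383.2` rather than `86383.158`. Using `printf` sidesteps both variables:

```shell
# Hypothetical sample file standing in for mylogfile.txt.
logfile=$(mktemp)
printf '%s\n' '20170710-10:31:26.121;rest' '20170710-23:59:43.158;rest' > "$logfile"

# Reverse the row order, split on "-", ";" or ":", and format explicitly with printf.
result=$(tac "$logfile" | awk 'BEGIN { FS = "-|;|:" }
  { printf "%s:%s:%s,%.3f\n", $2, $3, $4, 3600*$2 + 60*$3 + $4 }')

echo "$result"
rm -f "$logfile"
```

If this runs in the Cygwin shell but the equivalent string fails from R, the difference is more likely in how R locates `tac` and `awk` than in the awk program itself.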

I imagine if I wrapped this all up in an awk script and simply called the script it may work, but I am frustrated that I can't simply find the right syntax to invoke awk directly with the R system command (I handle grep, sed commands etc that way ok).

It's not as simple as awk not actually being supported at all, is it?

Pointers greatly appreciated. If the first say 20 lines of the logfile would be useful I can post those too.

Upvotes: 0

Views: 747

Answers (3)

mlegge

Reputation: 6913

This often happens when trying to use other languages with R, e.g. Python. If you haven't added the paths to your Windows system path then you haven't told RStudio where to find the executables.

The root of Cygwin is normally found at C:\cygwin64 (but could vary by your installation), so find the install and look for the bin folder. In there should be the awk executable, which is normally just a symlink to a gawk executable (verify yourself). Add that bin folder to the PATH (PATH entries are directories, not executables, and Windows separates them with ";"), e.g.:

Sys.setenv(PATH = paste("C:/cygwin64/bin", Sys.getenv("PATH"), sep = ";"))

NOTE: This does not add the folder permanently, so you must run it at the beginning of each session, or add the folder to your Windows path to have it recognized permanently.
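The "verify yourself" step can be done from the Cygwin shell. A minimal sketch, assuming a GNU-style `readlink` that supports `-f` (the variable names are illustrative):

```shell
# Where does the shell find awk, and what does that name ultimately point to?
awk_path=$(command -v awk)
resolved=$(readlink -f "$awk_path")

echo "awk on PATH: $awk_path"
echo "resolves to: $resolved"
```

The directory that contains `awk_path` is the one R needs to see on its PATH.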

Upvotes: 1

hrbrmstr

Reputation: 78792

Just do it all in R:

c(
  "20170710-10:31:26.121;[email protected]@1000000.0@20170710-21:15:53.23@@2017071023:59:43.158@@T@20170710-23:59:43.156#[email protected]@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#[email protected]@1000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#[email protected]@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#",
  "20170710-23:59:43.158;[email protected]@1000000.0@20170710-21:15:53.23@@2017071023:59:43.158@@T@20170710-23:59:43.156#[email protected]@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#[email protected]@1000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#[email protected]@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#"
) -> log_lines

# you'd get the above with `log_lines <- readLines('filename')`

library(magrittr)  # provides the %>% pipe used below

matched <- stringi::stri_match_first_regex(log_lines, "([[:digit:]]+:[[:digit:]]+:[[:digit:]]+\\.[[:digit:]]+)")[,2]

cat(
  rev(
    sprintf(
      "%s,%s\n", 
      matched, 
      lubridate::hms(matched) %>% 
        as.numeric() %>% 
        sprintf("%9.3f", .)
    )
  ),
  sep=""
)

That makes:

23:59:43.158,86383.158
10:31:26.121,37886.121

And you can cat that to a file or store it in a data frame (etc).

I grok that awk might be more familiar to you, but it makes absolutely no sense to use it.

Upvotes: 0

Habakuk

Reputation: 11

Sounds like 'awk' is simply not found; maybe it's not in your PATH. Try putting in the full path to awk, e.g. '/usr/bin/awk'. I'm not using Windows and Cygwin, so your real path will certainly be different.
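To illustrate why a full path helps, the sketch below simulates a process whose PATH does not include awk's directory (the directory `/nonexistent` is a deliberate placeholder). The bare name fails, while the absolute path still works:

```shell
# Record the absolute location of awk while the normal PATH is still in effect.
awk_full=$(command -v awk)

# In a subshell with a PATH that lacks awk's directory, the bare name is not found...
bare=$( (PATH=/nonexistent; awk 'BEGIN { print "ok" }') 2>/dev/null || echo "awk not found")

# ...but a name containing a slash bypasses the PATH search entirely.
full=$( (PATH=/nonexistent; "$awk_full" 'BEGIN { print "ok" }') )

echo "bare name: $bare"
echo "full path: $full"
```

The same idea should carry over to the command string passed to R's system(): replace the bare `awk` with the absolute path found on your own machine.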

Upvotes: 1
