Ravi

Reputation: 3303

Use of environment variables in R

I am trying to understand the reducer.R code taken from the following website.

http://www.thecloudavenue.com/2013/10/mapreduce-programming-in-r-using-hadoop.html

This code is used for Hadoop Streaming with R.

I have given the code below:

    #! /usr/bin/env Rscript
    # reducer.R - Wordcount program in R
    # script for Reducer (R-Hadoop integration)

    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

    splitLine <- function(line) {
      val <- unlist(strsplit(line, "\t"))
      list(word = val[1], count = as.integer(val[2]))
    }

    env <- new.env(hash = TRUE)
    con <- file("stdin", open = "r")

    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line <- trimWhiteSpace(line)
      split <- splitLine(line)
      word <- split$word
      count <- split$count

      if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
      } else {
        assign(word, count, envir = env)
      }
    }
    close(con)

    for (w in ls(env, all = TRUE))
      cat(w, "\t", get(w, envir = env), "\n", sep = "")

Could someone explain the significance of the use of the following new.env command and the subsequent use of the env in the code:

    env <- new.env(hash = TRUE)

Why is this required? What happens if this is not included in the code?

Update 06/05/2014

I wrote another version of this code that does not define a new environment:

    #! /usr/bin/env Rscript
    current_word <- ""
    current_count <- 0
    word <- ""

    con <- file("stdin", open = "r")

    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line1 <- gsub("(^ +)|( +$)", "", line)
      word <- unlist(strsplit(line1, "[[:space:]]+"))[[1]]
      count <- as.numeric(unlist(strsplit(line1, "[[:space:]]+"))[[2]])

      if (current_word == word) {
        current_count <- current_count + count
      } else {
        if (current_word != "") {
          cat(current_word, '\t', current_count, '\n')
        }
        current_count <- count
        current_word <- word
      }
    }

    if (current_word == word) {
      cat(current_word, '\t', current_count, '\n')
    }

    close(con)

This code gives the same output as the one with a new environment defined.

Question: Does using a new environment provide any advantages from a Hadoop standpoint? Is there a reason for using it in this specific case?

Thank you.

Upvotes: 0

Views: 324

Answers (1)

rischan

Reputation: 1585

Your question is about environments in R. Here is example code for creating a new environment:

    > my.env <- new.env()
    > my.env
    <environment: 0x114a9d940>
    > ls(my.env)
    character(0)
    > assign("a", 999, envir=my.env)
    > my.env$foo = "This is the variable foo."
    > ls(my.env)
    [1] "a"   "foo"

I think this article can help you: http://www.r-bloggers.com/environments-in-r/. You can also run

    ?environment

for more help.

In the code that you posted, the author creates a new environment:

    env <- new.env(hash = TRUE)

When the author wants to assign a value, the call names that environment explicitly:

    assign(word, oldcount + count, envir = env)
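To make the pattern concrete, here is a minimal sketch of a hashed environment used as a key-value store for word counts. The input vector `lines` and the names `counts` and `result` are made up for illustration; they are not part of the original reducer, which reads from stdin instead.

```r
# Hypothetical input in "word<TAB>count" form, standing in for mapper output.
lines <- c("apple\t1", "pear\t1", "apple\t1")

# A hashed environment acts like a dictionary keyed by word.
counts <- new.env(hash = TRUE)

for (line in lines) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  word <- fields[1]
  n <- as.integer(fields[2])

  # Start from 0 if the word has not been seen yet.
  old <- if (exists(word, envir = counts, inherits = FALSE)) {
    get(word, envir = counts)
  } else {
    0L
  }
  assign(word, old + n, envir = counts)
}

# Collect the totals into a named vector; ls() returns the keys sorted.
result <- sapply(ls(counts), function(w) get(w, envir = counts))
print(result)
```

Because every word is looked up by name, this approach does not depend on the order in which the lines arrive.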

As for the question "What happens if this is not included in the code?", I think you can find the answer in the link that I already provided.

The advantages of using a new env in R are already covered in this link.

The reason is that in this case you may be working with a large dataset. When you pass a dataset to a function and the function modifies it, R makes a copy of the dataset, and the returned data then has to overwrite the old dataset. If you pass an env instead, R works on that env directly without copying the large dataset.
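The difference can be seen in a small sketch. The function names `add_to_list` and `add_to_env` are hypothetical and only serve to contrast the two behaviours; this is an illustration of the general idea, not code from the linked article.

```r
# Modifying a list inside a function changes a local copy only.
add_to_list <- function(lst) {
  lst$total <- lst$total + 1
  invisible(NULL)
}

# Modifying an environment inside a function changes the caller's object.
add_to_env <- function(e) {
  e$total <- e$total + 1
  invisible(NULL)
}

l <- list(total = 0)
e <- new.env()
e$total <- 0

add_to_list(l)
add_to_env(e)

print(l$total)  # unchanged: the function worked on a copy
print(e$total)  # updated in place: environments are passed by reference
```

For the list version to have an effect, the function would need to return the modified list and the caller would need to reassign it, which is where the copying comes in.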

Upvotes: 3
