Henryk Borzymowski

Reputation: 1078

Generator function in LSTM Keras for outputting mini batches of one files

I have a generator function which works fine. I have a large list of .txt files, and each file is also quite long. The task is now to write a generator function which takes:

  1. a batch of files
  2. and then a batch of size 128 out of one file

My code so far:

data_files_generator <- function(train_set) {

  files <- train_set
  next_file <- 0

  function() {

    # move to the next file (note the <<- assignment operator)
    next_file <<- next_file + 1

    # if we've exhausted all of the files then start again at the
    # beginning of the list (keras generators need to yield
    # data infinitely -- termination is controlled by the epochs
    # and steps_per_epoch arguments to fit_generator())
    if (next_file > length(files))
    {next_file <<- 1}

    # determine the file name
    file <- files[[next_file]]

    text <- read_lines(paste(data_dir, file, sep = "" )) %>%
      str_to_lower() %>%
      str_c(collapse = "\n") %>%
      removeNumbers() %>%
      tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

    text <- text[text %in% chars]

    dataset <- map(
      seq(1, length(text) - maxlen - 1, by = 3), 
      ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
    )

    dataset <- transpose(dataset)

    # Vectorization
    x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
    y <- array(0, dim = c(length(dataset$sentece), length(chars)))

    for(i in 1:length(dataset$sentece)){

      x[i,,] <- sapply(chars, function(x){
        as.integer(x == dataset$sentece[[i]])
      })

      y[i,] <- as.integer(chars == dataset$next_char[[i]])

    }
    rounded_dim <- floor(dim(x)[1]/mini_batch_size)
    match_size_to_batch <- 128 * rounded_dim

    x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
    y <- y[1:match_size_to_batch, 1:length(chars)]

    return(list(x, y))

  }
}

So what comes in is a text file, which is transformed into smaller pieces of text (of length maxlen) and is then one-hot encoded into 0/1 matrices.

The problem is that my code outputs one data cube of size maxlen x length(chars) x samples, where the number of samples is very large. That is why I would like my generator function to always output a cube of size maxlen x length(chars) x 128 samples, then output the next batch of the same size, and so on until the whole text file has been read in, and only then move on to the next text file...

The output for now is an error:

 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Cannot feed value of shape (112512, 40, 43) for Tensor 'lstm_layer_input_1:0', which has shape '(128, 40, 43)' 

I hope I have explained it well enough to understand. I think I have to add some kind of for loop to iterate over the sample length, but I have no idea how to include this in the generator function.

Upvotes: 0

Views: 395

Answers (2)

Henryk Borzymowski

Reputation: 1078

I have implemented a for loop which now returns batches of size 128:

Changed Code:

data_files_generator <- function(train_set) {

  files <- train_set
  next_file <- 0

  function() {

    # move to the next file (note the <<- assignment operator)
    next_file <<- next_file + 1

    # if we've exhausted all of the files then start again at the
    # beginning of the list (keras generators need to yield
    # data infinitely -- termination is controlled by the epochs
    # and steps_per_epoch arguments to fit_generator())
    if (next_file > length(files))
    {next_file <<- 1}

    # determine the file name
    file <- files[[next_file]]

    text <- read_lines(paste(data_dir, file, sep = "" )) %>%
      str_to_lower() %>%
      str_c(collapse = "\n") %>%
      removeNumbers() %>%
      tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

    text <- text[text %in% chars]

    dataset <- map(
      seq(1, length(text) - maxlen - 1, by = 3), 
      ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
    )

    dataset <- transpose(dataset)

    # Vectorization
    x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
    y <- array(0, dim = c(length(dataset$sentece), length(chars)))

    for(i in 1:length(dataset$sentece)){

      x[i,,] <- sapply(chars, function(x){
        as.integer(x == dataset$sentece[[i]])
      })

      y[i,] <- as.integer(chars == dataset$next_char[[i]])

    }
    rounded_dim <- floor(dim(x)[1]/mini_batch_size)
    match_size_to_batch <- 128 * rounded_dim

    x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
    y <- y[1:match_size_to_batch, 1:length(chars)]

    #Edit:
    span_start <- 1
    for (iter in 1:rounded_dim){
     i <- iter * 128
     span_end <- iter * 128
     x <- x[span_start:span_end, 1:maxlen, 1:length(chars)]
     y <- y[span_start:span_end, 1:length(chars)]
     span_start <- i
     return(list(x, y))
    }
  }
}
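
If each call should return the next 128-sample slice of the same file before moving on to the next one, the slice offset has to survive between calls, for example as another closure variable next to next_file. Below is a minimal sketch of that idea (not part of the original answer): read_file_as_arrays() is a hypothetical helper standing in for the preprocessing and vectorization code above, and each file is assumed to yield at least 128 samples.

data_files_generator <- function(train_set, batch_size = 128) {

  files <- train_set
  next_file <- 0
  x <- NULL       # arrays built from the file currently being served
  y <- NULL
  offset <- 0     # number of samples of the current file already served

  function() {

    # load the next file on the first call or when the current file is exhausted
    if (is.null(x) || offset + batch_size > dim(x)[1]) {
      next_file <<- next_file + 1
      if (next_file > length(files)) next_file <<- 1
      arrays <- read_file_as_arrays(files[[next_file]])  # hypothetical helper
      x <<- arrays$x
      y <<- arrays$y
      offset <<- 0
    }

    # slice out the next mini batch and advance the offset
    rows <- (offset + 1):(offset + batch_size)
    offset <<- offset + batch_size

    list(x[rows, , , drop = FALSE], y[rows, , drop = FALSE])
  }
}

As in the original code's comments, termination is still controlled by the epochs and steps_per_epoch arguments to fit_generator(), which now count these 128-sample batches rather than files.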

Upvotes: 1

mickey

Reputation: 2188

According to the error, you're trying to feed in an object of shape (112512, 40, 43) but your LSTM layer is expecting an object of shape (128, 40, 43). There seems to be some missing code, but when you're defining the input layer, are you fixing the batch size? I've had luck with defining my input layer as:

l_input = Input(shape = (None, num_features), name = 'input_layer')

I suspect the error is due to these lines of code:

rounded_dim <- floor(dim(x)[1]/mini_batch_size)
match_size_to_batch <- 128 * rounded_dim

This gives you a batch size much larger than 128. From the Keras documentation, the input shape should be (batch_size, timesteps, input_dim). The batch sizes need not be the same throughout an entire epoch, but within a batch every sample needs to have the same number of timesteps (which it looks like you handle with maxlen).
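
In the R interface, the same idea would look roughly like the sketch below (not taken from the question or the answers; the layer sizes are placeholders). Defining the first layer with input_shape rather than batch_input_shape leaves the batch dimension flexible, so the model accepts whatever number of samples the generator happens to yield.

library(keras)

# Sketch: only timesteps (maxlen) and features (length(chars)) are fixed;
# the batch dimension is left unspecified, so any batch size is accepted.
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")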

Upvotes: 1
