Pxu80

Reputation: 154

Quanteda warning: number of columns of result is not a multiple of vector length (arg 2030)

While trying to parse over 7,000 txt files with the readtext package (which ships with the quanteda package) in R, I got the following warning:

Warning message: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 2030)

How can I figure out which txt file(s) cause(s) the warning?

Using the verbose option does not show where the warning occurs. For your information, when I parse just two files I get the following output (btw, if I parse only one document at a time, the warning is not shown).

Reading texts from /Users/OS/surfdrive/Competenties/Data-step-1/BinnenlandsBestuur/1982/9-12/Office Lens 20170308-102311.jpg.txt
Reading texts from /Users/OS/surfdrive/Competenties/Data-step-1/BinnenlandsBestuur/1983/Office Lens 20170308-103518.jpg.txt, using glob pattern
... reading (txt) file: Office Lens 20170308-102311.jpg.txt , using glob pattern
... reading (txt) file: Office Lens 20170308-103518.jpg.txt
read 2 documents.
Warning messages:
1: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 2)
2: In if (verbosity == 2 & nchar(msg) > 70) pad <- paste0("\n", pad) : the condition has length > 1 and only the first element will be used

Session info
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C/C/C/C/C/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tm.plugin.webmining_1.3 XML_3.98-1.7            readtext_0.50           RoogleVision_0.0.1.1   
 [5] outliers_0.14           stringdist_0.9.4.4      ltm_1.0-0               polycor_0.7-9          
 [9] msm_1.6.4               MASS_7.3-47             psych_1.7.5             WriteXLS_4.0.0         
[13] plyr_1.8.4              metafor_2.0-0           Matrix_1.2-9            metaSEM_0.9.14         
[17] OpenMx_2.7.12           xlsx_0.5.7              xlsxjars_0.6.1          rJava_0.9-8            
[21] readxl_1.0.0            quanteda_0.9.9-65       koRpus.lang.nl_0.01-3   koRpus_0.11-1          
[25] sylly_0.1-1             jsonlite_1.5            httr_1.2.1             

loaded via a namespace (and not attached):
 [1] sylly.ru_0.1-1      splines_3.4.0       ellipse_0.3-8       RcppParallel_4.3.20 shiny_1.0.3        
 [6] sylly.it_0.1-1      expm_0.999-2        sylly.es_0.1-1      cellranger_1.1.0    slam_0.1-40        
[11] yaml_2.1.14         backports_1.1.0     lattice_0.20-35     digest_0.6.12       googleAuthR_0.5.1  
[16] colorspace_1.3-2    htmltools_0.3.6     httpuv_1.3.3        tm_0.7-1            devtools_1.13.2    
[21] xtable_1.8-2        mvtnorm_1.0-6       scales_0.4.1        tibble_1.3.3        openssl_0.9.6      
[26] ggplot2_2.2.1       withr_1.0.2         lazyeval_0.2.0      NLP_0.1-10          mnormt_1.5-5       
[31] RJSONIO_1.3-0       survival_2.41-3     magrittr_1.5        mime_0.5            memoise_1.1.0      
[36] evaluate_0.10       boilerpipeR_1.3     nlme_3.1-131        foreign_0.8-67      rsconnect_0.8      
[41] tools_3.4.0         data.table_1.10.4   stringr_1.2.0       munsell_0.4.3       compiler_3.4.0     
[46] rlang_0.1.1         grid_3.4.0          RCurl_1.95-4.8      bitops_1.0-6        rmarkdown_1.5      
[51] gtable_0.2.0        curl_2.6            R6_2.2.2            sylly.en_0.1-1      knitr_1.16         
[56] fastmatch_1.1-0     sylly.fr_0.1-1      rprojroot_1.2       stringi_1.1.5       parallel_3.4.0     
[61] sylly.de_0.1-1      Rcpp_0.12.11 

Thank you, Peter

PS. If this info is insufficient, I'll post a reproducible example on the GitHub page.

Upvotes: 1

Views: 1573

Answers (1)

Andrew Brēza

Reputation: 8317

You can use purrr to look for columns that don't match what you want.

First let's create some demo data with one file that has different names from the other three...

library(tidyverse)
library(purrr)
library(stringr)
old_wd <- getwd()
setwd(tempdir())

demo_data <- tibble(x = rnorm(327),
                    y = rnorm(327),
                    z = rnorm(327))

write_csv(demo_data, "demo1.csv")
write_csv(demo_data, "demo2.csv")
write_csv(demo_data, "demo3.csv")

bad_data <-
  tibble(
    x = rnorm(327),
    y = rnorm(327),
    z = rnorm(327),
    extra_column = rnorm(327)
  )

write_csv(bad_data, "demo4.csv")

Now define what the column names should be. For this example, the correct names are x, y, and z.

correct_names <- c("x", "y", "z")

This function will read a csv and check whether its column names are exactly the names in correct_names.

get_csv_names <- function(path){
  # identical() also handles a differing number of columns without a recycling warning
  c(path, identical(names(read_csv(path)), correct_names))
}

I'm assuming that you want to process all of the csv files in your working directory. Otherwise you'll want to change the value of files from what I have below...

# Keep only file names that end in ".csv" (the dot is escaped so it matches literally)
files <- list.files(pattern = "\\.csv$")

Now it's just a matter of mapping files to the function get_csv_names. Notice how demo4.csv has a value of FALSE, which means that its column names do not match what you specified in correct_names...

map(files, get_csv_names)

# [[1]]
# [1] "demo1.csv" "TRUE"     
# 
# [[2]]
# [1] "demo2.csv" "TRUE"     
# 
# [[3]]
# [1] "demo3.csv" "TRUE"     
# 
# [[4]]
# [1] "demo4.csv" "FALSE"  
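
If you have thousands of files, a list of character vectors is awkward to scan. As a rough sketch (not part of the approach above as written), the same check can return a logical per file so the offending files can be filtered out directly. The helper name has_correct_names, the demo_dir folder, and the small demo files are made up for illustration. Only the header row is read, which assumes the column names alone are enough to spot a bad file.

```r
library(readr)
library(purrr)
library(tibble)

# Hypothetical helper: TRUE when the csv header is exactly `expected`,
# in the same order. n_max = 0 reads the header only.
has_correct_names <- function(path, expected) {
  header <- names(read_csv(path, n_max = 0,
                           col_types = cols(.default = col_character())))
  identical(header, expected)
}

# Small self-contained demo in its own temporary folder (illustrative data).
demo_dir <- file.path(tempdir(), "csv_name_check")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(x = 1:3, y = 4:6, z = 7:9),
          file.path(demo_dir, "demo1.csv"))
write_csv(tibble(x = 1:3, y = 4:6, z = 7:9, extra_column = 0),
          file.path(demo_dir, "demo4.csv"))

files_to_check <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# One row per file, with a logical column that can be filtered on.
checks <- tibble(
  file     = basename(files_to_check),
  names_ok = map_lgl(files_to_check, has_correct_names,
                     expected = c("x", "y", "z"))
)

# Names of the files whose header does not match.
bad_files <- checks$file[!checks$names_ok]
bad_files
```

Here bad_files contains only "demo4.csv", the one file with the extra column.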

Since we changed the working directory at the beginning, it's a good idea to reset it at the end.

setwd(old_wd)
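
To relate this back to the warning in the question: that message is the one base R's rbind() gives when it is handed vectors of different lengths, and the "(arg N)" part is the position of the argument that did not fit. What readtext actually binds together internally, and whether argument 2030 corresponds to the 2030th file, is an assumption I have not verified. The sketch below uses made-up pieces to show how such a warning arises and how comparing lengths points to the odd one out.

```r
# Made-up stand-ins for per-file pieces; the third one is deliberately short.
pieces <- list(
  "a.txt" = c("a", "1982", "9-12"),
  "b.txt" = c("b", "1983", "1-4"),
  "c.txt" = c("c", "1983")
)

# Binding them triggers the same kind of warning. It is caught here so the
# message can be inspected instead of scrolling past.
caught_msg <- NULL
bound <- withCallingHandlers(
  do.call(rbind, pieces),
  warning = function(w) {
    caught_msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)
caught_msg

# Compare each piece's length with the most common length to find the misfit.
piece_lengths   <- lengths(pieces)
expected_length <- as.integer(names(which.max(table(piece_lengths))))
odd_ones        <- names(pieces)[piece_lengths != expected_length]
odd_ones
```

In this toy case odd_ones is "c.txt". With real data you would first need to work out what the pieces are (for example, values derived from each file name or path), which is why this is only a starting point for debugging.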

Upvotes: 1
