chopin_is_the_best

Reputation: 2101

Read a random sample from URL

I want to read a random sample of a CSV-formatted file directly from a URL.

So far:

library(tidyverse)
library(data.table)

# load dataset from url, skip the first 16 rows
# then *after* reading it completely, use dplyr function
# for sampling. quite dumb, I want to do it while 
# reading the file

df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.01) %>% 
  rename(password = V1)

Then I tried, as suggested in several posts:

df <- fread("shuf -n 10 http://datashaping.com/passwords.txt", skip = 16, header = F)

But it doesn't work for me. Error:

shuf: 'http://datashaping.com/passwords.txt': No such file or directory
Error in fread("shuf -n 10 http://datashaping.com/passwords.txt", skip = 16,  : 
  File is empty: /dev/shm/file1ab1608b13cf

Moreover, fread seems to be rather slow.

Any idea? Benchmarks?

Edit

I tried to benchmark read.csv() vs. fread():

library(rbenchmark)

# compare reading everything and sampling in R vs. sampling in the shell
benchmark("read.csv" = {
            df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16)
            df <- df %>%
                sample_n(10) %>% 
                rename(password = V1)
          },
          "fread" = {
            df <- fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10") 
          },
          replications = 100,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

Warning message in fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10"):
“Stopped reading at empty line 9 but text exists afterwards (discarded): 08090728”
Warning message in fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10"):
“Stopped reading at empty line 6 but text exists afterwards (discarded): 0307737205”

Upvotes: 0

Views: 254

Answers (1)

mysteRious

Reputation: 4304

It looks like that file is not really a CSV, and the data starts on line 15. I am on Windows 10 right now, and this worked for me very quickly (it reads the whole file, not a random sample):

> test <- fread("http://datashaping.com/passwords.txt",skip=15)
trying URL 'http://datashaping.com/passwords.txt'
Content type 'text/plain' length 20163417 bytes (19.2 MB)
downloaded 19.2 MB

Read 2069847 rows and 1 (of 1) columns from 0.019 GB file in 00:00:03

It provides a data.table structure as expected:

> str(test)
Classes ‘data.table’ and 'data.frame':  2069847 obs. of  1 variable:
 $ #: chr  "07606374520" "piontekendre" "rambo144" "primoz123" ...
 - attr(*, ".internal.selfref")=<externalptr> 
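
Since the result is an ordinary data.table, a random sample can be drawn from it after reading. The following is only a minimal sketch: `demo` is a small made-up table standing in for `test`, and the seed is there just to make the illustration repeatable.

```r
library(data.table)

# Hypothetical stand-in for the table returned by fread(); the real one
# has about two million rows in a single character column.
demo <- data.table(password = sprintf("pw%05d", 1:1000))

set.seed(42)  # only so the illustration is reproducible

# Draw 10 rows uniformly at random, without replacement.
# .N is the number of rows of the table.
sample_n10 <- demo[sample(.N, 10)]

# Draw roughly 1% of the rows, similar in spirit to sample_frac(.01)
# from the question.
sample_1pct <- demo[sample(.N, round(0.01 * .N))]
```

This still downloads and parses the whole file first; it only replaces the dplyr sampling step with data.table indexing.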

You can access all the data like this (use with=FALSE to reference by column number):

> test[,1,with=FALSE]
                    #
      1:  07606374520
      2: piontekendre
      3:     rambo144
      4:    primoz123
      5:      sal1387
     ---             
2069843:     26778982
2069844:      brazer1
2069845:   usethisone
2069846:  scare222273
2069847:     anto1962

And you can access individual passwords like this:

> test[1,1,with=FALSE]
             #
1: 07606374520
> test[5,1,with=FALSE]
         #
1: sal1387
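
As for the `shuf` error in the question: `shuf` treats its argument as a local file name, so it cannot open a URL by itself. One possible workaround is to get the file onto disk first (for example with `download.file()`) and then let the shell drop the header lines and sample before `fread` sees the data. The sketch below assumes a Unix-like system with GNU `tail` and `shuf` on the PATH; `sample_lines` is a hypothetical helper, and a small temporary file stands in for the downloaded one.

```r
library(data.table)

# Hypothetical helper: sample n lines from a local text file, ignoring the
# first `skip` lines. Assumes GNU `tail` and `shuf` are available.
sample_lines <- function(path, n, skip = 0) {
  # tail -n +K prints from line K onward; shuf -n picks n random lines
  cmd <- sprintf("tail -n +%d %s | shuf -n %d", skip + 1, shQuote(path), n)
  fread(cmd = cmd, header = FALSE,
        colClasses = "character",   # keep leading zeros such as "0760..."
        col.names = "password")
}

# Stand-in for the downloaded file: 16 header lines, then 1000 entries.
# For the real data one would first run something like
#   download.file("http://datashaping.com/passwords.txt", tmp)
tmp <- tempfile(fileext = ".txt")
writeLines(c(paste("# header line", 1:16), sprintf("pw%05d", 1:1000)), tmp)

smp <- sample_lines(tmp, n = 10, skip = 16)
```

Only the sampled lines are parsed by `fread` here, but the full file still has to be downloaded, so most of the time saved is in parsing rather than in the transfer.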

Upvotes: 1
