Kuljeet Keshav
Kuljeet Keshav

Reputation: 13

Selectively reading CSV file in r

I have a large data file in CSV format of which I need to import only certain rows .Let's call this large file A.csv.

I have another csv file i.e. B.csv which has two columns and some rows.

Now I need to import only those rows of data from A.csv which have first two column values same as the B.csv column values for a certain row .So,I tried this after importing both files

But it seems to take forever

while(count<4632)
{
    count=count+1
    count2=0
    while(count2<17415)
    {
        count2=count 2+1
        if(B[count,1]==A[count2,1])
            dbase[count,]=A[count2,]
    }
}

Please help!!

Upvotes: 1

Views: 1253

Answers (2)

D-Lorenz
D-Lorenz

Reputation: 176

Maybe I have too few information but I will try to reply...

I think you can simply make a join between the two files loading only the smaller one. I will do something like this with the help of the sqldf package:

library(sqldf)

tmp_csv <- "path/of/your/big/file.csv"

# load your small file and make sure the two columns 
# have the same name of the columns of the big file
tmp_df <- read.csv("path/of/your/small/file.csv")

# join the two dataset with a single sql query
out_data <- read.csv2.sql(tmp_csv, sql = "select * from file join tmp_df using (Column1, Column2)", header = TRUE) 

You can use read.csv2.sql or read.csv.sql depending on your separator character. Double check the names of your columns because it is the fundamental part of the join operation.

Upvotes: 0

Jasper
Jasper

Reputation: 555

You have two nested big loops, and you are dynamically growing a vector. Both are bad for performance. Try to vectorise both operations.

For example:

set.seed(123)
dfA <- data.frame(
    a = sample(LETTERS, 10000, TRUE),
    b = sample(LETTERS[1:3], 10000, TRUE),
    c = rnorm( 10000 ),
    stringsAsFactors = FALSE
)
dfB <- data.frame(
    a = sample(LETTERS, 1000, TRUE),
    b = sample(LETTERS[1:3], 1000, TRUE),
    stringsAsFactors = FALSE
)

dfC <- dfA[ which( paste(dfA$a, dfA$b) %in% paste(dfB$a, dfB$a)), ]

Upvotes: 1

Related Questions