Reputation: 13
I have a large data file in CSV format of which I need to import only certain rows .Let's call this large file A.csv.
I have another csv file i.e. B.csv which has two columns and some rows.
Now I need to import only those rows of data from A.csv which have first two column values same as the B.csv column values for a certain row .So,I tried this after importing both files
But it seems to take forever
while(count<4632)
{
count=count+1
count2=0
while(count2<17415)
{
count2=count 2+1
if(B[count,1]==A[count2,1])
dbase[count,]=A[count2,]
}
}
Please help!!
Upvotes: 1
Views: 1253
Reputation: 176
Maybe I have too few information but I will try to reply...
I think you can simply make a join between the two files loading only the smaller one. I will do something like this with the help of the sqldf package:
library(sqldf)
tmp_csv <- "path/of/your/big/file.csv"
# load your small file and make sure the two columns
# have the same name of the columns of the big file
tmp_df <- read.csv("path/of/your/small/file.csv")
# join the two dataset with a single sql query
out_data <- read.csv2.sql(tmp_csv, sql = "select * from file join tmp_df using (Column1, Column2)", header = TRUE)
You can use read.csv2.sql or read.csv.sql depending on your separator character. Double check the names of your columns because it is the fundamental part of the join operation.
Upvotes: 0
Reputation: 555
You have two nested big loops, and you are dynamically growing a vector. Both are bad for performance. Try to vectorise both operations.
For example:
set.seed(123)
dfA <- data.frame(
a = sample(LETTERS, 10000, TRUE),
b = sample(LETTERS[1:3], 10000, TRUE),
c = rnorm( 10000 ),
stringsAsFactors = FALSE
)
dfB <- data.frame(
a = sample(LETTERS, 1000, TRUE),
b = sample(LETTERS[1:3], 1000, TRUE),
stringsAsFactors = FALSE
)
dfC <- dfA[ which( paste(dfA$a, dfA$b) %in% paste(dfB$a, dfB$a)), ]
Upvotes: 1