Jd Baba

Reputation: 6118

Reduce the large dataset into smaller data set using R

I want to reduce a very large dataset with two variables into a smaller file. I need to find consecutive data points that share the same value in the second column, keep only the starting and ending points of each such run, and remove all the data points in between. For example, a sample of the dataset looks like the following:

363.54167   23.3699
363.58333   23.3699
363.625 0
363.66667   0
363.70833   126.16542
363.75  126.16542
363.79167   126.16542
363.83333   126.16542
363.875 126.16542
363.91667   0
363.95833   0
364 0
364.04167   0
364.08333   0
364.125 0
364.16667   0
364.20833   0
364.25  127.79872
364.29167   127.79872
364.33333   127.79872
364.375 127.79872
364.41667   127.79872
364.45833   127.79872
364.5   0
364.54167   0
364.58333   0
364.625 0
364.66667   0
364.70833   127.43202
364.75  135.44052
364.79167   135.25522
364.83333   135.12892
364.875 20.32986
364.91667   0
364.95833   0

Here, the first two points have the same value, i.e. 23.3699, so I keep both of them as they are. I need to write a condition such that if two or more consecutive data points have the same value, only the starting and ending data points are kept. The next two values are also equal (0), so I keep those two as well. After that, however, there are 5 data points with the same value; I want to keep just the two endpoints, i.e. 363.70833 and 363.875, and remove the data points in between them. Likewise, from the following run of zeros I keep only the two endpoints, 363.91667 and 364.20833.

The sample output I am looking for is as follows:

363.54167   23.3699
363.58333   23.3699
363.625 0
363.66667   0
363.70833   126.16542
363.875 126.16542
363.91667   0
364.20833   0
364.25  127.79872
364.45833   127.79872
364.5   0
364.66667   0
364.70833   127.43202
364.75  135.44052
364.79167   135.25522
364.83333   135.12892
364.875 20.32986
364.91667   0
364.95833   0
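For reference, a minimal sketch of how a two-column, whitespace-separated file like the sample above could be read into R (the file name sample.txt is a placeholder, and the data frame and column names DF, a, and b are chosen to match the answers below):

# "sample.txt" is a placeholder; adjust to the real file name
DF <- read.table("sample.txt", header = FALSE, col.names = c("a", "b"))
str(DF)   # two numeric columns: a (the index) and b (the value)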

Upvotes: 1

Views: 2862

Answers (3)

Martin Morgan

Reputation: 46866

x = tapply(df[[1]], df[[2]], range)

gives the range (minimum and maximum) of the first column for each distinct value of the second column, and

cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))

gets a matrix. More explicitly, and avoiding coercion to and from character vectors:

u = unique(df[[2]])                                       # distinct values of the second column
rng = sapply(split(df[[1]], match(df[[2]], u)), range)    # min and max of the first column per value
cbind(as.vector(rng), rep(u, each=2))                     # two rows (min, max) for each distinct value

If the data are very large, then sort by df[[1]], find the first (min) and last (max) rows for each value of df[[2]], and combine these:

df = df[order(df[[1]]),]                                  # sort rows by the first column
res = rbind(df[!duplicated(df[[2]]),],                    # first occurrence of each value of df[[2]]
            df[!duplicated(df[[2]], fromLast=TRUE),])     # last occurrence of each value
res[order(res[[2]]),]                                     # pair up the first and last rows by value

perhaps setting the row names of the subset to NULL.
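For example, a small sketch of that last clean-up step (out is just an illustrative name for the ordered subset from the final line above):

out <- res[order(res[[2]]), ]   # the reordered subset from the code above
rownames(out) <- NULL           # reset the row names carried over from df
out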

Upvotes: 0

Brian Diggs

Reputation: 58835

If your data is in a data frame DF with column names a and b, then

runs <- rle(DF$b)
firsts <- cumsum(c(0, runs$lengths[-length(runs$lengths)])) + 1
lasts <- cumsum(runs$lengths)
edges <- unique(sort(c(firsts, lasts)))
DF[edges,]

gives

> DF[edges,]
          a         b
1  363.5417  23.36990
2  363.5833  23.36990
3  363.6250   0.00000
4  363.6667   0.00000
5  363.7083 126.16542
9  363.8750 126.16542
10 363.9167   0.00000
17 364.2083   0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000   0.00000
28 364.6667   0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750  20.32986
34 364.9167   0.00000
35 364.9583   0.00000

rle gives the lengths of the groups of consecutive rows that share the same value (floating-point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first and the last row of each group, respectively. Put the indices together, sort them, and drop duplicates (a group of size one lists the same row as both its first and last), then index DF by those row numbers.
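If floating-point noise does turn up (values that should be equal differing in the last few digits), one possible guard, not part of the original answer, is to run rle on a rounded copy of the column and use those run boundaries to index the original rows; the 5 decimal places here are an arbitrary choice:

runs   <- rle(round(DF$b, 5))   # values that agree after rounding fall in the same run
firsts <- cumsum(c(0, head(runs$lengths, -1))) + 1
lasts  <- cumsum(runs$lengths)
DF[unique(sort(c(firsts, lasts))), ]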

Upvotes: 3

Carl Witthoft

Reputation: 21502

I'd use rle here (no surprise to those who know me :-) ). Keep in mind that you will want to check for approximate equality to avoid floating-point rounding problems; here's the concept. rle returns two sequences: one tells you how many times a value is repeated, the other gives the value itself. Since you want to keep at most two data points per run, we'll essentially "shrink" every run that is longer than that. Edit: I recognize that this is relatively clunky code and a gentler touch with melt/cast should be far more efficient; I just liked doing it this way.

df <- cbind(1:20, sample(1:3, 20, replace = TRUE))   # toy data: an index plus a column with repeated values
rdf <- rle(df[, 2])                                  # run lengths and values of the second column
lenfoo <- rdf$lengths
cfoo <- cumsum(lenfoo)                               # row index of the last row in each run
repfoo <- ifelse(lenfoo == 1, 1, 2)                  # keep that row once for runs of length 1, twice otherwise
outfoo <- matrix(NA, nrow = 1, ncol = 2)             # placeholder row of NAs to grow the result from
for (j in 1:length(cfoo)) outfoo <- rbind(outfoo, matrix(rep(df[cfoo[j], ], times = repfoo[j]), ncol = 2, byrow = TRUE))

Rgames> df
      [,1] [,2]
 [1,]    1    2
 [2,]    2    2
 [3,]    3    3
 [4,]    4    3
 [5,]    5    3
 [6,]    6    3
 [7,]    7    3
 [8,]    8    2
 [9,]    9    2
[10,]   10    3
[11,]   11    1
[12,]   12    2
[13,]   13    2
[14,]   14    3
[15,]   15    1
[16,]   16    2
[17,]   17    1
[18,]   18    2
[19,]   19    3
[20,]   20    1
Rgames> outfoo
      [,1] [,2]
 [1,]   NA   NA
 [2,]    2    2
 [3,]    2    2
 [4,]    7    3
 [5,]    7    3
 [6,]    9    2
 [7,]    9    2
 [8,]   10    3
 [9,]   11    1
[10,]   13    2
[11,]   13    2
[12,]   14    3
[13,]   15    1
[14,]   16    2
[15,]   17    1
[16,]   18    2
[17,]   19    3
[18,]   20    1
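If the leading all-NA row in outfoo (the placeholder the loop grows the matrix from) is not wanted in the final result, it can simply be dropped afterwards, for example:

outfoo <- outfoo[-1, ]   # drop the NA placeholder row used to seed rbind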

Upvotes: 1
