nasia jaffri

Reputation: 823

How to split data in SPSS based on percentage

I have a 7 GB file in SPSS format. It contains survey data with comment-level scores and sentence-level scores. One comment can have multiple sentences, and one survey has up to 4 comments.

I am trying to do random sampling in SPSS so I can use the smaller file in R, but simple random sampling picks individual rows, so it does not keep all the rows of a survey and its comments together.

What I want is to take a sample from this big file that picks only 5% of the Surv_IDs, so that all the rows for each selected survey stay together.

Surv_ID  Sentence_ID Comment_ID Sentence_Score Comment_Score
A001         001       1            3.5             2
A001         002       1            2.8             2
A001         001       2            1.4            -1
A001         002       2           -2.9            -1
A001         003       2           -3.1            -1
A002         001       1            2.3             3
A002         002       1            4.3             3
A002         001       2            1.2             1
A002         002       2            0.85            1
A002         003       2            0.79            1
A002         001       3            3.5             2
A002         002       3           -3.1             2
A002         003       3            2.8             2
A003         001       1             1              1
A003         001       2           -0.9            -3
A003         002       2           -4.3            -3
A003         003       2           -4.0            -3
A003         001       3            3.4             3
A003         002       3            4.4             3
A003         001       4            2.8             2

Upvotes: 2

Views: 786

Answers (1)

Jignesh Sutar

Reputation: 2929

COMPUTE RandNum=RV.UNIFORM(0,1).
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES /BREAK=Surv_ID /RandNum=MAX(RandNum).
SORT CASES BY RandNum Surv_ID.
COMPUTE SurvIDNum=SUM(LAG(SurvIDNum),(LAG(Surv_ID)<>Surv_ID)=1 OR $CASENUM=1).
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /TotN=MAX(SurvIDNum).
COMPUTE SurvIDNumPCT=SurvIDNum/TotN.
SELECT IF (SurvIDNumPCT<0.05).

  1. Create a random variable for all cases
  2. Assign the maximum random value within each Surv_ID to all rows of that survey
  3. Sort cases by the random variable, then by Surv_ID, so each survey's rows stay clustered
  4. Create a running numeric counter that increases by one at each new Surv_ID
  5. Divide this counter by the total number of unique Surv_IDs (the counter's maximum, not the number of rows) to get each survey's proportion
  6. Select the surveys whose proportion is below the required cutoff (here 0.05)

Each step above has a GUI equivalent in the following menus:

  1. Transform -> Compute Variable
  2. Data -> Aggregate
  3. Data -> Sort Cases
  4. Transform -> Compute Variable
  5. Data -> Aggregate, then Transform -> Compute Variable
  6. Data -> Select Cases
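
As a cross-check outside SPSS, the same idea (one random draw per Surv_ID, then keep every row of the surveys with the lowest draws) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the SPSS solution: the function name `sample_surveys`, the `layout` dictionary, and the dict-per-row representation are hypothetical, and the cutoff rounds down to a whole number of surveys, which may differ by one survey from the SPSS `SELECT IF` rule.

```python
import random


def sample_surveys(rows, fraction, seed=None):
    """Return all rows belonging to a random `fraction` of the distinct Surv_IDs.

    Hypothetical helper: rows are dicts that contain a "Surv_ID" key.
    """
    rng = random.Random(seed)

    # One random draw per survey, so every row of a survey shares the same value.
    draws = {}
    for row in rows:
        draws.setdefault(row["Surv_ID"], rng.random())

    # Rank surveys by their draw and keep the first `fraction` of them (rounded down).
    ranked = sorted(draws, key=draws.get)
    n_keep = int(len(ranked) * fraction)
    keep = set(ranked[:n_keep])

    # Filter rows in their original order; whole surveys are kept or dropped together.
    return [row for row in rows if row["Surv_ID"] in keep]


# Sentences per comment for each survey, mirroring the example data in the question.
layout = {"A001": [2, 3], "A002": [2, 3, 3], "A003": [1, 3, 2, 1]}
rows = [
    {"Surv_ID": surv_id, "Comment_ID": comment_id, "Sentence_ID": f"{sentence:03d}"}
    for surv_id, sentence_counts in layout.items()
    for comment_id, n_sentences in enumerate(sentence_counts, start=1)
    for sentence in range(1, n_sentences + 1)
]

# With only 3 surveys, 5% rounds down to zero, so this demo uses 50% (one survey).
sample = sample_surveys(rows, fraction=0.5, seed=0)
```

Whichever survey is drawn, `sample` contains all of its rows and none from the other surveys, which is the property that simple random sampling of rows does not give.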

Upvotes: 1
