colbythenoob

Reputation: 69

Copy rows from CSV based on column value; then split into separate shuffled CSVs

Shell noob here, so be gentle. This is a preprocessing exercise using the command line (I'm on a Mac).

I have a large CSV file (original.csv) of ~1M rows and 4 columns. I'd like to create a processing script that pulls out all rows matching each column value, i.e. groups the rows by the value in column 1. There are 138393 distinct values in column 1. I'm doing this via awk.

From there I want to take ~half of each value's rows, shuffle them (or randomly select), and then split the two halves into two CSV files (file1.csv and file2.csv). FWIW it's for a machine learning exercise, so I'm splitting the data into test/train sets.

What's an efficient way to do this? The biggest bottlenecks I have now (there are probably more I don't see) are:

  1. Searching for column value match via awk
  2. IO cost of copying rows to separate CSVs, then going through each CSV and appending half the rows to train.csv and half to test.csv
  3. Shuffling each file above

BONUS: any multithreaded solutions to speed up the entire process!

My CSV data is basic (and already sorted by the column 1 value):

1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727

CODE:

#!/bin/bash

DATA_FILE=noheader.csv
awk -F "," '{ f = "r" $1 ".csv"; print >> f; close(f) }' "$DATA_FILE"           # Creates a separate CSV file for each userID

ID_FILE=unique_ids.txt
if [ ! -e "$ID_FILE" ]
then
    printf "File %s Not Found! \n" "$ID_FILE"
    printf "Creating Unique IDs File \n"
    cut -d , -f1 "$DATA_FILE" | sort | uniq > "$ID_FILE"
fi
IDX=$(wc -l < "$ID_FILE")                                                       # Count of unique userIDs (must be set whether or not the file already existed)
printf "Found %d userIDs \n" "$IDX"

COUNT=0
START=$(date +%s)
while read -r ID                                                                # Iterate over the actual userIDs rather than assuming they run 1..N
do
    FILE=r${ID}.csv

    TOT_LNO=$(wc -l < "$FILE")                                                  # Total number of rows for this user
    SPLT_NO=$((TOT_LNO / 2))                                                    # ~50% split of user row count for test/train split

    gshuf -o "$FILE" "$FILE"                                                    # Shuffles the rows in place (without -o the shuffled output is discarded)

    head -n "$SPLT_NO" "$FILE" >> train_data.csv                                # Appends the first half of user{n} ratings to training data
    OFFSET=$((SPLT_NO + 1))
    tail -n +"$OFFSET" "$FILE" >> test_data.csv                                 # Appends the remaining rows of user{n} ratings to test data

    # awk 'FNR==NR{a[$1];next}($1 in a){print}' file2 file1                     # Prints out similarities between files (make sure there's no train/test slippage)
    rm "$FILE"                                                                  # Deletes temp user rating files before proceeding

    ((COUNT++))
    if ! ((COUNT % 10000))
    then
        printf "processed %d files!\n" "$COUNT"
    fi
done < "$ID_FILE"

END=$(date +%s)
TIME=$((END - START))
printf "processing runtime: %d seconds\n" "$TIME"

OUTPUT (assuming it was shuffled):

train.csv 
1,2,3.5,1112486027
1,47,3.5,1112484727

test.csv
1,32,3.5,1112484819
1,29,3.5,1112484676

Upvotes: 2

Views: 627

Answers (2)

agc

Reputation: 8406

The method below is slightly faster than the accepted awk answer.

Using shuf, GNU split's -n option, and mv:

grep '^1,' noheader.csv | shuf | split -n r/2 ; mv xaa train.csv ; mv xab test.csv

This won't work on a Mac, since macOS ships BSD split, which has no -n option.
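
For reference, split -n r/2 distributes the input lines round-robin into two chunks (written as xaa and xab by default), and since it doesn't need to know the input size up front it works on piped input. A possible Mac workaround (assuming GNU coreutils installed via Homebrew, which prefixes the tools with a g, as with the gshuf already used in the question) would be roughly:

grep '^1,' noheader.csv | gshuf | gsplit -n r/2 ; mv xaa train.csv ; mv xab test.csv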

Upvotes: 0

Ed Morton

Reputation: 204035

I'm guessing here, since you didn't provide sample input and expected output that we could test against, but it sounds like all you need is:

shuf infile.csv | awk -F, '$1==1{ print > ("outfile" (NR%2)+1 ".csv") }'
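
Here (NR%2)+1 evaluates to 1 or 2 depending on the shuffled line number, so the rows matching $1==1 land roughly half in outfile1.csv and half in outfile2.csv. If the same random half-split is wanted for every distinct value in column 1 rather than just $1==1, a per-key counter is one possible generalization (an untested sketch; on a Mac, shuf would be gshuf):

shuf infile.csv | awk -F, '{ print > ("outfile" (++n[$1]%2)+1 ".csv") }'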

If that's not what you want then edit your question to include concise, testable sample input and expected output.

Upvotes: 3
