Reputation: 69
noob so be gentle. This is a preprocessing exercise using the command line (I'm on a Mac).
I have a large CSV file (original.csv), ~1M rows, 4 cols. I'd like to create a processing script that pulls out the rows for each distinct value in column 1; there are 138,393 distinct values in that column. I'm doing the above via awk.
From here I want to take ~half of the rows for each of these values, shuffle (or randomly select) them, then split the two groups into two CSV files (file1.csv and file2.csv). FWIW it's for a machine learning exercise, so I'm splitting the data into test/train sets.
What's an efficient way to do this? The biggest bottleneck I have right now (there are probably more I don't see) is the awk step that splits the input into one file per user ID.
...BONUS: Any multithreaded solutions to speed up the entire process!
My CSV Data is basic (and sorted by column 1 value already):
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
CODE:
#!/bin/bash
DATA_FILE=noheader.csv
awk -F "," '{ print >> ("r"$1".csv"); close("r"$1".csv") }' $DATA_FILE # Creates seperate CSV file for each userID
ID_FILE=unique_ids.txt
if [ ! -e "$ID_FILE" ]
then
printf "File %s Not Found! \n" "$ID_FILE"
printf "Creating Unique IDs File \n"
cut -d , -f1 "$DATA_FILE" | sort -u > "$ID_FILE"
fi
IDX=$(wc -l "$ID_FILE" | awk '{print $1}') # Count of distinct userIDs (one per line in unique_ids.txt)
printf "Found %d userIDs \n" "$IDX"
COUNT=0
START=$(date +%s)
for ((i=1; i <= IDX; i++)) # Iterate through each user CSV file
do
FILE=r${i}.csv
TOT_LNO=$(wc -l "$FILE" | awk '{print $1}') # Total number of rows in this user's file
SPLT_NO=$((TOT_LNO / 2)) # ~50% split of user row count for test/train split
gshuf "$FILE" -o "$FILE" # Shuffle rows in place (gshuf reads all input before opening the output)
head -n "$SPLT_NO" "$FILE" >> train_data.csv # Append first SPLT_NO rows of user{n} ratings to training data
OFFSET=$((SPLT_NO + 1))
tail -n +"$OFFSET" "$FILE" >> test_data.csv # Append remaining rows of user{n} ratings to test data
# awk 'FNR==NR{a[$1];next}($1 in a){print}' file2 file1 # Print rows common to both files (check for train/test slippage)
rm "$FILE" # Delete temp user rating file before proceeding
((COUNT++))
if ! ((COUNT % 10000))
then
printf "processed %d files!\n" "$COUNT"
fi
done
END=$(date +%s)
TIME=$((END-START))
printf "processing runtime: %d:\n" $TIME
OUTPUT (assuming it was shuffled):
train_data.csv
1,2,3.5,1112486027
1,47,3.5,1112484727
test_data.csv
1,32,3.5,1112484819
1,29,3.5,1112484676
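A possible faster variant of the awk split at the top of the script (a sketch only, not benchmarked against the full data): since noheader.csv is already sorted by column 1, each per-user file can be closed once, when that ID's block of rows ends, instead of being reopened on every row:
# Sketch: input sorted by column 1, so each r<ID>.csv is written in one streak and closed once
awk -F, '$1 != prev { if (prev != "") close("r" prev ".csv"); prev = $1 } { print > ("r" $1 ".csv") }' noheader.csv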
Upvotes: 2
Views: 627
Reputation: 8406
The method below is slightly faster than the accepted awk answer.
Using shuf, GNU split's -n option, and mv:
grep '^1,' noheader.csv | shuf | split -n r/2 ; mv xaa train.csv ; mv xab test.csv
This won't work on a Mac, since macOS ships BSD split, which has no -n option.
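A possible workaround on a Mac (a sketch, untested): use gshuf from Homebrew coreutils, as the question already does, and let awk do the round-robin that split -n r/2 would have done:
# Odd lines go to train.csv, even lines to test.csv (same round-robin as split -n r/2)
grep '^1,' noheader.csv | gshuf | awk 'NR % 2 { print > "train.csv"; next } { print > "test.csv" }'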
Upvotes: 0
Reputation: 204035
I'm guessing here, since you didn't provide sample input and expected output that we could test against, but it sounds like all you need is:
shuf infile.csv | awk -F, '$1==1{ print > ("outfile" (NR%2)+1 ".csv") }'
If that's not what you want then edit your question to include concise, testable sample input and expected output.
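If the goal is the same ~50/50 split for every user ID rather than just ID 1, a hedged variation on the same idea (untested; train.csv and test.csv are placeholder names) alternates each ID's shuffled rows between the two outputs:
# Round-robin each user's shuffled rows, so every ID lands roughly half in each file
shuf infile.csv | awk -F, '{ print > ((++n[$1] % 2) ? "train.csv" : "test.csv") }'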
Upvotes: 3