Reputation: 5282
I have a 40M+ CSV file. One of the columns is a binary indicator (-1, 1). Is there a Linux command that creates a new file in which rows with -1 and 1 alternate?
Old:
1,x,y
-1,t,r
-1,e,t
1,r,t
New:
1,x,y
-1,t,r
1,r,t
-1,e,t
It doesn't have to follow any particular logic about how the -1 and 1 rows are shuffled (could be random), as long as it alternates one row of each. I'm on Ubuntu 12.04.
Upvotes: 2
Views: 1189
Reputation: 37298
Here is a shell/awk solution. It is not the most efficient, but given the speed of modern machines that shouldn't be an issue.
First, split the data into positive and negative values:
awk '/^-/{print}' minus1Pos1data.txt > negsData.txt
awk '/^[^-]/{print}' minus1Pos1data.txt > posData.txt
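If reading the large file twice is a concern, the split can probably be done in a single pass. This is only a sketch: split_by_sign and the demo file location are names made up for illustration, and like the two commands above it assumes the indicator is the first column.

```shell
#!/bin/sh
# split_by_sign is a hypothetical helper name, not part of the original answer.
# $1: input CSV, $2: file for rows starting with "-", $3: file for all other rows.
# A target file is only created if at least one row is written to it.
split_by_sign() {
    awk -v neg="$2" -v pos="$3" '{
        if (/^-/) { print > neg } else { print > pos }
    }' "$1"
}

# Demo on throwaway sample data (assumed layout: indicator in column 1).
demo=$(mktemp -d)
printf '%s\n' '1,x,y' '-1,t,r' '-1,e,t' '1,r,t' > "$demo/data.txt"
split_by_sign "$demo/data.txt" "$demo/negsData.txt" "$demo/posData.txt"
cat "$demo/posData.txt" "$demo/negsData.txt"
rm -r "$demo"
```

The relative order of the rows within each output file is the same as in the input, so the merge step can be used unchanged.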
Now merge the two files, using an awk array to hold the first file. You can swap the order of the files if you want the negative rows to come first.
awk 'pass==1{pos[FNR]=$0} pass==2{print pos[FNR]; print}' pass=1 posData.txt pass=2 negsData.txt > alternateRows.txt
cat alternateRows.txt
1,x,y
-1,t,r
1,r,t
-1,e,t
awk evaluates the variable assignments on the command line (pass=1, then pass=2) as it reaches them in the argument list, and tests them inside the awk code (pass==1 vs. pass==2), performing only the block whose test is true. Note that pass=1 is an assignment statement, while pass==1 is an equality test.
The first pass loads the first file into the array pos, with the current file's record number (FNR) as the key.
The second pass uses its own current record number (FNR) to fetch the matching record from the pos array and print it. The bare print could also be written print $0, which means print the current line (from the pass=2 file).
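The merge above pairs rows purely by record number, so when the two files differ in length it prints an empty line for each unpaired negative row and silently drops surplus positive rows. A possible variation, shown only as a sketch, keeps the leftover rows at the end instead. The function name interleave_two_pass and the sample files are hypothetical. Note that the whole first file is held in memory, as in the original.

```shell
#!/bin/sh
# interleave_two_pass is a hypothetical name.
# $1: file of positive rows, $2: file of negative rows.
interleave_two_pass() {
    awk '
        # pass is set to 1 before the first file is read: buffer its rows.
        pass == 1 { pos[FNR] = $0; npos = FNR }
        # pass is set to 2 before the second file: emit the buffered row
        # with the same record number (if any), then the current row.
        pass == 2 { if (FNR <= npos) print pos[FNR]; print; nneg = FNR }
        # Append buffered rows that had no partner in the second file.
        END       { for (i = nneg + 1; i <= npos; i++) print pos[i] }
    ' pass=1 "$1" pass=2 "$2"
}

# Demo with three positive rows and two negative rows.
demo=$(mktemp -d)
printf '%s\n' '1,x,y' '1,r,t' '1,z,z' > "$demo/posData.txt"
printf '%s\n' '-1,t,r' '-1,e,t'       > "$demo/negsData.txt"
interleave_two_pass "$demo/posData.txt" "$demo/negsData.txt"
rm -r "$demo"
```

In the demo the unpaired row 1,z,z is printed last, so the tail of the output no longer alternates. Whether to keep or discard such rows depends on what the data is used for.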
IHTH.
Upvotes: 2
Reputation: 2343
Here's a one-liner:
paste -d"\n" <( grep '^1,' test.txt ) <( grep '^-1,' test.txt )
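With 40M+ rows it may be worth confirming that the result really alternates rather than checking it by eye. The following is a rough sketch of such a check. check_alternating is a made-up helper name, and it assumes the indicator is the first comma-separated field, as in the sample data.

```shell
#!/bin/sh
# check_alternating is a hypothetical helper, not a standard tool.
# It returns 0 when the first field differs from the previous row's on every
# row, and otherwise reports the first offending row and returns 1.
check_alternating() {
    awk -F, '
        NR > 1 && $1 == prev { print "row " NR " repeats " $1; bad = 1; exit }
        { prev = $1 }
        END { exit bad + 0 }
    ' "$1"
}

# Demo: one file that alternates and one that does not.
demo=$(mktemp -d)
printf '%s\n' '1,x,y' '-1,t,r' '1,r,t' '-1,e,t' > "$demo/good.txt"
printf '%s\n' '1,x,y' '-1,t,r' '-1,e,t' '1,r,t' > "$demo/bad.txt"
check_alternating "$demo/good.txt" && echo "good.txt alternates"
check_alternating "$demo/bad.txt" || echo "bad.txt does not alternate"
rm -r "$demo"
```

Redirecting the output of the one-liner to a file and running the check on that file should be enough to catch unequal group sizes, since paste emits empty lines once the shorter input runs out.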
Upvotes: 1
Reputation: 2103
Here is another solution using the grep, shuf and paste commands:
shuffle1-1.sh
#!/usr/bin/env bash
# Usage: ./shuffle1-1.sh FILE
if [ $# -eq 0 ]
then
    echo "must provide a file as 1st parameter..." >&2
    exit 1
fi
input=$1
# split data between pos and neg values and shuffle them in temporary files
# (patterns are anchored: assumes the indicator is the first column, as in the sample)
grep -v '^-1,' "$input" | shuf > tmp_subset1
grep '^-1,' "$input" | shuf > tmp_subsetm1
# alternate 1 and -1 line
paste -d"\n" tmp_subset1 tmp_subsetm1
# cleanup
rm tmp_subset1 tmp_subsetm1
output
# ./shuffle1-1.sh test.data
1,x,y
-1,t,r
1,r,t
-1,e,t
# ./shuffle1-1.sh test.data
1,x,y
-1,e,t
1,r,t
-1,t,r
# cat test.data
1,x,y
-1,t,r
-1,e,t
1,r,t
If your file does not have the same number of lines with 1 and -1, adding | grep 1 at the end should get rid of the blank lines:
# ./shuffle1-1.sh test.data2
1,z,z
-1,e,t
1,x,y
-1,t,r
1,r,t
1,Z,Z
# ./shuffle1-1.sh test.data2 | grep 1
1,r,t
-1,t,r
1,x,y
-1,e,t
1,z,z
1,Z,Z
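Note that after the blank lines are filtered out, the surplus rows end up next to each other (1,z,z followed by 1,Z,Z above), so the tail no longer alternates. If strict alternation matters more than keeping every row, one option is to drop the incomplete pairs instead. This is only a sketch: keep_full_pairs is a made-up name, and it assumes its input is the raw output of paste, where every two lines form one pair.

```shell
#!/bin/sh
# keep_full_pairs is a hypothetical helper name.
# Reads paste-style output on stdin (lines 1+2, 3+4, ... form pairs) and
# prints only the pairs in which neither line is empty.
keep_full_pairs() {
    awk '
        NR % 2 == 1 { first = $0; next }               # remember the odd line
        first != "" && $0 != "" { print first; print } # emit complete pairs only
    '
}

# Demo: four positive rows and two negative rows, as paste would emit them.
printf '%s\n' '1,z,z' '-1,e,t' '1,x,y' '-1,t,r' '1,r,t' '' '1,Z,Z' '' |
    keep_full_pairs
```

In the demo only the first two pairs survive, so the output is four rows long and strictly alternating. It would be used as ./shuffle1-1.sh test.data2 | keep_full_pairs once the function is defined in the calling shell.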
Upvotes: 1