Reputation: 5282

linux command to shuffle a large csv file to alternate rows according to a pattern

I have a 40M+ CSV file. One of the columns is a binary indicator (-1,1). I'd like to know if there is a Linux command to create a new file that alternates rows with -1 and 1.

Old:

1,x,y
-1,t,r
-1,e,t
1,r,t

New:

1,x,y
-1,t,r
1,r,t
-1,e,t

It doesn't have to follow any particular logic about how -1 and 1 are shuffled (could be random) as long as it alternates one row of each. I'm on Ubuntu 12.04.

Upvotes: 2

Answers (3)

shellter

Reputation: 37298

Here is a shell/awk solution. Not the most efficient, but given the speed of modern machines, shouldn't be an issue.

First, split the data into positive and negative values.

awk '/^-/{print}' minus1Pos1data.txt > negsData.txt
awk '/^[^-]/{print}' minus1Pos1data.txt > posData.txt

Now merge the two files, using an awk array to hold the first file. You can swap the order if you want the negative rows first.

awk 'pass==1{pos[FNR]=$0} pass==2{print pos[FNR]; print}' pass=1 posData.txt pass=2 negsData.txt > alternateRows.txt

cat alternateRows.txt
1,x,y
-1,t,r
1,r,t
-1,e,t

awk processes the variable assignments on the command line (pass=1, pass=2) as it reaches them in the argument list, so pass is 1 while posData.txt is read and 2 while negsData.txt is read. The tests pass==1 and pass==2 inside the awk code select which block runs for each record. Note that pass=1 is an assignment statement, while pass==1 is an equality test.

First pass loads the first file into an array pos with the current file's record-number (FNR) as the key.

The second pass uses its current record number (FNR) to look up the matching record in the pos array and prints it. The bare print that follows is equivalent to print $0, which prints the current line (from the pass=2 file).
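
The command-line assignment mechanism can be seen in isolation with a small sketch. The helper name show_pass and the file names in the usage line are made up for illustration; the only behavior relied on is that awk applies each var=value operand just before it opens the file that follows it.

```shell
# show_pass is a hypothetical helper: it prefixes every line with the value
# that "pass" holds while the file containing that line is being read.
show_pass() {
    # pass=1 takes effect before "$1" is opened, pass=2 before "$2"
    awk '{ print pass ": " $0 }' pass=1 "$1" pass=2 "$2"
}
```

Called as show_pass a.txt b.txt, every line of a.txt comes out prefixed with "1: " and every line of b.txt with "2: ", which is the same switch the merge command uses to decide between storing and printing.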

IHTH.

Upvotes: 2

Aaron Okano

Reputation: 2343

Here's a one-liner:

paste -d"\n" <( grep '^1,' test.txt ) <( grep '^-1,' test.txt )
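
The question only says the indicator is "one of the columns", so here is a hedged sketch of the same split-then-paste idea for an indicator that is not in the first field. The function name alternate_by_column and its arguments are hypothetical; it assumes a plain comma-separated file without quoted commas, and it reads the input once and writes two temporary files instead of using process substitution.

```shell
# alternate_by_column FILE [COL]  (hypothetical helper, COL defaults to 1)
# Assumes simple CSV: fields separated by commas, no quoted commas.
alternate_by_column() {
    file=$1
    col=${2:-1}
    tmpdir=$(mktemp -d) || return 1

    # Single pass over the input: route each row by the value in column COL.
    awk -F, -v c="$col" -v pos="$tmpdir/pos" -v neg="$tmpdir/neg" '
        $c == 1  { print > pos }
        $c == -1 { print > neg }
    ' "$file"

    # Make sure both files exist even if one group is empty.
    touch "$tmpdir/pos" "$tmpdir/neg"

    # Interleave: one row from pos, one row from neg, and so on.
    paste -d '\n' "$tmpdir/pos" "$tmpdir/neg"

    rm -rf "$tmpdir"
}
```

For example, alternate_by_column data.csv 2 > alternated.csv. As with the one-liner, paste emits empty lines once the shorter group runs out.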

Upvotes: 1

David L.

Reputation: 2103

Here is another solution using the grep, shuf and paste commands:

shuffle1-1.sh

#!/usr/bin/env bash

input=$1

if [ $# -eq 0 ]
  then
    echo "must provide a file as 1st parameter..." >&2
    exit 1
fi

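# split data between pos and neg values and shuffle them
# in temporary files (assumes the indicator is the first column,
# so a -1 elsewhere in a row does not misclassify it)
grep -v '^-1,' "$input" | shuf > tmp_subset1
grep '^-1,' "$input" | shuf > tmp_subsetm1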

# alternate 1 and -1 line
paste -d"\n" tmp_subset1 tmp_subsetm1

# cleanup
rm tmp_subset1
rm tmp_subsetm1

output

# ./shuffle1-1.sh test.data
1,x,y
-1,t,r
1,r,t
-1,e,t
# ./shuffle1-1.sh test.data
1,x,y
-1,e,t
1,r,t
-1,t,r
# cat test.data
1,x,y
-1,t,r
-1,e,t
1,r,t

If your file does not have the same number of lines with 1 and -1, paste pads the shorter list with blank lines. Adding | grep 1 at the end should get rid of them, since every data row contains a 1:

# ./shuffle1-1.sh test.data2
1,z,z
-1,e,t
1,x,y
-1,t,r
1,r,t

1,Z,Z

# ./shuffle1-1.sh test.data2 | grep 1
1,r,t
-1,t,r
1,x,y
-1,e,t
1,z,z
1,Z,Z
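
Filtering the blank lines keeps the leftover rows, so the tail of the output no longer alternates. If strict alternation matters more than keeping every row, a possible variant is to drop the unpaired rows instead. This is only a sketch: alternate_strict is a hypothetical name, and it assumes neither input file contains a genuinely empty row.

```shell
# alternate_strict POS_FILE NEG_FILE  (hypothetical helper)
# Interleaves the two files and keeps only complete 1 / -1 pairs.
alternate_strict() {
    paste -d '\n' "$1" "$2" |
    awk '
        # Odd output lines come from POS_FILE: remember the row, wait for its partner.
        NR % 2 { first = $0; next }
        # Even lines come from NEG_FILE: print the pair only if neither side is padding.
        first != "" && $0 != "" { print first; print }
    '
}
```

In the script above it could replace the final paste line, e.g. alternate_strict tmp_subset1 tmp_subsetm1. It streams through the data, so nothing is held in memory beyond one row.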

Upvotes: 1
