Rajan Raju

Reputation: 123

How to discard rows based on a common column of two files using Linux commands?

file1.csv

col1,col2,col3
a,b,c
b,v,n
x,u,v
t,m,m

file2.csv

col1,col2,col3
p,m,n
a,z,i

col1 acts as a primary key in both files. If any col1 value of file2.csv also occurs in col1 of file1.csv, then that row will be discarded from file1.csv.

output file:

col1,col2,col3
b,v,n
x,u,v
t,m,m

Note: I'm interested in a Unix-based solution. Please provide a solution using sort, uniq, join, etc.

Upvotes: 0

Views: 130

Answers (4)

jubilatious1

Reputation: 2289

Using Raku (formerly known as Perl_6)

    ~$ raku -e ' \
    my @needle   = "/path/to/file2.csv".IO.lines.skip(1).map: *.split(","); \
    my @haystack = "/path/to/file1.csv".IO.lines.skip(1).map: *.split(","); \
    my @x = ([Z] @needle).[0]; my @y = @haystack.map(*.[0,3...*]); \
    .join(",").put for @haystack[@y.grep({!/@x/}, :k)]; '

Sample Output:

b,v,n
x,u,v
t,m,m

Above is a solution using Raku, a member of the Perl family of programming languages.

Briefly, each file is read line-wise, skipping the first (header) row with skip, and map is used to split each line on "," (commas). Each csv file is thus stored as an @-sigiled Raku array.

After these two lines of object creation, each array is distilled down to a single column of values: the @needle array is reduced to @x, while the @haystack array is reduced to @y. To highlight Raku's flexibility (TIMTOWTDI, "there is more than one way to do it"), two different ways of extracting the desired column are shown:

  • @x uses the [Z] "zip" reduction operator to transpose the rows, so that taking element [0] of the result yields the first element of each row (i.e. the first column).
  • @y uses a numeric indexing sequence [0,3...*] to pull out the desired elements. Raku is smart enough to figure out the remaining sequence values [0,3,6,9].

Finally, Raku's grep function is used to search for matches between @x and @y. The :k (key) "adverb" parameter tells Raku to return the numeric index positions of matches, or in this case non-matches, because the Boolean block passed to grep negates the match: {!/@x/}.

ADDENDUM: As @EdMorton has noted in the comments, the code above does not output the column names (the first row of file1.csv). That's easily remedied by adding the following statement at the start of the one-liner (i.e. before the my @needle = ... line):

"/path/to/file1.csv".IO.lines[0].put;

https://raku.org

Upvotes: 0

Ed Morton

Reputation: 203229

Using any awk in any shell on every Unix box:

$ awk -F, '
    FNR==1 { if (NR==1) print; next }   # header lines: print the first one seen, skip the rest
    NR==FNR { a[$1]; next }             # first file (file2.csv): remember each col1 value as a key of a
    !($1 in a)                          # second file (file1.csv): print rows whose col1 is not a key of a
' file2.csv file1.csv
col1,col2,col3
b,v,n
x,u,v
t,m,m
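
To capture the result in a new file rather than on the terminal, the same program can be written on one line and redirected; output.csv below is just a placeholder name:

awk -F, 'FNR==1{if (NR==1) print; next} NR==FNR{a[$1]; next} !($1 in a)' file2.csv file1.csv > output.csv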

Upvotes: 1

jubilatious1

Reputation: 2289

Using the R programming language:

> needle <- read.csv("/path/to/file2.csv", stringsAsFactors=FALSE)
> needle
  col1 col2 col3
1    p    m    n
2    a    z    i
> 
> haystack <- read.csv("/path/to/file1.csv", stringsAsFactors=FALSE)
> haystack
  col1 col2 col3
1    a    b    c
2    b    v    n
3    x    u    v
4    t    m    m
> 
> haystack[!(haystack$col1 %in% needle$col1), ]
  col1 col2 col3
2    b    v    n
3    x    u    v
4    t    m    m
> 

The above code should be pretty self-explanatory. For two-dimensional data in R, indexing is done with comma-separated subscripts: for a dataframe indexed as csv_data[i, j], the i subscript selects rows while the j subscript selects columns.

[Indexing in R starts from 1, not 0. Also, adding stringsAsFactors=FALSE to each read.csv() call may be superfluous depending on the age of your R installation; it has been the default since R 4.0.]

The last line gives the desired result, and you can read the code somewhat literally to mean: "across all columns of the haystack dataframe, find and return the rows whose haystack$col1 value is not among (i.e. not %in%) the needle$col1 values".
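
Since the question asks for something that runs from the Unix command line, the same logic can also be driven non-interactively through Rscript. A minimal sketch, assuming R is installed and that output.csv is just a placeholder name for the result file:

Rscript -e '
  needle   <- read.csv("/path/to/file2.csv")
  haystack <- read.csv("/path/to/file1.csv")
  write.csv(haystack[!(haystack$col1 %in% needle$col1), ],
            "output.csv", row.names=FALSE, quote=FALSE)'

With quote=FALSE and row.names=FALSE the written file keeps the plain comma-separated layout of the input, header included.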

https://www.r-project.org/
https://cran.r-project.org/index.html

Upvotes: 0

Ionuț G. Stan

Reputation: 179109

There's a join command on Unix that does pretty much exactly what you want:

join -v1 -t, \
  <(tail +2 file1.txt | sort -k1 -t,) \
  <(tail +2 file2.txt | sort -k1 -t,)

For the sample files you've given, this is its output:

b,v,n
t,m,m
x,u,v

Command Breakdown

  • join -v1 -t,
    • -v1: display rows from the 1st file that are not pairable with lines in the 2nd file via the join column (default column used for joining is 1, but it's overridable via the -1 and -2 options)
    • -t,: use comma as a field/column separator
  • <(tail +2 file1.txt | sort -k1 -t,)
    • <( … ): the join command expects file names as arguments, so we use process substitution to create such temporary files from the output of the nested commands
    • tail +2 file1.txt: skip the header line (this also keeps the header out of the final output; see the sketch after this list)
    • sort -k1 -t,: the join command expects sorted files
      • -t,: use comma as a field/column separator
      • -k1: sort by the 1st field
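
Because of the two tail +2 calls, the col1,col2,col3 header is not part of the output. If the header from the question's expected result is wanted as well, a minimal sketch (assuming bash for the process substitutions, and output.csv as a placeholder file name) is to print it first and group the two commands:

{ head -n 1 file1.txt
  join -v1 -t, \
    <(tail +2 file1.txt | sort -k1 -t,) \
    <(tail +2 file2.txt | sort -k1 -t,)
} > output.csv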

Upvotes: 2
