Reputation: 123
file1.csv
col1,col2,col3
a,b,c
b,v,n
x,u,v
t,m,m
file2.csv
col1,col2,col3
p,m,n
a,z,i
col1
acts as a primary key in both files. If any value of col1
of file2.csv
occurs in file1.csv
, then this row will be discarded from file1.csv
.
output file:
col1,col2,col3
b,v,n
x,u,v
t,m,m
Note: I'm interested in Unix
based solution. please provide a solution using sort
, uniq
, join
, etc.
Upvotes: 0
Views: 130
Reputation: 2289
Using Raku (formerly known as Perl_6)
~$ raku -e ' \
my @needle = "/path/to/file2.csv".IO.lines.skip(1).map: *.split(","); \
my @haystack = "/path/to/file1.csv".IO.lines.skip(1).map: *.split(","); \
my @x = ([Z] @needle).[0]; my @y = @haystack.map(*.[0,3...*]); \
.join(",").put for @haystack[@y.grep({!/@x/}, :k)]; ' file
Sample Output:
b,v,n
x,u,v
t,m,m
Above is a solution using the Raku programming language, a member of the Perl family of programming languages.
Briefly, files are read line
-wise, skip
-ping the first (header) row, and map
is used to split
each line on ",
" (commas). Each csv
file is stored as an @
-sigiled Raku array.
After the two code lines of object creation, each array is distilled down into a single column of values. The @needle
array is distilled down to @x
while the @haystack
object is distilled down to @y
. To highlight Raku's flexibility (TIMTOWDI), two different ways of extracting the desired column are shown:
@x
uses the [Z]
"zip" operator to pull off elements of each row
such that the first elements of each row (i.e. the first column) are
extracted.@y
uses a numeric indexing sequence [0,3...*]
to pull out the
desired elements. Raku is smart enough to figure out the remaining
sequence values [0,3,6,9]
.Finally, Raku's grep
function is used to search for matches between @x
and @y
. The :k
(key) "adverb" parameter tells Raku to return the numeric indexing position of matches, or in this case--non-matches, because grep
uses a !{…}
Boolean block to negate the returned match values.
ADDENDUM: As @EdMorton has noted in the comments, the code above does not output column names (the first row of file1.csv
). That's easily remedied by prepending the following line (preferably making it the line before my @needle =
... .) :
"/path/to/file1.csv".IO.lines[0].put;
Upvotes: 0
Reputation: 203229
Using any awk in any shell on every Unix box:
$ awk -F, '
FNR==1 { if (NR==1) print; next }
NR==FNR { a[$1]; next }
!($1 in a)
' file2.csv file1.csv
col1,col2,col3
b,v,n
x,u,v
t,m,m
Upvotes: 1
Reputation: 2289
Using the R programming language:
> needle <- read.csv("/path/to/file2.csv", stringsAsFactors=FALSE)
> needle
col1 col2 col3
1 p m n
2 a z i
>
> haystack <- read.csv("/path/to/file1.csv", stringsAsFactors=FALSE)
> haystack
col1 col2 col3
1 a b c
2 b v n
3 x u v
4 t m m
>
> haystack[!(haystack$col1 %in% needle$col1), ]
col1 col2 col3
2 b v n
3 x u v
4 t m m
>
The above code should be pretty self-explanatory. For two-dimensional data in R, indexing is accomplished with comma-separated subscripts. As an example, for the dataframe csv_data[i,j]
the i
-indexing_value represents rows while the j
-indexing_value represents columns.
[Indexing in R
starts from number 1
, not 0
. Also, including the parameter stringsAsFactors=FALSE
to each of your read.csv()
function calls may be superfluous depending on the age of your R-installation].
The last line gives the desired result, and you can read the code somewhat literally to mean: "for all columns of the haystack dataframe find and return rows where haystack$col1
does not contain (i.e. not %in%
) needle$col1
values".
https://www.r-project.org/
https://cran.r-project.org/index.html
Upvotes: 0
Reputation: 179109
There's a join
command on Unix that does pretty much exactly what you want:
join -v1 -t, \
<(tail +2 file1.txt | sort -k1 -t,) \
<(tail +2 file2.txt | sort -k1 -t,)
For the sample files you've given, this is its output:
b,v,n
t,m,m
x,u,v
join -v1 -t,
-v1
: display rows from the 1st file that are not pairable with lines in the 2nd file via the join column (default column used for joining is 1, but it's overridable via the -1
and -2
options)-t,
: use comma as a field/column separator<(tail +2 file1.txt | sort -k1 -t,)
<( … )
: the join
command expects file names as arguments, so we use process substitution to create such temporary files from the output of the nested commandstail +2 file1.txt
: skip the header linesort -k1 -t,
: the join
command expects sorted files
-t,
: use comma as a field/column separator-k1
: sort by the 1st fieldUpvotes: 2