HKJ3
HKJ3

Reputation: 477

How to cross check values in one column in a file in another file and print the missing values?

I have a file that looks like this. It has 12 columns and 3244343 rows. Lets call this file 1.

variant_id  gene_id tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se    pval_nominal_threshold  min_pval_nominal    pval_beta
chr10_100000235_C_T_b38 ENSG00000227232.5   35211   73  74  0.061157    1.69779e-08 0.510322    0.0890939   0.0006160191.01823e-08  1.17701e-05
chr10_100002628_A_C_b38 ENSG00000227232.5   635545  126 130 0.107438    1.01823e-08 0.405406    0.0696647   0.0006160191.01823e-08  1.17701e-05
chr1_666028_G_A_b38 ENSG00000227232.5   636475  111 115 0.0950413   2.78462e-08 0.411513    0.0729864   0.0006160191.01823e-08  1.17701e-0

I have another file that looks like this with 7 headers and with 1633293 rows. This is file 2.

"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 101762385
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115

I am only interested in the variant_id column. This is the first column in both files.

How can I compare these two columns and only print those variant_id values in the first column that are not found in the second file. For the above example, the output should be

chr1_666028_G_A_b38 

because it's found in the first file but not the second.

All the values of the variant_id in the second file are also in the first file. But there are extra IDs in the first file not present in the second, which I want to identify.

I am using the command below:

$ awk '{print $1}' file2 > file2.names
$ grep -vf file2.names file1

However, the output is just all the values of variant_id in file 1.

Upvotes: 1

Views: 584

Answers (4)

RARE Kpop Manifesto
RARE Kpop Manifesto

Reputation: 2865

An awk-based solution that ensures the missing ones are only printed once :

  • (tested on mawk gawk and nawk)

|

awk -F'[ \t\"]' 'NF = (_=NR==FNR) ? __[$++_]++*!_ : ___[$!_]--==__[$!_]' f2 f1
chr1_666028_G_A_b38

alternative less-elegant but single-array approach:

awk 'NF = (_=NR==FNR)? __[$++_]++*!_ : (_=NR) == -(__[$!!_]-=_)' f2 f1

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 203995

Using any POSIX awk:

$ awk -F'[[:space:]"]' '
    NR==FNR { a[$2]; next }
    !($1 in a) { print $1 }
' file2 file1
chr1_666028_G_A_b38

Upvotes: 3

anubhava
anubhava

Reputation: 785541

You may use this awk:

awk 'FNR == NR {gsub(/"/, "", $1); a[$1]; next}
!($1 in a) {print $1}' file2 file1

chr1_666028_G_A_b38

Upvotes: 3

RavinderSingh13
RavinderSingh13

Reputation: 133610

1st solution: With your shown samples please try following awk code, written and tested in GNU awk. Both the codes will print 1st field from file1 if its NOT present in 2nd field, its taking care of addiotnal " present before and after $1 in file2.

awk '
FNR==1 { next }
FNR==NR{
  gsub(/^"|"$/,"",$1)
  arr[$1]
  next
}
!($1 in arr){
  print $1
}
' file2 file1


2nd solution OR without using gsub as used in above awk code a slight different from above one, where I am not doing substitution of " values present before and after in $1, I am simply wrapping $1 of file1 and checking if that is present in file2 or not here.

awk '
FNR==1 { next }
FNR==NR{
  arr[$1]
  next
}
!("\""$1"\"" in arr){
  print $1
}
' file2 file1

Upvotes: 3

Related Questions