Manojkumar K

Reputation: 41

Filtering CSV File using AWK

I'm working on a CSV file.

This is my CSV file.

Command used for filtering: awk -F"," '{print $14}' out_file.csv > test1.csv

This is an example of what my data looks like. I have around 43 columns and 12,000 rows. I planned to extract a single column using the awk command, but I can't extract column 3 (disease) on its own.

I used the following command to get my output:

awk -F"," '{print $3}' out_file.csv > test1.csv

This is my file:

gender|gene_name  |disease         |1000g_oct2014|Polyphen |SNAP 
male  |RB1,GTF2A1L|cancer,diabetes |0.1          |0.46     |0.1  
male  |NONE,LOC441|diabetes        |0.003        |0.52     |0.6  
male  |TBC1D1     |diabetes        |0.940        |1        |0.9  
male  |BCOR       |cancer          |0            |0.31     |0.2  
male  |TP53       |diabetes        |0            |0.54     |0.4  

Note: I did not use "|" as a delimiter; it is only there to show the columns in order. My data looks exactly like this in the spreadsheet.


But I'm getting the output in the following way:

Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53

When I open the file in a spreadsheet I get the results in the proper manner, but when I use awk, the commas inside column 2 are also treated as separators. I don't know why; can anyone help me with this?
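The symptom can be reproduced with a hypothetical one-line sample, assuming the embedded commas in the raw CSV are protected by double quotes (the actual file isn't shown, so the quoting is an assumption):

```shell
# awk -F',' knows nothing about CSV quoting, so it splits inside
# "RB1,GTF2A1L" and $3 becomes the tail of the gene list instead of
# the disease field.
echo 'male,"RB1,GTF2A1L","cancer,diabetes",0.1' | awk -F',' '{print $3}'
# prints: GTF2A1L"
```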

Upvotes: 0

Views: 2148

Answers (3)

Manojkumar K

Reputation: 41

At last, this is what I did to get my answer in a simple way. Thanks to @peak, I found the solution.

First I used csvfilter, a Python module for filtering CSV files, to change my delimiter with the following command:

csvfilter input_file.csv --out-delimiter="|" > out_file.csv

This command changes the delimiter from ',' to '|'. Then I used the awk command to filter:

awk -F"|" 'FNR == 1 { print; next } $14 < 0.01 { print }' out_file.csv > filtered_file.csv

Thanks for your help.

Upvotes: 0

peak

Reputation: 116690

Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)

Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)

The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
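A minimal sketch of that idea, assuming the file has already been converted to pipe-delimited form, with the comma-separated disease list in field 3 as in the sample above (the file name sample.psv is made up for the demo):

```shell
# Build a small pipe-delimited sample, then use awk's split() to break
# the third top-level field on its embedded commas.
printf 'male|RB1,GTF2A1L|cancer,diabetes|0.1\n' > sample.psv
awk -F'|' '{ n = split($3, d, ","); for (i = 1; i <= n; i++) print d[i] }' sample.psv
# prints: cancer
#         diabetes
```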

Upvotes: 0

Sobrique

Reputation: 53478

The root of your problem is that you have comma-separated values with embedded commas.

That makes life more difficult. I would suggest using a CSV parser.

I quite like perl and Text::CSV:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

open ( my $data, '<', 'data_file.csv' ) or die $!; 

my $csv = Text::CSV -> new ( { binary => 1, sep_char => ',', eol => "\n" } );

while ( my $row = $csv -> getline ( $data ) ) {
   print $row -> [2],"\n"; 
}

Of course, I can't tell for sure if that actually works, because the data you've linked on your Google Drive doesn't actually match the question you've asked. (Note: perl starts arrays at zero, so [2] is actually the 3rd field.)

But it should do the trick - Text::CSV handles quoted comma fields nicely.

Upvotes: 1
