hoytpr

Reputation: 5

How to grep different strings followed by a tab on multiple lines

Apologies, I could not find an answer that worked for me. Working on a Win10 machine with Cygwin and/or Git Bash, I have a file of sequence read names ("readsfile"), each followed by other information, with all fields separated by tabs. The reads file looks like this:

NB501827:133:HMV5HAFX2:1:11101:3747:1066    75  NODE_622711+_length_75_cov_990.55   100.000 43
NB501827:133:HMV5HAFX2:1:11101:8852:1068    74  NODE_622752+_length_4244_cov_356.337    100.000 74

I want to simply use grep to parse out the read names up to the first tab of each line, outputting the results to a separate file "readnames.txt". Not including the "tab" character would be a plus, but is fixable later. The output file "readnames.txt" should be:

NB501827:133:HMV5HAFX2:1:11101:3747:1066
NB501827:133:HMV5HAFX2:1:11101:3747:1066
NB501827:133:HMV5HAFX2:1:11101:8852:1068

(For now, the duplicated read names are okay.) I have tried a multitude of solutions found on this site. Some examples, taking into account grep vs. egrep vs. grep -E vs. Perl-style grep -P, include:

grep -oE $'^*\t' readsfile > readnames.txt
egrep '^NB*\t' readsfile > readnames.txt
grep -oE '^NB'$'\t' readsfile > readnames.txt
grep -oP 'NB*\t' readsfile > readnames.txt
grep -o $'NB*\t' readsfile > readnames.txt
grep -oE ^NB*$'\t' readsfile > readnames.txt
grep -o '[NB*|[[:space:]]]' readsfile > readnames.txt
grep -o ^NB*[[:space:]] readsfile > readnames.txt
grep -o $"NB*$'\t'" readsfile > readnames.txt
grep -o \<NB*\> readsfile > readnames.txt

Note that I have also used scripts to include "actual" tabs typed with <Ctrl-V><Tab>, for example grep -oE '^NB* ' readsfile > readnames.txt or grep -oE '^NB.* ' readsfile > readnames.txt, in most of the combinations used at the command line.

Also some other unsuccessful solutions:

sed -n 's/NB*\t/&/p' readsfile > readnames.txt
sed -n 's/*\t/&/p' readsfile > readnames.txt

I suspect this has been done before, but I need help. Thank you.

Upvotes: 0

Views: 475

Answers (2)

James K. Lowden

Reputation: 7837

Further to what zzevannn says, awk is your friend whenever you have columns delimited by something that can be described with a regular expression.

In this particular case,

$ awk '{print $1}' filename > newfile

does what you want. By default, awk separates fields on whitespace, and your first field has no whitespace, so you're golden. If it did, you could use awk's -F option to tell it to break on a tab, or something else.
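
As a minimal sketch of the -F case: the file below is made up for illustration (fs_demo.tsv and its contents are assumptions, not the asker's data) and puts a space inside the first field, so the difference between default splitting and tab splitting is visible.

```shell
# Hypothetical sample: tab-separated fields, with a space inside field 1.
printf 'read one\t75\tNODE_A\nread two\t74\tNODE_B\n' > fs_demo.tsv

# Default splitting (any run of whitespace) cuts the first field short.
awk '{print $1}' fs_demo.tsv > default_split.txt

# Splitting on tabs only keeps the whole first field.
awk -F'\t' '{print $1}' fs_demo.tsv > tab_split.txt

cat default_split.txt tab_split.txt
```

With the question's data the two commands should give the same result, since the read names contain no spaces.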

The clue to your difficulty for me was

"I want to simply use grep to parse out the read names"

grep(1) is built to find lines, not to parse them: it won't change a line, and the most it can do is print the matching part of a line with -o. Splitting a line into fields is a job for awk.
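
For completeness, a grep-only version is possible with -o, which prints just the matched text. This is a sketch, assuming bash (for the $'...' quoting) and a grep that supports -o, as GNU grep does. The file names grep_demo_reads.tsv and grep_demo_names.txt are made up so nothing real gets overwritten.

```shell
# Two lines modeled on the question's readsfile, tab-separated.
printf 'NB501827:133:HMV5HAFX2:1:11101:3747:1066\t75\tNODE_622711+_length_75_cov_990.55\t100.000\t43\n' > grep_demo_reads.tsv
printf 'NB501827:133:HMV5HAFX2:1:11101:8852:1068\t74\tNODE_622752+_length_4244_cov_356.337\t100.000\t74\n' >> grep_demo_reads.tsv

# $'...' makes bash turn \t into a real tab before grep sees the pattern.
# ^[^<tab>]+ matches from the start of the line up to, but not including,
# the first tab, so the tab itself is not part of the output.
grep -oE $'^[^\t]+' grep_demo_reads.tsv > grep_demo_names.txt

cat grep_demo_names.txt
```

The attempts in the question mostly fail because NB* means "N followed by zero or more B", not "NB followed by anything". The pattern has to say explicitly what may sit between the start of the line and the tab.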

To print the 1st field only the first time it's encountered, I would use

$ awk '!a[$1]++ {print $1}' filename > newfile

or

$ awk '{print $1}' filename | uniq > newfile

or

$ awk '{print $1}' filename | sort | uniq > newfile

or

$ awk '{print $1}' filename | sort -u > newfile

depending on the phase of the moon. ;-)
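
Joking aside, the variants differ when duplicates are not adjacent. Here is a small sketch on made-up data (dedup_demo.tsv and the r1/r2 names are hypothetical):

```shell
# Hypothetical read names with a non-adjacent duplicate: r2, r1, r2.
printf 'r2\tx\nr1\ty\nr2\tz\n' > dedup_demo.tsv

# awk array idiom: keeps first occurrences, in input order.
awk '!a[$1]++ {print $1}' dedup_demo.tsv > first_seen.txt

# uniq alone only collapses adjacent repeats, so the second r2 survives.
awk '{print $1}' dedup_demo.tsv | uniq > adjacent_only.txt

# sort -u (like sort | uniq) removes every duplicate but reorders the output.
awk '{print $1}' dedup_demo.tsv | sort -u > sorted_unique.txt

cat first_seen.txt adjacent_only.txt sorted_unique.txt
```

So the awk-only form is the one to reach for when input order matters, and plain uniq is enough only if identical names always sit on consecutive lines.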

Upvotes: 1

zzevannn

Reputation: 3724

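If you want everything from the first tab onward removed, including the tab itself, this sed command does that: sed 's/\t.*//g' (the \t escape is understood by GNU sed, which is what Cygwin and Git Bash ship).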

Alternatively, sed 's/\([^\t]*\)\t.*/\1/g' finds any non-tab character repeated any number of times, followed by a tab and any number of characters, captures the bit up to the first tab, and spits that out.
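
A quick sketch of the first substitution on made-up data (sed_demo.tsv is a hypothetical file). It assumes GNU sed for the \t escape; the second command is a fallback that assumes bash, which inserts a literal tab through $'...' so sed never has to interpret \t.

```shell
# Hypothetical tab-separated sample.
printf 'readA\t75\tNODE_1\nreadB\t74\tNODE_2\n' > sed_demo.tsv

# GNU sed: \t in the pattern means a tab; delete from the first tab onward.
sed 's/\t.*//' sed_demo.tsv > sed_out_gnu.txt

# Fallback: bash expands \t inside $'...' to a real tab before sed runs.
sed $'s/\t.*//' sed_demo.tsv > sed_out_literal_tab.txt

cat sed_out_gnu.txt sed_out_literal_tab.txt
```

The trailing g flag is harmless but not needed here, because .* already consumes the rest of the line in a single match.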

awk handles tab-delimited input very well too. awk -F'\t' '!a[$1]++ {print $1}' will print the deduped first field (delimited by tabs). This works by incrementing the value of an array element keyed on the first field: the first time a value is encountered the test evaluates to !0, which is true, so print fires; each subsequent time the same value is seen the test is !1, !2, etc., which is false, so nothing is printed. (With a bare {print}, awk would print the whole line rather than just the first field.)
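
The counting idiom can also be written out longhand, which may make the logic easier to follow. This is a sketch on made-up data; the array name seen and the file idiom_demo.tsv are illustrative choices, not anything from the question.

```shell
# Hypothetical sample with readA appearing twice.
printf 'readA\t75\nreadA\t74\nreadB\t60\n' > idiom_demo.tsv

awk -F'\t' '
{
    if (seen[$1] == 0) {   # first time this key appears: its count is still 0
        print $1           # print only the first field
    }
    seen[$1]++             # bump the count so later repeats are skipped
}' idiom_demo.tsv > idiom_out.txt

cat idiom_out.txt
```

The compact form !seen[$1]++ does the same test and increment in one expression, because the post-increment returns the old count before adding one.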

Upvotes: 1
