Reputation: 5
Apologies, I could not find an answer that worked for me. I am working on a Win10 machine with Cygwin and/or Git Bash, and I have a file ("readsfile") of sequence read names followed by other information, all separated by tabs. The reads file looks like this:
NB501827:133:HMV5HAFX2:1:11101:3747:1066 75 NODE_622711+_length_75_cov_990.55 100.000 43
NB501827:133:HMV5HAFX2:1:11101:8852:1068 74 NODE_622752+_length_4244_cov_356.337 100.000 74
I want to simply use grep to parse out the read names up to the first tab of each line, outputting the results to a separate file "readnames.txt". Not including the "tab" character would be a plus, but that is fixable later. The output file "readnames.txt" should be:
NB501827:133:HMV5HAFX2:1:11101:3747:1066
NB501827:133:HMV5HAFX2:1:11101:3747:1066
NB501827:133:HMV5HAFX2:1:11101:8852:1068
(For now, the duplicated read names are okay.) I have tried a multitude of solutions found on this site. Some examples, covering grep vs. egrep vs. grep -E vs. Perl-style grep, include:
grep -oE $'^*\t' readsfile > readnames.txt
egrep '^NB*\t' readsfile > readnames.txt
grep -oE '^NB'$'\t' readsfile > readnames.txt
grep -oP 'NB*\t' readsfile > readnames.txt
grep -o $'NB*\t' readsfile > readnames.txt
grep -oE ^NB*$'\t' readsfile > readnames.txt
grep -o '[NB*|[[:space:]]]' readsfile > readnames.txt
grep -o ^NB*[[:space:]] readsfile > readnames.txt
grep -o $"NB*$'\t'" readsfile > readnames.txt
grep -o \<NB*\> readsfile > readnames.txt
Note that I have also used "actual" tabs entered with <Ctrl-V><Tab>, as in
grep -oE '^NB* ' readsfile > readnames.txt
or grep -oE '^NB.* ' readsfile > readnames.txt
in most of the combinations used at the command line.
Also some other unsuccessful solutions:
sed -n 's/NB*\t/&/p' readsfile > readnames.txt
sed -n 's/*\t/&/p' readsfile > readnames.txt
I suspect this has been done before, but help is needed. Thank you.
Upvotes: 0
Views: 475
Reputation: 7837
Further to what zzevannn says, awk is your friend whenever you have columns delimited by something that can be described with a regular expression.
In this particular case,
$ awk '{print $1}' filename > newfile
does what you want. By default, awk separates fields on whitespace, and your first field has no whitespace, so you're golden. If it did, you could use awk's -F option to tell it to break on a tab, or something else.
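For instance, a minimal sketch of that -F form, using the question's filenames and assuming a literal tab is the only separator you care about:
$ awk -F'\t' '{print $1}' readsfile > readnames.txt   # split on tabs only; $1 is the read name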
The clue to your difficulty for me was
I want to simply use grep to parse out the read names
grep(1) will find things, but it won't parse: by default it prints the whole matching line, and even with -o you end up writing a regex for "everything up to the first tab" rather than simply asking for the first field.
To print the 1st field only the first time it's encountered, I would use
$ awk '!a[$1]++ {print $1}' filename > newfile
or
$ awk '{print $1}' filename | uniq > newfile
or
$ awk '{print $1}' filename | sort | uniq > newfile
or
$ awk '{print $1}' filename | sort -u > newfile
depending on the phase of the moon. ;-)
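For the question's filenames, any of these boils down to something like
$ awk '!a[$1]++ {print $1}' readsfile > readnames.txt
with the practical difference that uniq only collapses adjacent duplicates, sort -u dedupes everywhere but reorders the output, and the !a[$1]++ form dedupes everywhere while keeping the original order.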
Upvotes: 1
Reputation: 3724
If you want everything past the first tab removed, including the tab itself, this sed would do that: sed 's/\t.*//g'
Alternatively, sed 's/\([^\t]*\)\t.*/\1/g'
finds any non-tab character repeated any number of times, followed by a tab and any number of characters, captures the bit up to the first tab, and spits that out.
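Wired up to the question's files (and assuming GNU sed, which Cygwin and Git Bash provide, since \t in the pattern is a GNU extension), a sketch would be
sed 's/\t.*//' readsfile > readnames.txt
which leaves only what precedes the first tab on each line.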
awk handles tab-delimited input very well too. awk -F'\t' '!a[$1]++ {print}'
will print each line only the first time its first field (delimited by tabs) is seen. This works by inserting and incrementing a value in an array keyed on the first field, so the first time a key is encountered the expression evaluates to !0 and print fires; each subsequent time it evaluates to !1, !2, etc., which is false, so nothing is printed.
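To write just the read names into the question's output file, rather than the whole deduped lines, the same trick applied to $1 would be a sketch along the lines of
awk -F'\t' '!a[$1]++ {print $1}' readsfile > readnames.txt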
Upvotes: 1