Reputation: 7
I have an .ind file that looks like this:
I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
IREJ-T006.HO M Iran_Fars.HO
IREJ-T009.HO M Iran_Fars.HO
IREJ-T022.HO M Iran_Fars.HO
IREJ-T023.HO M Iran_Fars.HO
IREJ-T026.HO M Iran_Fars.HO
IREJ-T027.HO M Iran_Fars.HO
IREJ-T037.HO M Iran_Fars.HO
IREJ-T040.HO M Iran_Fars.HO
I am trying to subset it to only certain individuals, so I have a list of the required individuals in a .txt file that looks like this:
IREJ-T026.HO
IREJ-T027.HO
IREJ-T037.HO
IREJ-T040.HO
However, subsetting in EIGENSOFT with convertf only takes the population name, not the individual name, and as you can see, individuals from the same population share the same population name (Iran_Fars.HO).
How can I go through the first file, find only the lines for the individuals listed in the second file, and append something like "_B" to the end of the population name, so that the listed individuals have a different population name from the others?
Thanks in advance!
I have been trying to use awk or sed in some way, but I am a novice and cannot figure it out.
Upvotes: -3
Views: 68
Reputation: 84607
If I understand correctly, this can be done simply in awk. If you have file.ind and it contains:
I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
IREJ-T006.HO M Iran_Fars.HO
IREJ-T009.HO M Iran_Fars.HO
IREJ-T022.HO M Iran_Fars.HO
IREJ-T023.HO M Iran_Fars.HO
IREJ-T026.HO M Iran_Fars.HO
IREJ-T027.HO M Iran_Fars.HO
IREJ-T037.HO M Iran_Fars.HO
IREJ-T040.HO M Iran_Fars.HO
And you have file2.txt that contains:
IREJ-T026.HO
IREJ-T027.HO
IREJ-T037.HO
IREJ-T040.HO
And you want to append _B to all records in file.ind whose first field matches a record in file2.txt, you can do something similar to:
awk 'FNR == NR { a[$1]++; next } {if ($1 in a) { $0=$0 "_B" }; print}' file2.txt file.ind
How it Works
awk reads each file record-by-record (line-by-line) and applies each rule, written as condition { action }, to every record in the order the rules appear. Above there are two rules:
FNR == NR { a[$1]++; next }
FNR is the record number within the current file and NR is the total number of records read so far, so the two are equal only while the first file (file2.txt in the example) is being read. For each of its records, this rule stores the first field as an index in the array a and skips to the next record.
{if ($1 in a) { $0=$0 "_B" }; print}
This rule has no condition, so it runs for every record of file.ind (the 2nd file in the example). If the first field of the record matches an index of the array a, _B is appended to the record; the record is then printed either way.
So you read all the records from file2.txt as indexes of the array a, and whenever the first field of a record in file.ind is one of those indexes, you append _B.
Simply redirect the output to a newfile to save the results, e.g. awk ... file2.txt file.ind > newfile.ind
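To make the two rules easier to follow, here is a minimal, self-contained sketch of the same command spread over several lines with comments. The scratch directory (workdir) and the shortened sample data are assumptions made for illustration; only the awk program itself comes from the one-liner above.

```shell
# Hypothetical scratch directory holding a shortened copy of the two files
workdir=$(mktemp -d)
printf '%s\n' 'IREJ-T026.HO' 'IREJ-T027.HO' > "$workdir/file2.txt"
printf '%s\n' \
    'I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO' \
    'IREJ-T023.HO M Iran_Fars.HO' \
    'IREJ-T026.HO M Iran_Fars.HO' \
    'IREJ-T027.HO M Iran_Fars.HO' > "$workdir/file.ind"

awk '
    # Rule 1: FNR == NR holds only while the first file (file2.txt) is read.
    # Store each ID as an array index, then skip to the next record.
    FNR == NR { a[$1]++; next }

    # Rule 2: no condition, so it runs for every record of file.ind.
    # Append _B when the first field is a stored ID, then print the record.
    { if ($1 in a) { $0 = $0 "_B" }; print }
' "$workdir/file2.txt" "$workdir/file.ind" > "$workdir/newfile.ind"

# Show the saved result
cat "$workdir/newfile.ind"
```

Writing to a separate newfile.ind rather than back onto file.ind keeps the original intact, so the two can be compared before the new file is passed on to convertf.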
Example Use/Output
$ awk 'FNR == NR { a[$1]++; next } {if ($1 in a) { $0=$0 "_B" }; print}' file2.txt file.ind
I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
IREJ-T006.HO M Iran_Fars.HO
IREJ-T009.HO M Iran_Fars.HO
IREJ-T022.HO M Iran_Fars.HO
IREJ-T023.HO M Iran_Fars.HO
IREJ-T026.HO M Iran_Fars.HO_B
IREJ-T027.HO M Iran_Fars.HO_B
IREJ-T037.HO M Iran_Fars.HO_B
IREJ-T040.HO M Iran_Fars.HO_B
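One caveat: appending to $0 puts _B at the very end of the line, which is only the population name if nothing follows it. If your real file might have trailing whitespace (an assumption; the sample shown does not), a variant that appends to the third field instead may be safer. The sketch below uses a hypothetical scratch directory (ind_dir) and a record with deliberate trailing spaces to show the difference. Note that assigning to $3 makes awk rebuild the record with single spaces between fields.

```shell
# Hypothetical scratch directory; the second record ends in two spaces on purpose
ind_dir=$(mktemp -d)
printf '%s\n' 'IREJ-T026.HO' > "$ind_dir/file2.txt"
printf '%s\n' 'IREJ-T023.HO M Iran_Fars.HO' 'IREJ-T026.HO M Iran_Fars.HO  ' > "$ind_dir/file.ind"

# Same two-file idiom, but _B is appended to the third field (population name)
field_out=$(awk '
    FNR == NR { a[$1]; next }        # first file: remember each ID
    $1 in a   { $3 = $3 "_B" }       # listed individual: rename its population
              { print }              # print every record of the second file
' "$ind_dir/file2.txt" "$ind_dir/file.ind")

printf '%s\n' "$field_out"
```

With the $0 version, the second record would end in "Iran_Fars.HO  _B" instead, which convertf would most likely not treat as the intended population name.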
Let me know if that isn't what you are asking. I'm happy to help further.
Upvotes: 1
Reputation: 12425
Use this Perl one-liner:
perl -lane 'BEGIN { %req = map { chomp; $_ => 1 } `cat required.txt`; } if ( $req{ $F[0] } ) { $F[2] .= "_B" } print join "\t", @F;' infile.ind > outfile.ind
The Perl one-liner uses these command line flags:
-e: Tells Perl to look for code in-line, instead of in a file.
-n: Loop over the input one line at a time, assigning it to $_ by default.
-l: Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a: Split $_ into array @F on whitespace or on the regex specified in the -F option.
BEGIN { ... }: Execute the code inside this block before executing any of the implicit loops specified by the -n or other options.
%req = map { chomp; $_ => 1 } `cat required.txt`;: Read the entire required.txt file into hash %req, where the keys are each line (with terminal newline removed), and the values are 1.
if ( $req{ $F[0] } ) { $F[2] .= "_B" }: If the first field ($F[0]) was found in the %req hash, append _B to the third field ($F[2], or population).
print join "\t", @F;: Print all fields delimited by TAB.
Upvotes: 0