Savannah Hay

Reputation: 7

Edit the characters in one file based on the characters of another file

I have an .ind file that looks like this:

    I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
    I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
    IREJ-T006.HO M Iran_Fars.HO
    IREJ-T009.HO M Iran_Fars.HO
    IREJ-T022.HO M Iran_Fars.HO
    IREJ-T023.HO M Iran_Fars.HO
    IREJ-T026.HO M Iran_Fars.HO
    IREJ-T027.HO M Iran_Fars.HO
    IREJ-T037.HO M Iran_Fars.HO
    IREJ-T040.HO M Iran_Fars.HO

I am trying to subset it to certain individuals, so I have a list of the required individuals in a .txt file that looks like this:

   IREJ-T026.HO 
   IREJ-T027.HO
   IREJ-T037.HO
   IREJ-T040.HO

However, subsetting with EIGENSOFT's convertf only takes the population name, not the individual name, and as you can see, individuals from the same population share the same population name (Iran_Fars.HO).

How can I go through the first file, find only the lines for the individuals listed in the second file, and append something like "_B" to the end of the population name, so that the listed individuals have a different population name from the others?

Thanks in advance!

I have been trying to use awk or sed in some way, but I am a novice and cannot figure it out.

Upvotes: -3

Views: 68

Answers (2)

David C. Rankin

Reputation: 84607

If I understand correctly, this can be done simply in awk. If you have file.ind and it contains

    I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
    I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
    IREJ-T006.HO M Iran_Fars.HO
    IREJ-T009.HO M Iran_Fars.HO
    IREJ-T022.HO M Iran_Fars.HO
    IREJ-T023.HO M Iran_Fars.HO
    IREJ-T026.HO M Iran_Fars.HO
    IREJ-T027.HO M Iran_Fars.HO
    IREJ-T037.HO M Iran_Fars.HO
    IREJ-T040.HO M Iran_Fars.HO

And, you have file2.txt that contains:

   IREJ-T026.HO
   IREJ-T027.HO
   IREJ-T037.HO
   IREJ-T040.HO

And you want to append _B to every record in file.ind whose first field matches a record in file2.txt, you can do something like:

awk 'FNR == NR { a[$1]++; next } {if ($1 in a) { $0=$0 "_B" }; print}' file2.txt file.ind

How it Works

awk reads each file record-by-record (line-by-line) and applies each rule, written as pattern { action }, to every record in the order the rules appear. Above there are two rules:

  • FNR == NR { a[$1]++; next } runs only while the first file (file2.txt in the example) is being read. FNR is the record number within the current file and NR is the total number of records read so far, so the two are equal only for the first file. The action stores the first field of each record as an index in the array a and skips to the next record.
  • {if ($1 in a) { $0=$0 "_B" }; print} runs for file.ind (the 2nd file in the example). If the first field of the record matches an index of the array a, it appends _B to the end of the record; the record is then printed either way.

So you read the first field of every record in file2.txt into the array a as an index, and whenever the first field of a record in file.ind is one of those indexes, you append _B to that record.

Simply redirect the output to a new file to save the results, e.g. awk ... file2.txt file.ind > newfile.ind
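Because the command above appends _B to the whole record, a trailing blank at the end of a line in file.ind would end up between the population name and the suffix. A minimal sketch of a variant that appends to the third field instead is shown below. The function name tag_pop and the optional suffix argument are hypothetical, chosen only for illustration, and the sketch assumes the population name is always the third whitespace-separated field.

```shell
# Hypothetical helper (not from the original answer): tag_pop LIST IND [SUFFIX]
# Assumes the population name is the 3rd whitespace-separated field of IND.
tag_pop() {
    awk -v sfx="${3:-_B}" '
        FNR == NR { if (NF) keep[$1]; next }  # 1st file: remember each listed ID
        ($1 in keep) { $3 = $3 sfx }          # 2nd file: tag the population field only
        { print }                             # print every record, tagged or not
    ' "$1" "$2"
}

# Usage: tag_pop file2.txt file.ind > newfile.ind
```

Assigning to $3 makes awk rebuild the matched record with single spaces between fields, so matched lines lose any leading indentation, while unmatched lines are printed unchanged.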

Example Use/Output

$ awk 'FNR == NR { a[$1]++; next } {if ($1 in a) { $0=$0 "_B" }; print}' file2.txt file.ind
    I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
    I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
    IREJ-T006.HO M Iran_Fars.HO
    IREJ-T009.HO M Iran_Fars.HO
    IREJ-T022.HO M Iran_Fars.HO
    IREJ-T023.HO M Iran_Fars.HO
    IREJ-T026.HO M Iran_Fars.HO_B
    IREJ-T027.HO M Iran_Fars.HO_B
    IREJ-T037.HO M Iran_Fars.HO_B
    IREJ-T040.HO M Iran_Fars.HO_B

Let me know if that isn't what you are asking. I'm happy to help further.

Upvotes: 1

Timur Shtatland

Reputation: 12425

Use this Perl one-liner:

perl -lane 'BEGIN { %req = map { chomp; $_ => 1 } `cat required.txt`; } if ( $req{ $F[0] } ) { $F[2] .= "_B" } print join "\t", @F;' infile.ind > outfile.ind

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.

BEGIN { ... } : Execute the code inside this block before executing any of the implicit loops specified by the -n or other options.
%req = map { chomp; $_ => 1 } `cat required.txt`; : Read the entire required.txt file into hash %req, where the keys are each line (with terminal newline removed), and the values are 1.
if ( $req{ $F[0] } ) { $F[2] .= "_B" } : If the first field ($F[0]) was found in the %req hash, append _B to the third field ($F[2], or population).
print join "\t", @F; : Print all fields delimited by TAB.
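Note that chomp removes only the trailing newline, so an ID in required.txt that carries leading or trailing blanks (as the first line of the question's sample list appears to) would not match $F[0]. One possible way to normalise the list first is sketched below. The function name clean_ids and the output file name in the usage comment are hypothetical.

```shell
# Hypothetical helper (not from the original answer): clean_ids LIST
# Prints the first whitespace-delimited field of each non-blank line,
# which drops surrounding blanks; a trailing carriage return is removed too.
clean_ids() {
    awk '{ sub(/\r$/, "") } NF { print $1 }' "$1"
}

# Usage: clean_ids required.txt > required.clean.txt
```

The cleaned file can then be given to the one-liner in place of required.txt.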


Upvotes: 0
