Ramirous

Reputation: 149

Replacing strings in one file with strings from second file

I've been searching for a couple of days, but I haven't found the right answer.

I have two files that look like this:

File1:

>contig-100_23331 length_200 read_count_4043 
TCAG...
>contig-100_23332 length_200 read_count_4508 
TTCA...
>contig-100_23333 length_200 read_count_184 
TTCC...

File2:

>contig-100_23331_Cov:_30.9135
>contig-100_23332_Cov:_125.591
>contig-100_23333_Cov:_5.97537

I want to replace the lines with the names (>contig... length...) in File1 with the lines with the names in File2. Note that File2 contains only the contig names (no sequence).

I suppose there's a way to do this with sed, but I can't find the solution.

Thanks in advance!

Upvotes: 1

Views: 124

Answers (2)

Jonathan Leffler

Reputation: 753805

One possibility is to use sed to create a sed-script from File2 that is then used on File1:

sed 's/^\(>contig-[0-9]*_[0-9]*\)_.*/s%^\1 %& %/' File2 > sed.script
sed -f sed.script File1 > File.Out
rm -f sed.script

For the sample File2, the sed.script would contain:

s%^>contig-100_23331 %>contig-100_23331_Cov:_30.9135 %
s%^>contig-100_23332 %>contig-100_23332_Cov:_125.591 %
s%^>contig-100_23333 %>contig-100_23333_Cov:_5.97537 %

For the sample File1, the output of the sed processing would be:

>contig-100_23331_Cov:_30.9135 length_200 read_count_4043 
TCAG...
>contig-100_23332_Cov:_125.591 length_200 read_count_4508 
TTCA...
>contig-100_23333_Cov:_5.97537 length_200 read_count_184 
TTCC...

Some versions of sed may have problems with 23k lines in the sed script. If that's a problem for you, you can generate sed.script, split it with split into smaller chunks (e.g. 1000 lines each), and then run sed -f chunk for each of the chunks in turn, feeding the output of one run into the next. That's painful, but it may be necessary. Historically, HP-UX (archaic versions, like HP-UX 9 or 10) had rather limited versions of sed that could only handle a few hundred commands in the sed script.
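The chunking workaround could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name apply_in_chunks, its arguments, and the temporary file names are hypothetical, and mktemp -d is assumed to be available (it is common but not required by POSIX). The sed command that builds the script is the one shown earlier in this answer.

```shell
# Hypothetical helper (not from the original answer):
#   apply_in_chunks NAMES_FILE FASTA_FILE OUT_FILE [COMMANDS_PER_CHUNK]
# Note: plain sh has no local variables, so these names are global.
apply_in_chunks() {
    names=$1
    fasta=$2
    out=$3
    per_chunk=${4:-1000}
    workdir=$(mktemp -d) || return 1

    # Build the full sed script from the names file.
    sed 's/^\(>contig-[0-9]*_[0-9]*\)_.*/s%^\1 %& %/' "$names" > "$workdir/sed.script"

    # Split it into pieces named chunk.aa, chunk.ab, ...
    split -l "$per_chunk" "$workdir/sed.script" "$workdir/chunk."

    # Run each chunk over the output of the previous pass.
    cp "$fasta" "$workdir/current"
    for chunk in "$workdir"/chunk.*; do
        [ -e "$chunk" ] || continue   # no chunks if the names file was empty
        sed -f "$chunk" "$workdir/current" > "$workdir/next" &&
            mv "$workdir/next" "$workdir/current"
    done

    cp "$workdir/current" "$out"
    rm -rf "$workdir"
}

# Example call, assuming File1 and File2 are in the current directory:
#   apply_in_chunks File2 File1 File.Out 1000
```

Each pass rereads the whole of File1, so this is slower than a single sed run; it only makes sense when the single run fails.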

Given that you're using bash, you can avoid the explicit intermediate file with process substitution:

sed -f <(sed 's/^\(>contig-[0-9]*_[0-9]*\)_.*/s%^\1 %& %/' File2) File1 > File.Out

However, you should check that the generated script is correct (for example, by writing it to a file and inspecting it) before using that notation.
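One way to do that validation, as a rough sketch: write the generated script to a file and list any line that does not look like one of the expected s%...%...% commands. A line in File2 that does not match the name pattern passes through the generating sed unchanged and would end up in the script as a bogus command. The function name check_sed_script is hypothetical, and the pattern assumes names of the form shown in the sample File2.

```shell
# Hypothetical helper (not from the original answer):
#   check_sed_script SCRIPT_FILE
# Prints every line that is not a well-formed substitution command and
# returns 1 if there is at least one such line, 0 otherwise.
check_sed_script() {
    # grep -v selects the lines that do NOT match the expected shape;
    # its exit status is 0 exactly when it printed something.
    if grep -v '^s%\^>contig-[0-9]*_[0-9]* %>contig-[^%]* %$' "$1"; then
        echo "unexpected lines in $1 (listed above)" >&2
        return 1
    fi
}

# Example use, assuming File1 and File2 are in the current directory:
#   sed 's/^\(>contig-[0-9]*_[0-9]*\)_.*/s%^\1 %& %/' File2 > sed.script
#   check_sed_script sed.script && sed -f sed.script File1 > File.Out
```

Once the script has been checked on real data, the process substitution form is a reasonable shortcut.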

Upvotes: 2

Nabheet

Reputation: 1314

DISCLAIMER: I've never done this ...

You might want to use the join command to merge the files.

You may have to produce an intermediary file or stream for FILE2 which has an extra empty line so that two lines match in both files.

Hope this helps.

Upvotes: 0
