Reputation: 561
I show you an example what I need to do with my data. I have two text files separated by tab.
cat in1.tsv
111 A B C
111 D E F
111 G H I
222 A B C
333 A B C
333 D E F
This table can have about thousands of rows. Number of columns is less than 100. First column can have repeated vaules (like 111 and 333).
cat in2.tsv
111 a b c
222 a b c
333 d e f
In this file are appear values in column 1 only once. I need to merge those two files according its first column match.
cat output.tsv
111 A B C 111 a b c
111 D E F 111 a b c
111 G H I 111 a b c
222 A B C 222 a b c
333 A B C 333 d e f
333 D E F 333 d e f
My solution works if the size of matrix are the same:
paste <(sort in1.tsv) <(sort in2.tsv) > output.tsv
I am appreciate any help in awk, bash or another programs that works fast for lot of rows.
Upvotes: 3
Views: 1062
Reputation: 149873
In Python, not relying on the files being sorted:
#!/usr/bin/env python
with open("in1.tsv") as in1, open("in2.tsv") as in2:
d = {line.split()[0]: line for line in in2}
for line in in1:
print(line.strip(), d[line.split()[0]], sep="\t", end="")
This basically creates a mapping from values of the first column to lines of in2.tsv
, then loops over the lines of in1.tsv
and combines them with the corresponding lines of in2.tsv
using the mapping.
Upvotes: 2
Reputation: 58440
This might work for you (GNU sed):
sed -r 's#^(\S+)\s.*#/^\1/s/$/ &/#' file2 | sed -f - file
Create a sed script from the second file. This script consists of a regexp that when matched appends the matched record from the second file to matched record from the first.
Upvotes: 2
Reputation: 4603
The join
command seems to almost do what you want:
$ join in1.tsv in2.tsv
111 A B C a b c
111 D E F a b c
111 G H I a b c
222 A B C a b c
333 A B C d e f
333 D E F d e f
Default behavior is to join lines based on first column with whitespace as delimiter. Using the format option -o
gives us the same result. Sorting is also required as Dmitry Polonskiy says in the comments:
join -o 1.1,1.2,1.3,1.4,2.1,2.2,2.3,2.4 <(sort in1.tsv) <(sort in2.tsv)
Upvotes: 2
Reputation: 14226
Here is a bash
approach:
First let's sort each file:
LC_ALL=C sort init1.tsv -S75% -t$'\t' -k1,1 > init1.tsv.sorted
LC_ALL=C sort init2.tsv -S75% -t$'\t' -k1,1 > init2.tsv.sorted
Then instead of pasting
lets join
them by the first column,
join init1.tsv.sorted init2.tsv.sorted -1 1 -2 2 -t$'\t'
If you need a specific sort of join, this seems like a left outer join, then I would do this:
join init1.tsv.sorted init2.tsv.sorted -1 1 -2 2 -t$'\t' -a1
A quick note, -S
specifies how much RAM you want to use, the faster you want this operation to go, the more you should use.
Upvotes: 2
Reputation: 85693
Awk
to the rescue!
awk 'BEGIN{FS=OFS="\t"}FNR==NR{for(i=2;i<=NF;i++) map[$1]=(map[$1] FS $i); next}$1 in map{print $0,$1,map[$1]}' in2.tsv in1.tsv
produces the output in the tab-separated format as you expected. Remove the OFS="\t"
if you don't want the o/p tab separation.
As far as the logic, create a map containing the values per column 1 on in2.csv
into a hash-map map[]
and then on in1.csv
pick those lines containing $1
same as from the map formed and print the line contents.
Upvotes: 3