Geroge
Geroge

Reputation: 561

Paste data together in bash

I show you an example what I need to do with my data. I have two text files separated by tab.

cat in1.tsv

111 A B C
111 D E F
111 G H I
222 A B C
333 A B C
333 D E F

This table can have about thousands of rows. Number of columns is less than 100. First column can have repeated vaules (like 111 and 333).

cat in2.tsv

111 a b c 
222 a b c 
333 d e f

In this file are appear values in column 1 only once. I need to merge those two files according its first column match.

cat output.tsv

111 A B C 111 a b c
111 D E F 111 a b c
111 G H I 111 a b c
222 A B C 222 a b c 
333 A B C 333 d e f
333 D E F 333 d e f 

My solution works if the size of matrix are the same:

paste  <(sort in1.tsv) <(sort in2.tsv) > output.tsv

I am appreciate any help in awk, bash or another programs that works fast for lot of rows.

Upvotes: 3

Views: 1062

Answers (5)

Eugene Yarmash
Eugene Yarmash

Reputation: 149873

In Python, not relying on the files being sorted:

#!/usr/bin/env python

with open("in1.tsv") as in1, open("in2.tsv") as in2:
    d = {line.split()[0]: line for line in in2}
    for line in in1:
        print(line.strip(), d[line.split()[0]], sep="\t", end="")

This basically creates a mapping from values of the first column to lines of in2.tsv, then loops over the lines of in1.tsv and combines them with the corresponding lines of in2.tsv using the mapping.

Upvotes: 2

potong
potong

Reputation: 58440

This might work for you (GNU sed):

 sed -r 's#^(\S+)\s.*#/^\1/s/$/ &/#' file2 | sed -f - file

Create a sed script from the second file. This script consists of a regexp that when matched appends the matched record from the second file to matched record from the first.

Upvotes: 2

ikkuh
ikkuh

Reputation: 4603

The join command seems to almost do what you want:

$ join in1.tsv in2.tsv
111 A B C a b c
111 D E F a b c
111 G H I a b c
222 A B C a b c
333 A B C d e f
333 D E F d e f

Default behavior is to join lines based on first column with whitespace as delimiter. Using the format option -o gives us the same result. Sorting is also required as Dmitry Polonskiy says in the comments:

join -o 1.1,1.2,1.3,1.4,2.1,2.2,2.3,2.4 <(sort in1.tsv) <(sort in2.tsv)

Upvotes: 2

gold_cy
gold_cy

Reputation: 14226

Here is a bash approach:

First let's sort each file:

LC_ALL=C sort init1.tsv -S75% -t$'\t' -k1,1 > init1.tsv.sorted

LC_ALL=C sort init2.tsv -S75% -t$'\t' -k1,1 > init2.tsv.sorted

Then instead of pasting lets join them by the first column,

join init1.tsv.sorted init2.tsv.sorted -1 1 -2 2 -t$'\t'

If you need a specific sort of join, this seems like a left outer join, then I would do this:

join init1.tsv.sorted init2.tsv.sorted -1 1 -2 2 -t$'\t' -a1

A quick note, -S specifies how much RAM you want to use, the faster you want this operation to go, the more you should use.

Upvotes: 2

Inian
Inian

Reputation: 85693

Awk to the rescue!

awk 'BEGIN{FS=OFS="\t"}FNR==NR{for(i=2;i<=NF;i++) map[$1]=(map[$1] FS $i); next}$1 in map{print $0,$1,map[$1]}' in2.tsv in1.tsv

produces the output in the tab-separated format as you expected. Remove the OFS="\t" if you don't want the o/p tab separation.

As far as the logic, create a map containing the values per column 1 on in2.csv into a hash-map map[] and then on in1.csv pick those lines containing $1 same as from the map formed and print the line contents.

Upvotes: 3

Related Questions