tech2nick
tech2nick

Reputation: 93

Joining two csv files in bash

I have to join two files by values in one column. I need to use unix bash.

My first file looks like this:

user_id, song_id, timestamp
00001638d6189236866af9bbf309ae6c2347ffdc,SOBBMDR12A8C13253B,1203083335
00001638d6189236866af9bbf309ae6c2347ffdc,SOBXALG12A8C13C108,984663773
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,SODDNQT12A6D4F5F7E,1275071044
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,SODDNQT12A6D4F5F7E,1097509573

Second file:

user_id, natural_key
00000b722001882066dff9d2da8a775658053ea0,6944471
00001638d6189236866af9bbf309ae6c2347ffdc,19309784
0000175652312d12576d9e6b84f600caa24c4715,10435505
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,5232769

Of course both files have many more rows. I would like to join both files by first column (user_id) and get this result:

natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573

I tried to do something with join and awk but to no avail. Could anyone help?

Upvotes: 2

Views: 1121

Answers (3)

Cyrus
Cyrus

Reputation: 88646

With GNU join, sed, sort and bash:

echo "natural_key, song_id, timestamp"
join -t, <(sed '1d' file1 |sort -t, -k1,1) <(sed '1d' file2 | sort -t, -k1,1) -o 2.2,1.2,1.3

Output:

natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1097509573
5232769,SODDNQT12A6D4F5F7E,1275071044

Upvotes: 4

agc
agc

Reputation: 8406

Using the mlr util:

mlr --csvlite  join -j user_id -f f1.csv \
         then  cut -o -f ' natural_key',' song_id',' timestamp' f2.csv

Output:

 natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573

Note the leading spaces in the headers. These are left intact here because:

  1. Most of the source data headers have leading spaces, but the data does not.

  2. The leading spaces, if unquoted, will fail with most CSV oriented utils.

Upvotes: 0

James Brown
James Brown

Reputation: 37404

This one in GNU awk (regex FS). That header spacing in your example I'm just going to ignore:

$ awk 'BEGIN{FS=", ?";OFS=","}NR==FNR{a[$1]=$2;next}$1 in a{print a[$1],$2,$3}' file2 file1
natural_key,song_id,timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573

Explained:

$ awk '
BEGIN   { FS=", ?"; OFS="," }    # set the delimiters
NR==FNR { a[$1]=$2; next }       # hash the first file in paramaters 
$1 in a { print a[$1], $2, $3 }  # if key is found in hash, output
' file2 file1                    # mind the order

Upvotes: 1

Related Questions