Reputation: 93
I have to join
two files by values in one column. I need to use unix bash
.
My first file looks like this:
user_id, song_id, timestamp
00001638d6189236866af9bbf309ae6c2347ffdc,SOBBMDR12A8C13253B,1203083335
00001638d6189236866af9bbf309ae6c2347ffdc,SOBXALG12A8C13C108,984663773
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,SODDNQT12A6D4F5F7E,1275071044
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,SODDNQT12A6D4F5F7E,1097509573
Second file:
user_id, natural_key
00000b722001882066dff9d2da8a775658053ea0,6944471
00001638d6189236866af9bbf309ae6c2347ffdc,19309784
0000175652312d12576d9e6b84f600caa24c4715,10435505
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,5232769
Of course both files have many more rows. I would like to join
both files by first column (user_id) and get this result:
natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573
I tried to do something with join
and awk
but to no avail. Could anyone help?
Upvotes: 2
Views: 1121
Reputation: 88646
With GNU join, sed, sort and bash:
echo "natural_key, song_id, timestamp"
join -t, <(sed '1d' file1 |sort -t, -k1,1) <(sed '1d' file2 | sort -t, -k1,1) -o 2.2,1.2,1.3
Output:
natural_key, song_id, timestamp 19309784,SOBBMDR12A8C13253B,1203083335 19309784,SOBXALG12A8C13C108,984663773 5232769,SODDNQT12A6D4F5F7E,1097509573 5232769,SODDNQT12A6D4F5F7E,1275071044
Upvotes: 4
Reputation: 8406
Using the mlr
util:
mlr --csvlite join -j user_id -f f1.csv \
then cut -o -f ' natural_key',' song_id',' timestamp' f2.csv
Output:
natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573
Note the leading spaces in the headers. These are left intact here because:
Most of the source data headers have leading spaces, but the data does not.
The leading spaces, if unquoted, will fail with most CSV oriented utils.
Upvotes: 0
Reputation: 37404
This one in GNU awk (regex FS
). That header spacing in your example I'm just going to ignore:
$ awk 'BEGIN{FS=", ?";OFS=","}NR==FNR{a[$1]=$2;next}$1 in a{print a[$1],$2,$3}' file2 file1
natural_key,song_id,timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573
Explained:
$ awk '
BEGIN { FS=", ?"; OFS="," } # set the delimiters
NR==FNR { a[$1]=$2; next } # hash the first file in paramaters
$1 in a { print a[$1], $2, $3 } # if key is found in hash, output
' file2 file1 # mind the order
Upvotes: 1