Reputation: 3542
I have two large tab-separated files, A.tsv and B.tsv. They look like this (the header line is not in the file):
A.tsv:
ID AGE
User1 18
...
B.tsv:
ID INCOME
User4 49000
...
I want to select the list of IDs in A such that 10 <= AGE <= 20, and then select the rows in B whose ID is in that list. I want to use GNU parallel for this. My attempt has two steps:
cat A.tsv | parallel --pipe -q awk '{ if ($2 >= 10 && $2 <= 20) print $1}' > list.tsv
cat list.tsv | parallel --pipe -q xargs -I% awk 'FNR==NR{a[$1];next}($1 in a)' % B.tsv > result.tsv
The first step works, but the second one fails with an error like:
awk: cannot open User1 (No such file or directory)
How can I fix this? Does this method still work if A.tsv and list.tsv are 2 to 3 times larger than the available memory?
Upvotes: 4
Views: 2148
Reputation: 1
I know this question is old (yes, I saw that it was asked 8 years and 3 months ago).
My solution uses only xargs and awk: a single line, no intermediate file, and no new tool to install.
awk '{if ($2 >= 10 && $2 <= 20) print $1}' A.tsv | xargs -I myItem awk --assign quebuscar=myItem '$1==quebuscar {print}' B.tsv
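Note that the one-liner above starts one awk process per matching ID and rescans B.tsv each time. A possible alternative, sketched here under the assumption that the set of selected IDs fits in memory, is to let a single awk invocation read both files. The sample rows and the result.tsv name below are made up for illustration; only the file layout (ID in column 1, AGE in column 2 of A.tsv) comes from the question.

```shell
# Hypothetical sample data standing in for the real A.tsv and B.tsv.
printf 'User1\t18\nUser2\t35\nUser4\t12\n' > A.tsv
printf 'User1\t30000\nUser2\t52000\nUser4\t49000\n' > B.tsv

# While reading the first file (FNR == NR), remember IDs with 10 <= AGE <= 20.
# While reading the second file, print rows whose ID was remembered.
awk -F'\t' 'FNR == NR { if ($2 >= 10 && $2 <= 20) keep[$1]; next }
            $1 in keep' A.tsv B.tsv > result.tsv

cat result.tsv
```

Both files are streamed once, so only the selected IDs are held in memory, not the files themselves.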
Upvotes: 0
Reputation: 177
$ for I in $(seq 8 2 22); do echo -e "User$I\t$I" >> A.txt; done; cat A.txt
User8 8
User10 10
User12 12
User14 14
User16 16
User18 18
User20 20
User22 22
$ for I in $(seq 8 2 22); do echo -e "User$I\t100${I}00" >> B.txt; done; cat B.txt
User8 100800
User10 1001000
User12 1001200
User14 1001400
User16 1001600
User18 1001800
User20 1002000
User22 1002200
$ cat A.txt | parallel --pipe -q awk '{if ($2 >= 10 && $2 <= 20) print $1}' > list.txt
$ cat B.txt | parallel --pipe -q grep -f list.txt
User10 1001000
User12 1001200
User14 1001400
User16 1001600
User18 1001800
User20 1002000
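One caveat: grep -f treats each line of list.txt as a pattern that may match anywhere in a line, so an ID such as User1 would also select User10. The sample above happens not to contain such a pair. A minimal sketch of a stricter match, assuming a grep that supports -w (GNU and BSD grep do) and using made-up sample rows and a hypothetical exact.txt output file:

```shell
# Hypothetical data where one ID is a prefix of another.
printf 'User1\nUser20\n' > list.txt
printf 'User1\t100\nUser10\t200\nUser20\t300\nUser200\t400\n' > B.txt

# Plain -f matches substrings, so all four rows are selected here.
grep -c -f list.txt B.txt

# -F treats the IDs as fixed strings, -w requires whole-word matches.
grep -w -F -f list.txt B.txt > exact.txt
cat exact.txt
```

This still matches a listed ID appearing in any column, so it is only safe when the other columns cannot contain ID-like words.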
Upvotes: 4