Reputation: 87
a.txt contains 500,000 columns and 2,000 rows. The example below shows only the first 9 columns of this file. The first row is a header.
chromosome SNPID rsid position alleleA alleleB 2409086 3514581 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0 0
b.txt contains 45,000 lines, one column name per line. I want to extract the columns of a.txt whose names are listed in b.txt.
chromosome
SNPID
rsid
position
alleleA
alleleB
2409086
3635346
c.txt is my expected outcome. It should be a space-separated table with 45,000 columns and 2,000 rows.
chromosome SNPID rsid position alleleA alleleB 2409086 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0
I tried to use cut to solve this problem, but it fails with "Argument list too long" (since I need to extract 45,000 columns). I know awk may solve this problem, but I am not familiar with awk and did not find an existing answer about it. Does anybody have a solution?
cut -f 1,$(
    head -n1 a.txt |
    tr ' ' '\n' |
    grep -nf b.txt |
    sed ':a;$!N;s/:[^\n]*\n/,/;ta;s/:.*//'
) a.txt > c.txt
-bash: /usr/bin/cut: Argument list too long
Update: Thank you guys for the great solutions.
Upvotes: 6
Views: 179
Reputation: 17178
You can use Miller (available here for many OSs) for this task. Unfortunately, b.txt is not in the format that Miller expects, so you'll first have to transpose it to a single-line CSV, for example with:
edit: thanks @glennjackman for the simpler solution:
paste -s -d ',' b.txt > b.CSV
Now you just have to pass it to Miller as a "template" file:
mlr --csvlite --ifs=' ' template -t b.CSV a.txt
Remark: I tried to provide the template file as a process substitution <(...) instead of a regular file, but it doesn't work.
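To produce the expected c.txt (space-separated, written to a file), the whole pipeline might look like this sketch; --ofs=' ' is my addition here, since --csvlite would otherwise write comma-separated output:
paste -s -d ',' b.txt > b.CSV
mlr --csvlite --ifs=' ' --ofs=' ' template -t b.CSV a.txt > c.txt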
Upvotes: 1
Reputation: 2855
You don't need loops or Python or anything messy. To demonstrate, I created a random file of column names and selected the names ending with lowercase letters; the trick is to generate a custom awk command on the fly by reading only the column-names file, so there is a single pass over each input file.
echo; gdate; echo                       # timestamp (gdate = GNU date on macOS)

f='testtest_colnames.txt'
gcat -n "${f}"                          # show the column-names file, numbered

# Generate "{ print $i, $j, ... }" from the names file: for every name
# ending in a lowercase letter, emit its line number (= column number)
# as a field reference, preceded by a comma for all but the first.
___="$( mawk ' BEGIN { printf("{ print ")
             }   END { printf(" }")
             } /[a-z]$/ { printf("%.*s $%.u", !!_++, ",", NR) }' "${f}" )"

# Run the generated program; -p- prints gawk's profile (i.e. the
# pretty-printed generated rule) so we can inspect what was built.
gawk -p- -be "${___}" /dev/null
Tue Dec 6 06:26:17 EST 2022
1 QRSTUV
2 ABCDEF
3 OPQRST
4 cdefgh
5 IJKLMN
6 ghijkl
7 efghij
8 uvwxyz
9 qrstuv
10 wxyzAB
11 stuvwx
12 wxyz
13 KLMNOP
14 yzABCD
15 ijklmn
16 MNOPQR
17 abcdef
18 UVWXYZ
19 YZabcd
20 opqrst
21 klmnop
22 mnopqr
23 STUVWX
24 GHIJKL
25 EFGHIJ
26 WXYZab
27 CDEFGH
# gawk profile, created Tue Dec 6 06:26:17 2022
# Rule(s)
{
print $4, $6, $7, $8, $9, $11, $12, $15, $17, $19, $20, $21, $22, $26
}
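The same generate-then-run idea can be adapted to the question's actual files. Here is a minimal sketch (my adaptation, not part of the demo above): it matches a.txt's header against the names in b.txt, writes the generated program to a file, and runs it with awk -f, so nothing hits the argument-length limit that broke cut. Note it keeps the columns in a.txt's order, not b.txt's.
head -n1 a.txt | tr ' ' '\n' |
awk 'NR==FNR { want[$1]; next }                # b.txt: names to keep
     $1 in want { printf "%s$%d", (n++ ? "," : "{ print "), FNR }
     END { print " }" }' b.txt - > prog.awk    # FNR == column number
awk -f prog.awk a.txt > c.txt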
Upvotes: 0
Reputation: 36590
I suggest trying GNU datamash if you are allowed to install and use tools other than awk, though I am not sure how it would cope with such a big file and so many columns. It has a cut operation, named after the cut command, but unlike cut it understands headers and preserves the requested column order. As a simple example, let file.txt content be
alleleA alleleB alleleC
A C G
T A C
G T A
then
datamash --field-separator=' ' --headers cut "alleleC,alleleA" < file.txt
gives output
cut(alleleC) cut(alleleA)
G A
C T
A G
Observe that the requested order was applied, and that cut( appears in the output headers because that was the action performed. If this is not acceptable, you might remove the cut( and ) using e.g. sed, as long as there are no parentheses in the column names (see the sketch below).
(tested in GNU datamash 1.7)
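For instance, a sketch that strips the wrapper from the header line only (it assumes, as noted, that the column names contain no parentheses):
datamash --field-separator=' ' --headers cut "alleleC,alleleA" < file.txt |
  sed '1s/cut(//g; 1s/)//g'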
Upvotes: 0
Reputation: 203899
Using any awk:
$ cat tst.awk
NR == FNR {                         # first file (b.txt): save the wanted
    out2tag[++numOutFlds] = $1      # field names, in output order
    next
}
FNR == 1 {                          # header line of a.txt: map each
    for ( inFldNr=1; inFldNr<=NF; inFldNr++ ) {
        tag2in[$inFldNr] = inFldNr  # column name to its position
    }
}
{                                   # every line of a.txt, header included:
    for ( outFldNr=1; outFldNr<=numOutFlds; outFldNr++ ) {
        tag = out2tag[outFldNr]
        inFldNr = tag2in[tag]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk b.txt a.txt
chromosome SNPID rsid position alleleA alleleB 2409086 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0
Note with this approach that the output fields are printed in the order their names appear in b.txt, so you don't have to produce output in the same order as the input fields in a.txt. This is the main functional difference between the above and @GlennJackman's 2nd solution; theirs will also be very slightly faster, as mine requires one extra hash lookup per field.
Upvotes: 2
Reputation: 247002
With awk:
Suppose this is filter.awk:
NR == FNR {             # reading the first file (b.txt)
    wanted[$1] = 1
    next
}
FNR == 1 {              # header line of a.txt
    for (i=1; i<=NF; i++) {
        header[i] = $i
    }
}
{
    for (i=1; i<=NF; i++) {
        if (header[i] in wanted) {
            printf "%s ", $i
        }
    }
    print ""
}
Then, given your sample a.txt and
$ cat b.txt
chromosome
rsid
2409086
we get
$ awk -f filter.awk b.txt a.txt
chromosome rsid 2409086
1 rs12117661 1
1 rs11588151 1
1 rs34232196 1
1 rs4500361 1
1 rs4927191 0.894118
1 rs200159426 0.894118
...
This will be a bit quicker, since it doesn't have to iterate over all the columns of each record:
NR == FNR {             # reading the first file (b.txt)
    wanted[$1] = 1
    next
}
FNR == 1 {              # header line: record the wanted column numbers
    n = 0
    for (i=1; i<=NF; i++) {
        if ($i in wanted) {
            cols_to_print[++n] = i
        }
    }
}
{
    for (i=1; i<=n; i++) printf "%s ", $(cols_to_print[i])
    print ""
}
Upvotes: 4