jianfeng.mao
jianfeng.mao

Reputation: 945

how to subset a file - select a numbers of rows or columns

I would like to have your advice/help on how to subset a big file (millions of rows or lines).

For example,

(1) I have big file (millions of rows, tab-delimited). I want to a subset of this file with only rows from 10000 to 100000.

(2) I have big file (millions of columns, tab-delimited). I want to a subset of this file with only columns from 10000 to 100000.

I know there are tools like head, tail, cut, split, and awk or sed. I can use them to do simple subsetting. But, I do not know how to do this job.

Could you please give any advice? Thanks in advance.

Upvotes: 25

Views: 57157

Answers (4)

Warren
Warren

Reputation: 41

For the first problem, selecting a set of rows from a large file, piping tail to head is very simple. You want 90000 rows from largefile starting at row 10000. tail grabs the back end of largefile starting at row 10000 and then head chops off all but the first 90000 rows.

tail -n +10000 largefile | head -n 90000 -

Upvotes: 4

Fredrik Pihl
Fredrik Pihl

Reputation: 45662

Was beaten to it for the sed solution, so I'll post a perl dito instead. To print selected lines.

$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20' 
10
11
12
13
14
15
16
17
18
19
20

To print selective columns, use

perl -lane 'print $F[1] .. $F[3] '

-F is used in conjunction with -a, to choose the delimiter on which to split lines.

To test, use seq and paste to get generate some columns

$ seq 50 | paste - - - - -
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25
26  27  28  29  30
31  32  33  34  35
36  37  38  39  40
41  42  43  44  45
46  47  48  49  50

Lets's print everything except the first and the last column

$ seq 50 | paste - - - - - | perl -lane 'print join "   ", $F[1] .. $F[3]'
2   3   4
7   8   9
12  13  14
17  18  19
22  23  24
27  28  29
32  33  34
37  38  39
42  43  44
47  48  49

In the join statement above, there is a tab, you get it by doing a ctrl-v tab.

Upvotes: 3

Drakosha
Drakosha

Reputation: 12165

Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000  && NR <= 100000 { print }'

Filtering columns is easier with CUT:

cat largefile | cut -d '\t' -f 10000-100000

As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added you can improve performance using:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -d '\t' -f 10000-100000 largefile 

Upvotes: 34

Vijay
Vijay

Reputation: 67221

Some different solutions:

For row ranges: In sed :

sed -n 10000,100000p somefile.txt

For column ranges in awk:

awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt

Upvotes: 18

Related Questions