dMac

Reputation: 11

Return n number of lines after grep match from large files

I have two large files. File1 contains a list of lines that I want to find in File2. On a short test file, the following command works well: it prints each matching line in File2 along with the 3 lines after the match.

When I run the same command on the real files (15 million lines in File1 and 63 million lines in File2), it understandably takes a very long time.

Is there any way to do this faster?

Code:

grep -A 3 -xf File1 File2

File1 data example:

@M02465:48:000000000-A94WY:1:1101:15033:1350 1:N:0:0
@M02465:48:000000000-A94WY:1:1101:16062:1339 1:N:0:0
@M02465:48:000000000-A94WY:1:1101:15860:1331 1:N:0:0
@M02465:48:000000000-A94WY:1:1101:15810:1334 1:N:0:0

File2 data example:

@M02465:48:000000000-A94WY:1:1101:15860:1331 1:N:0:0
TGAGTCACTGGT
+
BBCBBFFFFFFD
@M02465:48:000000000-A94WY:1:1101:15655:1332 1:N:0:0
TCCGACACAATT
+
ABB3ADDBFAFF
@M02465:48:000000000-A94WY:1:1101:15831:1332 1:N:0:0
GACTTGGTATTC
+
A111>1C113B@
@M02465:48:000000000-A94WY:1:1101:15598:1332 1:N:0:0
CCTCGTTCGACT
+
BCCCCDDFCBCD
@M02465:48:000000000-A94WY:1:1101:15810:1334 1:N:0:0
GCTGCTGAGCAT
+
>111111BF111
@M02465:48:000000000-A94WY:1:1101:15895:1334 1:N:0:0
CCTCGTTCGACT
+
>A1>>1>C11?>
@M02465:48:000000000-A94WY:1:1101:16015:1334 1:N:0:0
AATCAGTCTCGT
+
AAAA?@B@BD1>
@M02465:48:000000000-A94WY:1:1101:15715:1335 1:N:0:0
AATCAGTCTCGT
+
BCBCCFFFFFFC
@M02465:48:000000000-A94WY:1:1101:15455:1335 1:N:0:0
AGGCTACACGAC
+
AABAAFFFFBBB

Upvotes: 1

Views: 114

Answers (1)

lcd047

Reputation: 5861

Use fgrep (equivalent to grep -F) instead of grep. It is much faster if you don't need to match regular expressions, because the patterns are treated as fixed strings:

fgrep -A 3 -xf File1 File2

Making sure your pattern file File1 doesn't contain duplicates might help too:

sort -u File1 >File1_new
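
The two suggestions can be combined into one step. The following is only a minimal sketch: the function name `extract_records` and the temporary file are hypothetical, `grep -F` is used as the standard spelling of `fgrep`, and setting `LC_ALL=C` is an extra assumption that the data is plain ASCII (byte-wise comparison is usually faster for both `sort` and `grep`).

```shell
#!/bin/sh
# Hypothetical helper: print every line of the data file ($2) that exactly
# matches a line of the pattern file ($1), plus the 3 lines after each match.
extract_records() {
    er_patterns=$1
    er_data=$2
    er_dedup=$(mktemp) || return 1

    # Drop duplicate patterns so grep has fewer strings to load.
    LC_ALL=C sort -u "$er_patterns" > "$er_dedup"

    # -F: fixed strings (no regex), -x: match whole lines only,
    # -A 3: also print the 3 lines following each match.
    LC_ALL=C grep -F -x -A 3 -f "$er_dedup" "$er_data"
    er_status=$?

    rm -f "$er_dedup"
    return "$er_status"
}
```

It could be called as `extract_records File1 File2 > matches.txt`. Note that grep prints a `--` line between groups of output that are not adjacent in File2; with GNU grep, adding `--no-group-separator` suppresses those lines.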

Upvotes: 3
