Reputation: 11
I have two large files. One file has a list of lines that I am grepping in another file. I found that the following code works great to pull out matching lines in File2 along with the 3 lines after the match in a short test file.
When I try to run the same code on large files, 15 million lines for File1 and 63 million lines for File2, the process is taking, understandably, a very long time.
Is there anyway to do this faster?
Code:
grep -A 3 -xf File1 File2
File1 data example:
@M02465:48:000000000-A94WY:1:1101:15033:1350 1:N:0:0
@M02465:48:000000000-A94WY:1:1101:16062:1339 1:N:0:0
@M02465:48:000000000-A94WY:1:1101:15860:1331 1:N:0:0
@M02465:48:000000000-A94WY:1:1101:15810:1334 1:N:0:0
File2 data example:
@M02465:48:000000000-A94WY:1:1101:15860:1331 1:N:0:0
TGAGTCACTGGT
+
BBCBBFFFFFFD
@M02465:48:000000000-A94WY:1:1101:15655:1332 1:N:0:0
TCCGACACAATT
+
ABB3ADDBFAFF
@M02465:48:000000000-A94WY:1:1101:15831:1332 1:N:0:0
GACTTGGTATTC
+
A111>1C113B@
@M02465:48:000000000-A94WY:1:1101:15598:1332 1:N:0:0
CCTCGTTCGACT
+
BCCCCDDFCBCD
@M02465:48:000000000-A94WY:1:1101:15810:1334 1:N:0:0
GCTGCTGAGCAT
+
>111111BF111
@M02465:48:000000000-A94WY:1:1101:15895:1334 1:N:0:0
CCTCGTTCGACT
+
>A1>>1>C11?>
@M02465:48:000000000-A94WY:1:1101:16015:1334 1:N:0:0
AATCAGTCTCGT
+
AAAA?@B@BD1>
@M02465:48:000000000-A94WY:1:1101:15715:1335 1:N:0:0
AATCAGTCTCGT
+
BCBCCFFFFFFC
@M02465:48:000000000-A94WY:1:1101:15455:1335 1:N:0:0
AGGCTACACGAC
+
AABAAFFFFBBB
Upvotes: 1
Views: 114
Reputation: 5861
Use fgrep
instead of grep
. Much faster if you don't need to match regular expressions:
fgrep -A 3 -xf File1 File2
Making sure your pattern file File1
doesn't contain duplicates might help too:
sort -u File1 >File1_new
Upvotes: 3