Qgenerator

Reputation: 323

What file format will let me search for strings in the file extremely quickly?

I have a 100GB file of random strings of text between 4 and 200 characters long, one on each line.

Ideally, I want to find any line that contains a given string anywhere within it, e.g. searching for "test" should match "footestbar", if that's possible.

Otherwise, I'd be happy with being able to find lines/records that start with a given substring, e.g. "foo" finds "footestbar" but not "testbarfoo".

I was thinking of sorting the file once and then recording the byte positions where lines beginning with "a" start, where lines beginning with "b" start, etc. This would let me jump straight to the right section and reduce the search time. I could improve on this further by recording the positions where every three-character prefix starts, but something tells me there's a better way.
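
For the prefix case, a sorted file may not need a separate position table at all: a binary search over byte offsets can find the first matching line directly. Below is a minimal sketch, assuming the file is sorted bytewise (e.g. with `LC_ALL=C sort`), is UTF-8 encoded, and ends each record with `\n`. The function name `find_prefix` is hypothetical, chosen just for illustration.

```python
import os


def find_prefix(path, prefix):
    """Return every line of a bytewise-sorted file that starts with prefix."""
    key = prefix.encode("utf-8")

    def seek_line_start(f, pos):
        # Position f at the first line start located at or after byte pos.
        if pos == 0:
            f.seek(0)
        else:
            f.seek(pos - 1)
            f.readline()  # consume up to and including the next newline

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line is >= key (or EOF).
        while lo < hi:
            mid = (lo + hi) // 2
            seek_line_start(f, mid)
            line = f.readline()
            if not line or line.rstrip(b"\n") >= key:
                hi = mid
            else:
                lo = mid + 1

        # All lines sharing the prefix are contiguous from this point on.
        seek_line_start(f, lo)
        matches = []
        for line in f:
            record = line.rstrip(b"\n")
            if not record.startswith(key):
                break
            matches.append(record.decode("utf-8"))
        return matches
```

Each probe costs one seek plus a short read, so a lookup needs roughly log2(file size) probes regardless of how many prefixes exist. This only covers the "starts with" case; it does not help with finding "test" inside "footestbar".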

Upvotes: 2

Views: 64

Answers (1)

Cristóbal Ganter

Reputation: 119

I think a good start could be to generate a DAFSA (deterministic acyclic finite state automaton) from the strings. You will probably have to combine it with a graph file format to store it on disk.
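
To illustrate the idea, here is a rough in-memory sketch that builds a trie and then merges identical subtrees, which yields a DAFSA. The names `Node`, `build_trie`, `minimize`, `contains` and `count_nodes` are hypothetical, and this batch approach is only an assumption about how one might start; a 100GB input would need an incremental construction and an on-disk representation instead.

```python
class Node:
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}     # character -> child Node
        self.final = False  # True if a stored string ends at this node


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up; returns the canonical node."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Two nodes are equivalent if they agree on finality and on their
    # (label, canonical child) pairs.
    signature = (
        node.final,
        tuple(sorted((ch, id(child)) for ch, child in node.edges.items())),
    )
    return registry.setdefault(signature, node)


def contains(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)
```

With strings such as "footestbar" and "bartestbar", the shared tail "estbar" ends up stored once, which is where the space saving over a plain trie comes from. Note that walking the automaton from the root answers exact and prefix lookups; it does not by itself answer substring queries.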

Upvotes: 1
