Goutham Mullaguru
Goutham Mullaguru

Reputation: 41

How to implement efficient string interning in f#?

What is to implement a custom string type in f# for interning strings. i have to read large csv files into memory. Given most of the columns are categorical, values are repeating and it makes sense to create new string first time it is encountered and only refer to it on subsequent occurrences to save memory.

In c# I do this by creating a global intern pool (concurrent dict) and before setting a value, lookup the dictionary if it already exists. if it exists, just point to the string already in the dictionary. if not, add it to the dictionary and set the value to the string just added to dictionary.

New to f# and wondering what is the best way to do this in f#. will be using the new string type in records named tuples etc and it will have to work with concurrent processes.

Edit: String.Intern uses the Intern Pool. My understanding is, it is not very efficient for large pools and is not garbage collected i.e. any/all interned strings will remain in intern pool for lifetime of the app. Imagine a an application where you read a file, perform some operations and write data. Using Intern Pool solution will probably work. Now imagine you have to do the same 100 times and the strings in each file have little in common. If the memory is allocated on heap, after processing each file, we can force garbage collector to clear unnecessary strings.

I should have mentioned I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it in F#)

Memorisation pattern is slightly different from what I am looking for? We are not caching calculated results - we are ensuring each string object is created no more than once and all subsequent creations of same string are just references to the original. Using a dictionary to do this is a one way and using String.Intern is other.

sorry if is am missing something obvious here.

Upvotes: 4

Views: 326

Answers (1)

Bent Tranberg
Bent Tranberg

Reputation: 3470

I have a few things to say, so I'll post them as an answer.

First, I guess String.Intern works just as well in F# as in C#.

let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y) // false
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true

Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); after the string is ready following input from file?

To create a type for use in your business domain to handle string deduplication in general is a very bad idea. You don't want your business domain polluted by that kind of low level stuff.

As for rolling your own. I did that some years ago, probably to avoid that problem you mentioned with the strings not being garbage collected, but I never tested if that actually was a problem.

It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)

It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.

  • The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
  • Releasing the dictionaries after the file is read and structures filled, will have the advantage of not holding on to strings when they are no longer really needed. And of course you save memory by not holding onto the dictionaries.

I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues. If you roll your own with the design suggested, you would not use it concurrently. Each file being read would have its own set of dictionaries for its columns.

Upvotes: 1

Related Questions