Siscia

Reputation: 1491

Improve throughput writing a lot of small files in C

I want to improve the throughput of a piece of software that writes several, usually small, files into a network-attached volume.

The volume is limited to 100 IOPS and 80 MB/s of bandwidth.

At the moment I saturate the 100 IOPS, but the bandwidth is very far from the reachable 80 MB/s: roughly 4 MB/s, often even less.

I believe the main issue is that we make a lot of small requests; those small requests saturate the IOPS while the bandwidth is left largely unexploited.

The software is written in C and I control pretty much everything down to the actual write syscall.

At the moment the architecture is multithreaded, with several threads working as "spoolers" and making synchronous write calls, each for a different file.

So suppose we have files a, b and c and threads t1, t2 and t3.

t1 will open a and call something like write(fd_a, buff_a, 1024) in a loop, and t2 (write(fd_b, buff_b, 1024)) and t3 (write(fd_c, buff_c, 1024)) will do the same.
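
For illustration, a minimal sketch of what each spooler thread does today (the path, buffer and 1 KiB chunk size are just placeholders):

    /* Minimal sketch of the current per-thread pattern: each spooler
     * streams one file in small, synchronous 1 KiB writes. */
    #include <fcntl.h>
    #include <unistd.h>

    static void spool_file(const char *path, const char *data, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return;

        size_t off = 0;
        while (off < len) {
            size_t chunk = len - off < 1024 ? len - off : 1024;
            ssize_t n = write(fd, data + off, chunk);  /* one small request per call */
            if (n <= 0)
                break;
            off += (size_t)n;
        }
        close(fd);
    }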

Each file is a new file, so it gets created at the first write.

I believe the problem is that the requests the OS ends up making (after the Linux IO scheduler merges them) are pretty small, on the order of 10-20 blocks (5-10 kilobytes) each.

The only way I see to fix the issue is to make bigger requests, but each file is small, so I am not quite sure what the best way forward is.

A possible idea would be to make a single write request instead of a loop of several requests: look up how big the file will be, allocate enough memory, populate the buffer and finally execute a single write.
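
A minimal sketch of that idea, assuming the full contents of the file are available before writing (names are placeholders):

    /* Sketch: buffer the whole file in memory and issue one large write()
     * instead of many 1 KiB ones, so the kernel gets one big request. */
    #include <fcntl.h>
    #include <unistd.h>

    static int write_whole_file(const char *path, const char *data, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        /* write() may still return short counts, so loop on the remainder. */
        size_t off = 0;
        while (off < len) {
            ssize_t n = write(fd, data + off, len - off);
            if (n < 0) {
                close(fd);
                return -1;
            }
            off += (size_t)n;
        }
        return close(fd);
    }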

Another idea would be to switch to async IO, but I haven't understood what the advantages would be in this case.
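
For reference, a minimal sketch of what a POSIX AIO variant could look like (names are placeholders; link with -lrt on glibc). The potential advantage is that many such requests, across many files, can be in flight at once instead of each thread blocking on one synchronous write; whether that actually produces larger requests on the wire depends on the kernel and the network filesystem.

    /* Sketch: queue a write with POSIX AIO, then wait for it to complete.
     * In real use you would queue many of these (one per file) before waiting. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int write_file_async(const char *path, char *data, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = data;
        cb.aio_nbytes = len;
        cb.aio_offset = 0;

        if (aio_write(&cb) < 0) {           /* queues the request, returns immediately */
            close(fd);
            return -1;
        }

        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);     /* block until the request completes */

        ssize_t n = aio_return(&cb);
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }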

Do you have any other suggestions?

Upvotes: 3

Views: 494

Answers (1)

Garrigan Stafford

Reputation: 1403

You can put all the files into a tar archive in memory. Then you can write the tar archive as one large request, and extract it in a separate process, which frees up the writing program.
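
A sketch of what the in-memory tar idea could look like with libarchive (link with -larchive); the buffer, file list and sizes are placeholders and error handling is omitted:

    /* Sketch: build a tar archive in a memory buffer with libarchive, then
     * push the whole group of files to the volume as one large write(). */
    #include <archive.h>
    #include <archive_entry.h>
    #include <unistd.h>

    struct small_file {
        const char *name;
        const char *data;
        size_t      len;
    };

    static int write_files_as_tar(int out_fd, const struct small_file *files,
                                  size_t nfiles, void *buf, size_t buf_size)
    {
        size_t used = 0;

        struct archive *a = archive_write_new();
        archive_write_set_format_pax_restricted(a);      /* portable tar layout */
        archive_write_open_memory(a, buf, buf_size, &used);

        for (size_t i = 0; i < nfiles; i++) {
            struct archive_entry *e = archive_entry_new();
            archive_entry_set_pathname(e, files[i].name);
            archive_entry_set_size(e, files[i].len);
            archive_entry_set_filetype(e, AE_IFREG);
            archive_entry_set_perm(e, 0644);
            archive_write_header(a, e);
            archive_write_data(a, files[i].data, files[i].len);
            archive_entry_free(e);
        }

        archive_write_close(a);    /* finalizes the archive; 'used' holds its size */
        archive_write_free(a);

        /* One large request; a separate process can later run "tar -xf" on it. */
        return write(out_fd, buf, used) == (ssize_t)used ? 0 : -1;
    }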

Here is an idea that is a bit more "creative". First, put the files into groups based on where they are being saved (possibly by directory). Then find the largest file in the group and pad the content of every other file so that all the files are the same size. Then append the files to each other so you have one large file, and send that as a single write request. Now one large file has been written that contains a lot of equally sized smaller files, so use the Linux split command to split it back into the original files (https://kb.iu.edu/d/afar). This could work, but you have to be OK with files having padding at the end; a sketch of the pad-and-concatenate step is below.
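
Under those assumptions, the pad-and-concatenate step could look like this (the struct and names are placeholders); the receiving side would then run split -b with the padded size to cut the blob back into the individual, padded files:

    /* Sketch: pad every file in a group to the size of the largest one,
     * concatenate them, and send the group as a single large write(). */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    struct group_file {
        const char *data;
        size_t      len;
    };

    static int write_group_padded(int out_fd, const struct group_file *files,
                                  size_t nfiles)
    {
        size_t max_len = 0;
        for (size_t i = 0; i < nfiles; i++)
            if (files[i].len > max_len)
                max_len = files[i].len;

        char *buf = calloc(nfiles, max_len);   /* zero padding comes for free */
        if (buf == NULL)
            return -1;

        for (size_t i = 0; i < nfiles; i++)
            memcpy(buf + i * max_len, files[i].data, files[i].len);

        ssize_t n = write(out_fd, buf, nfiles * max_len);   /* one big request */
        free(buf);
        return n == (ssize_t)(nfiles * max_len) ? 0 : -1;
    }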

EDIT: It is important to note that these solutions are not scalable. The long-term solution would be what @AndrewHenle suggested in the comments.

Upvotes: 1
