TiGi

Reputation: 37

Reading multiple files from Alluxio with Spark/Java is slow

I have installed Alluxio locally with Spark and inserted 1000 files into Alluxio's memory.
Nevertheless, reading the files is very slow: the read time from Alluxio memory is the same as the read time from disk, and I don't understand why.

File Name   Size    Block Size  In-Memory   Persistence State   Pin Creation Time   Modification Time
file1   54.73KB 512.00MB     100%   NOT_PERSISTED   NO  08-16-2016 12:52:31:278 08-16-2016 12:52:31:372
file2   54.73KB 512.00MB     100%   NOT_PERSISTED   NO  08-16-2016 12:52:31:377 08-16-2016 12:52:31:384
file3   54.72KB 512.00MB     100%   NOT_PERSISTED   NO  08-16-2016 12:52:31:386 08-16-2016 12:52:31:393
file4   54.71KB 512.00MB     100%   NOT_PERSISTED   NO  08-16-2016 12:52:31:394 08-16-2016 12:52:31:400
file5   54.72KB 512.00MB     100%   NOT_PERSISTED   NO  08-16-2016 12:52:31:401 08-16-2016 12:52:31:407
...

I read the data with the File API:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

import alluxio.AlluxioURI;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.URIStatus;

FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/partition0");
List<URIStatus> status = fs.listStatus(path);
for (int i = 0; i < status.size(); i++) {
    path = new AlluxioURI(status.get(i).getPath());
    if (fs.exists(path)) {
        FileInStream in = fs.openFile(path);
        String file = "";

        InputStreamReader ipsr = new InputStreamReader(in);
        BufferedReader br = new BufferedReader(ipsr);

        String line = br.readLine();
        while (line != null) {
            // Accumulate the file contents line by line
            file = file + line;
            line = br.readLine();
        }

        byte[] cfv = file.getBytes();
        br.close();
        // Close the file, relinquishing the lock
        in.close();
    }
}

I am not using Spark for now because even this test, reading a partition of 1000 files, is very slow... (I want to read files partition by partition with Spark in the future.)

Why is the read time using this method/library so slow?

Upvotes: 0

Views: 780

Answers (2)

TiGi

Reputation: 37

After some tests, file size turns out to be the main factor in reading time: small files can multiply the reading time by 20 or more. Block size also affects the reading time, but only by about 1%.

Upvotes: 0

RobV

Reputation: 28655

There are a couple of things that look a bit off in your example.

Firstly, the information you show about your files suggests that they are very small, at about 50 kB each, yet you have Alluxio configured to use 512 MB blocks. This potentially means that you are transferring far more data than you actually need to. So if you intend to work primarily with small files, you would be better off configuring a much smaller block size, as sketched below.
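For example, assuming the Alluxio 1.x client property name `alluxio.user.block.size.bytes.default` (check the docs for your version), you could set a block size closer to your file sizes in conf/alluxio-site.properties:

# Use 1 MB blocks instead of the 512 MB default for workloads of ~50 kB files
alluxio.user.block.size.bytes.default=1MB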

Secondly, the way you actually read the file in your test case is horribly inefficient. You are reading line by line as strings and using string concatenation to build up the file, which you then convert back into bytes. So you are going from bytes in memory, to strings, and back to bytes. Plus, because string concatenation creates a fresh copy on each append, the whole of the file read so far is copied in memory for every additional line you read.

Typically you would either read the file line by line into a StringBuilder (or write it to another Writer), or read it as bytes into a byte[] (or write it to another OutputStream, e.g. a ByteArrayOutputStream if you ultimately want a byte[] and don't know the size in advance).
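As a sketch of the byte-based approach (the helper name and buffer size are illustrative, not part of the Alluxio API):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Read an entire InputStream into a byte[] without per-line String copies.
static byte[] readFully(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192]; // chunk size is illustrative
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
    return out.toByteArray();
}

The loop body in your question then reduces to byte[] cfv = readFully(in);, which is linear in the file size rather than quadratic.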

The third consideration is where your code runs within your cluster. Even if the files are in memory, they may not be in memory on every node of the cluster. If you read the files from a node where they are not yet in memory, then they have to be read across the network, at which point performance will be reduced.
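If locality turns out to be the issue, the Alluxio client lets you pass a read type when opening a file. A sketch against the 1.x File API (assuming OpenFileOptions and ReadType.CACHE_PROMOTE, which promotes the blocks read into the local worker's memory; the path is illustrative):

import alluxio.AlluxioURI;
import alluxio.client.ReadType;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.options.OpenFileOptions;

// Open the file so that any blocks read are promoted into the
// top storage tier of the local Alluxio worker.
FileSystem fs = FileSystem.Factory.get();
OpenFileOptions options = OpenFileOptions.defaults().setReadType(ReadType.CACHE_PROMOTE);
FileInStream in = fs.openFile(new AlluxioURI("/partition0/file1"), options);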

The final consideration is OS file caching. If you generated your test files and then ran your test immediately, those files are likely still cached in memory by the OS, at which point you will get performance as good as, if not better than, Alluxio's, because the caching is at the OS level. If you really want to make a meaningful comparison, you need to flush your OS file caches before running any file-based tests (on Linux, for example, by running sync and then writing 3 to /proc/sys/vm/drop_caches as root).

Upvotes: 2
