Reputation: 37
I have installed Alluxio locally with Spark and inserted 1000 files into Alluxio memory.
Nevertheless, reading the files is very slow.
The time to read a file from Alluxio memory is the same as the time to read it from disk.
I don't understand why.
File Name Size Block Size In-Memory Persistence State Pin Creation Time Modification Time
file1 54.73KB 512.00MB 100% NOT_PERSISTED NO 08-16-2016 12:52:31:278 08-16-2016 12:52:31:372
file2 54.73KB 512.00MB 100% NOT_PERSISTED NO 08-16-2016 12:52:31:377 08-16-2016 12:52:31:384
file3 54.72KB 512.00MB 100% NOT_PERSISTED NO 08-16-2016 12:52:31:386 08-16-2016 12:52:31:393
file4 54.71KB 512.00MB 100% NOT_PERSISTED NO 08-16-2016 12:52:31:394 08-16-2016 12:52:31:400
file5 54.72KB 512.00MB 100% NOT_PERSISTED NO 08-16-2016 12:52:31:401 08-16-2016 12:52:31:407
...
I read the data with the file API:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

import alluxio.AlluxioURI;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.URIStatus;

FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/partition0");
List<URIStatus> status = fs.listStatus(path);
for (int i = 0; i < status.size(); i++) {
    path = new AlluxioURI(status.get(i).getPath());
    if (fs.exists(path)) {
        FileInStream in = fs.openFile(path);
        String file = "";
        InputStreamReader ipsr = new InputStreamReader(in);
        BufferedReader br = new BufferedReader(ipsr);
        String line = br.readLine();
        while (line != null) {
            //System.out.println(line);
            file = file + line;
            line = br.readLine();
        }
        byte[] cfv = file.getBytes();
        br.close();
        // Close the file, relinquishing the lock
        in.close();
    }
}
I am not using Spark for now because the test of reading a partition with 1000 files is already very slow... (I want to read files by partition with Spark in the future).
Why is the read time with this method/library so slow?
Upvotes: 0
Views: 780
Reputation: 37
After some tests, file size is the main factor in the reading time: small files can multiply the reading time by 20 or more. Block size also affects the reading time, but only by about 1%.
Upvotes: 0
Reputation: 28655
There are a couple of things that look a bit off in your example.
Firstly, the information you show about your files suggests that the files are very small, about 50 kB each, yet you have Alluxio configured to use 512 MB blocks. This potentially means that you are transferring far more data than you actually need. So one thing to consider is that if you intend to work primarily with small files, you would be better off configuring a much smaller block size.
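For example, a minimal sketch of that configuration change, assuming you lower the client-side default block size via conf/alluxio-site.properties (pick a value that matches your expected file sizes):

# conf/alluxio-site.properties
# Use a small default block size when most files are only tens of kB
alluxio.user.block.size.bytes.default=1MB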
Secondly, the way you actually read the file in your test case is horribly inefficient. You are reading line by line as a string, using string concatenation to build up the file, which you then convert back into bytes. So you are going from bytes in memory, to strings, and then back to bytes. Plus, by using string concatenation you are forcing the whole of the file read so far to be copied in memory with each additional line you read.
Typically you would either read the file line by line into a StringBuilder (or write it to another Writer), or read the file as bytes into a byte[] (or write it to another OutputStream, e.g. a ByteArrayOutputStream if you want to ultimately get a byte[] and don't know the size in advance).
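As a rough sketch of the byte-based approach (the class name and the /partition0/file1 path are just placeholders based on your example, and the 8 kB buffer size is arbitrary):

import java.io.ByteArrayOutputStream;

import alluxio.AlluxioURI;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileSystem;

public class AlluxioByteRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.Factory.get();
        // Example path taken from the question's listing
        AlluxioURI path = new AlluxioURI("/partition0/file1");
        try (FileInStream in = fs.openFile(path)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];   // copy in reasonably sized chunks
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            byte[] cfv = out.toByteArray();   // whole file as bytes, no String round-trip
        }
    }
}

This avoids both the char decoding and the repeated copying that string concatenation causes.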
The third consideration is where your code runs within your cluster. Even if the files are in memory, they may not be in memory on every node in the cluster. If you read the files from a node where they are not yet in memory, then they have to be read across the network, at which point performance will be reduced.
The final consideration is OS file caching. If you generated your test files and then ran your test immediately, those files are likely still cached in memory by the OS, at which point you will get performance as good as, if not better than, Alluxio because the caching is at the OS level. If you really want to make a meaningful comparison then you need to flush the OS file caches before running any file-based tests.
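On Linux, for example, you can drop the OS page cache between runs with the following generic OS commands (nothing Alluxio-specific; requires root):

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches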
Upvotes: 2