Reputation: 2384
I decided to use Parquet as storage format for hive tables and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests as against the general notion that it is faster than plain text files.
Please be noted that I am using Hive-0.13 on MapR
----------------------------------------------------------
| | Table A | Table B | Table C | |
----------------------------------------------------------
| Format | Text | Parquet | Parquet | |
| Size[Gb] | 2.5 | 1.9 | 1.9 | |
| Comrepssion | N/A | N/A | Snappy | |
| CPU [sec] | 123.33 | 204.92 | N/A | Operation1 |
| Time [sec] | 59.057 | 50.33 | N/A | Operation1 |
| CPU [sec] | 51.18 | 117.08 | N/A | Operation2 |
| Time [sec] | 25.296 | 27.448 | N/A | Operation2 |
| CPU [sec] | 57.55 | 113.97 | N/A | Operation3 |
| Time [sec] | 20.254 | 27.678 | N/A | Operation3 |
| CPU [sec] | 57.55 | 113.97 | N/A | Operation4 |
| Time [sec] | 20.254 | 27.678 | N/A | Operation4 |
| CPU [sec] | 127.85 | 255.2 | N/A | Operation5 |
| Time [sec] | 29.68 | 41.025 | N/A | Operation5 |
You can see that in almost all the operations that I have applied on both the tables, Parquet is lagging behind in terms of time taken to execute the query with an exception of row count operation.
I also used table C to perform the aforementioned operations but the results were almost on similar lines with TextFile format again was snappier of the two.
Can some one please let me know what I am doing wrong?
Thanks!
EDIT
I added ORC to the list of storage formats and ran the tests again. Follows the details.
Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec
Sum of a column operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec
Average of a column operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec
Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
Does that mean ORC is faster then Parquet? Or there is something that I can do to make it work better with query response time and compression ratio?
Thanks!
Upvotes: 7
Views: 3359
Reputation: 8562
First I would like to just point out that it is virtually impossible to answer your question with the given details.
Few points:
measuring time in a distributed environment is not the way to determine if something is slow (if you have many queries running and competing for resources you are not measuring what you think you are measuring)
not providing the actual table definition and the queries running against those tables makes this problem impossible to reproduce
not providing the number of rows of the table and the cardinality its individual fields is also not helping
In general, querying Parquet is much faster than querying text files because Parquet employs many things to make read operations much faster. Few of these things:
Depending on the use case some of the parameters of things can be tuned to the exact use case.
Upvotes: 0