Rahul

Reputation: 2384

Why is querying Parquet files slower than text files in Hive?

I decided to use Parquet as the storage format for my Hive tables, and before rolling it out on my cluster I ran some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files.

Please note that I am using Hive 0.13 on MapR.

----------------------------------------------------------
| Metric      | Table A | Table B | Table C | Operation  |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size [GB]   | 2.5     | 1.9     | 1.9     |            |
| Compression | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |

You can see that in almost all the operations I ran on both tables, Parquet lags behind in query execution time, the exception being the row count operation.

I also ran the aforementioned operations on Table C, but the results were along similar lines, with the TextFile format again the snappier of the two.

Can someone please let me know what I am doing wrong?

Thanks!

EDIT

I added ORC to the list of storage formats and ran the tests again. The details follow.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet perform better in terms of query response time and compression ratio?

Thanks!

Upvotes: 7

Views: 3359

Answers (1)

Istvan

Reputation: 8562

First, I would like to point out that it is virtually impossible to answer your question with the given details.

A few points:

  • measuring wall-clock time in a distributed environment is not a reliable way to determine whether something is slow (if many queries are running and competing for resources, you are not measuring what you think you are measuring)

  • not providing the actual table definition and the queries running against those tables makes this problem impossible to reproduce

  • not providing the number of rows in the table and the cardinality of its individual fields does not help either
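To illustrate the first point, here is a minimal sketch of how repeated timings of the same query could be compared instead of relying on a single run. The function names and the sample timings are hypothetical (they are not taken from the question), and the comparison rule is only a crude heuristic, not a proper statistical test.

```python
import statistics


def summarize_runs(seconds):
    """Return the median and sample standard deviation of repeated timings."""
    if len(seconds) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "runs": len(seconds),
        "median": statistics.median(seconds),
        "stdev": statistics.stdev(seconds),
    }


def clearly_different(a, b):
    """Crude heuristic: the medians differ by more than the larger spread."""
    sa, sb = summarize_runs(a), summarize_runs(b)
    return abs(sa["median"] - sb["median"]) > max(sa["stdev"], sb["stdev"])


# Hypothetical wall-clock timings in seconds for the same query, run 5 times
# against each table. One text run hit a busy cluster and took much longer.
text_runs = [20.0, 22.0, 21.0, 35.0, 20.5]
parquet_runs = [27.0, 26.0, 28.0, 27.5, 26.5]

print(summarize_runs(text_runs))
print(summarize_runs(parquet_runs))
print(clearly_different(text_runs, parquet_runs))
```

With these made-up numbers a single noisy run is enough to make the comparison inconclusive, which is why one measurement per query says very little on a shared cluster.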

In general, querying Parquet is much faster than querying text files because Parquet uses several techniques to speed up reads. A few of these:

  • compression
  • run length encoding
  • dictionary encoding
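As a rough illustration of the last two items, the sketch below shows the idea behind run-length and dictionary encoding on a single column. It is a toy model only: Parquet's real on-disk encodings are more involved, and the function names and the sample column are made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    codes = []        # one small integer per row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


# Hypothetical low-cardinality column, e.g. a country code per row.
column = ["IN", "IN", "IN", "US", "US", "IN", "HU", "HU", "HU", "HU"]

runs = run_length_encode(column)
dictionary, codes = dictionary_encode(column)

print(runs)
print(dictionary, codes)
```

Both tricks pay off when a column has few distinct values or long runs of repeats, which is why the cardinality of the fields matters when judging these results.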

Depending on the use case, some of the parameters behind these techniques can be tuned.

Upvotes: 0
