Why is querying Parquet files is slower than text files in Hive?

Question

I decided to use Parquet as storage format for hive tables and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests as against the general notion that it is faster than plain text files.

Please be noted that I am using Hive-0.13 on MapR

----------------------------------------------------------
|             | Table A | Table B | Table C |            |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size[Gb]    | 2.5     | 1.9     | 1.9     |            |
| Comrepssion | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |

Operation1: Row count operation
Operation2: Single Row Selection
Operation3: Multi Row Selection Using Where clause [1000 rows fetched]
Operation4: Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]
Operation5: Aggregation operation [Using sum function on a given column]

You can see that in almost all the operations that I have applied on both the tables, Parquet is lagging behind in terms of time taken to execute the query with an exception of row count operation.

I also used table C to perform the aforementioned operations but the results were almost on similar lines with TextFile format again was snappier of the two.

Can some one please let me know what I am doing wrong?

Thanks!

EDIT

I added ORC to the list of storage formats and ran the tests again. Follows the details.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec

Does that mean ORC is faster then Parquet? Or there is something that I can do to make it work better with query response time and compression ratio?

Thanks!

Why is querying Parquet files is slower than text files in Hive?

Answers (1)

Related Questions