azuric

Reputation: 2839

Row-based access using the ParquetSharp library in C#, which is based on apache-parquet-cpp (Arrow)

Does anyone know how to perform row-based read access to a Parquet file using ParquetSharp? This is as far as I have got, but passing inputStream to CreateRowReader produces a "cannot convert to string" error.

using (var buffer = new ResizableBuffer())
{
    using (var reader = new ParquetFileReader(@"C:\Users\X\Documents\X.parquet"))
    {
        using (var inputStream = new BufferReader(buffer))
        {
            using (var readerRow = ParquetFile.CreateRowReader<Tuple>(inputStream))
            {
            }
        }
    }
}

Also, ParquetSharp uses TTuple, but I cannot find a definition for it anywhere.

I know Parquet is column-based, so this is not the most efficient way to read it, but it is convenient for my work.

Regards

Upvotes: 0

Views: 1900

Answers (1)

Tanguy Fautré

Reputation: 298

The row-oriented API of ParquetSharp uses reflection to discover the public fields of the given row structure or class. TTuple is just a generic parameter, a placeholder for the row type.

It works with custom structures or classes, System.Tuple and System.ValueTuple. You can see a few examples in https://github.com/G-Research/ParquetSharp/blob/master/csharp.test/TestRowOrientedParquetFile.cs

To take your example, you would define your expected row type:

internal struct MyStruct
{
    public readonly int FirstField;
    public readonly string SecondField;
}

And then somewhere in your method:

using (var reader = ParquetFile.CreateRowReader<MyStruct>(@"C:\Users\X\Documents\X.parquet"))
{
    /* read rows */
}

That said, I personally prefer C# 7 tuples, which save you the trouble of defining your own struct in the first place. The only downside is that, when writing a Parquet file, ParquetSharp cannot automatically infer the column names from the field names, because internally both System.Tuple and System.ValueTuple have generic member names such as Item1, Item2, etc.

using (var reader = ParquetFile.CreateRowReader<(int firstField, string secondField)>(@"C:\Users\X\Documents\X.parquet"))
{
    /* read rows */
}
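
To illustrate the point about field discovery and tuple names, here is a minimal sketch using only System.Reflection. It is not ParquetSharp's actual implementation; ExampleRow and RowFields.Describe are hypothetical names made up for this example. It lists the public instance fields of a row type, which is roughly the information a reflection-based row API has to work with.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical row type, used only for this illustration.
internal struct ExampleRow
{
    public int FirstField;
    public string SecondField;
}

internal static class RowFields
{
    // Returns the public instance fields of TTuple as "TypeName FieldName" strings.
    // This is an assumption about the general approach, not ParquetSharp's own code.
    public static string[] Describe<TTuple>()
    {
        return typeof(TTuple)
            .GetFields(BindingFlags.Public | BindingFlags.Instance)
            .Select(f => f.FieldType.Name + " " + f.Name)
            .ToArray();
    }
}

internal static class Program
{
    private static void Main()
    {
        // A custom struct exposes its declared field names.
        foreach (var line in RowFields.Describe<ExampleRow>())
            Console.WriteLine(line);

        // A C# 7 tuple is a System.ValueTuple at runtime: the names
        // firstField/secondField exist only at compile time, so reflection
        // sees the underlying Item1 and Item2 fields instead.
        foreach (var line in RowFields.Describe<(int firstField, string secondField)>())
            Console.WriteLine(line);
    }
}
```

For the struct, the field names are available at runtime and can serve as column names; for the tuple, only Item1 and Item2 are visible, which is why column names cannot be inferred from a tuple when writing.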

Upvotes: 1
