Reputation: 6465

How to read a CSV file using pyarrow in Python

I have made a connection to my HDFS using the following code:

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(self.namenode, self.port, user=self.username, kerb_ticket = self.cert)

I'm using the following command to read a Parquet file:

fs.read_parquet()

but there is no read method for regular text files (e.g. a CSV file). How can I read a CSV file using pyarrow?

Upvotes: 2

Answers (2)

Reputation: 325

You can set up a Spark session to connect to HDFS, then read the file from there.

from pyspark.sql import SparkSession

ss = SparkSession.builder.appName(...).getOrCreate()
csv_file = ss.read.csv('/user/file.csv')

Another way is to open the file first, then read it using csv.read_csv. Here is what I ended up using.

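from pyarrow import csv
file = 'hdfs://user/file.csv'

# fs is the HDFS connection created in the question
with fs.open(file, 'rb') as f:
    # read_csv accepts an open binary file handle and returns a pyarrow Table
    csv_file = csv.read_csv(f)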

Upvotes: 0

Reputation: 1708

You need to create a file-like object and use the CSV module directly. See pyarrow.csv.read_csv.

Upvotes: 1