Reputation: 6465
I have connected to my HDFS using the following code:
import pyarrow as pa
import pyarrow.parquet as pq
fs = pa.hdfs.connect(self.namenode, self.port, user=self.username, kerb_ticket=self.cert)
I'm using the following call to read a parquet file:
fs.read_parquet()
but there is no read method for regular text files (e.g. a CSV file). How can I read a CSV file using pyarrow?
Upvotes: 2
Views: 2924
Reputation: 325
You can set up a Spark session to connect to HDFS, then read the file from there:
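from pyspark.sql import SparkSession
# getOrCreate() is needed to turn the builder into an actual session
ss = SparkSession.builder.appName(...).getOrCreate()
csv_file = ss.read.csv('/user/file.csv')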
Another way is to open the file first, then read it using csv.read_csv. Here is what I used in the end:
from pyarrow import csv
file = 'hdfs://user/file.csv'
with fs.open(file, 'rb') as f:
    csv_file = csv.read_csv(f)
Upvotes: 0
Reputation: 1708
You need to create a file-like object and pass it to the pyarrow.csv module directly. See pyarrow.csv.read_csv.
Upvotes: 1