Reputation: 1052
I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()
It prints the file's contents to stdout.
Is there any way for me to read in this file as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error is:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
Upvotes: 10
Views: 34841
Reputation: 43
You can read and write with pyarrow natively. I found the pydoop library to be a bit clumsy and to require lots of annoying dependencies. The syntax is as follows:
from pyarrow import fs
import pyarrow.parquet as pq

# connect to hadoop
hdfs = fs.HadoopFileSystem('hostname', 8020)

# read a single parquet file from hdfs
with hdfs.open_input_file(path) as pqt:
    df = pq.read_table(pqt).to_pandas()

# read a directory full of partitioned parquets (i.e. from spark)
df = pq.ParquetDataset(path, hdfs).read().to_pandas()
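For a plain CSV like the one in the question, you can feed the same kind of handle straight to pandas. A minimal sketch, where 'hostname', the port, and the path are placeholders for your cluster:

import pandas as pd
from pyarrow import fs

# connect to the namenode ('hostname' and 8020 are placeholders)
hdfs = fs.HadoopFileSystem('hostname', 8020)

# open_input_file returns a file-like object that pandas can read from
with hdfs.open_input_file('/home/file.csv') as f:
    df = pd.read_csv(f)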
Upvotes: 0
Reputation: 11
You can use the following code to read a CSV from HDFS:
import pandas as pd
import pyarrow as pa

hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user"
}

fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
                     user=hdfs_config['user'])
df = pd.read_csv(fs.open("/home/file.csv"))
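If you also need to write results back to HDFS, the same handle works in write mode. A minimal sketch under the same placeholder config ('/home/out.csv' is a hypothetical output path); note that this pa.hdfs API is deprecated in newer pyarrow releases in favor of pyarrow.fs.HadoopFileSystem:

# write the dataframe back to hdfs as csv ('/home/out.csv' is a placeholder)
with fs.open("/home/out.csv", "wb") as f:
    f.write(df.to_csv(index=False).encode("utf-8"))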
Upvotes: 1
Reputation: 7
Use read
instead open
, it works
with hd.read("/home/file.csv") as f:
df = pd.read_csv(f)
Upvotes: 0
Reputation: 231385
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
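As a quick illustration of that assumption (a self-contained sketch, not tied to hdfs at all), read_csv happily consumes any file-like object, not just a path:

import pandas as pd
from io import StringIO

# read_csv accepts any file-like object, not only a filename
buf = StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(buf)
print(df)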
pd.read_csv("/home/file.csv") would work if the regular Python file open works - i.e. if it reads the file as a regular local file:
with open("/home/file.csv") as f:
    print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
Upvotes: 18