Reputation: 620
Since Modin does not support loading from multiple pyarrow files on s3, I am using pyarrow to load the data.
import s3fs
import modin.pandas as pd
from pyarrow import parquet
s3 = s3fs.S3FileSystem(
key=aws_key,
secret=aws_secret
)
table = parquet.ParquetDataset(
path_or_paths="s3://bucket/path",
filesystem=s3,
).read(
columns=["hotelId", "startDate", "endDate"]
)
# to get a pandas df the next step would be table.to_pandas()
If I know want to put the data in a Modin df for parallel computations without having to write to and read from a csv? Is there a way to construct the Modin df directly from a pyarrow.Table or at least from a pandas dataframe?
Upvotes: 5
Views: 1142
Reputation: 101
Mahesh's answer should work but I believe it would result in a full data copy (2X memory footprint by default: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
At the time of writing Modin does have a native arrow integration, so you can directly convert using
from modin.pandas.utils import from_arrow
mdf = from_arrow(pyarrow_table)
Upvotes: 2
Reputation: 176
You can't construct the Modin dataframe directly out of a pyarrow.Table
, because pandas doesn't support that, and Modin only supports a subset of the pandas API. However, the table has a method that converts it to a pandas dataframe, and you can construct the Modin dataframe out of that. Using table
from your code:
import modin.pandas as pd
modin_dataframe = pd.Dataframe(table.to_pandas())
Upvotes: 0