Opal

Reputation: 84854

How to read multiple parquet tables?

I have the following folder structure:

.
└── raw
    ├── enwiki-20200401-pages-articles-multistream.xml
    ├── enwiki-20200401-pages-articles-multistream.xml.bz2
    ├── stg
    ├── wkp_header
    ├── wkp_link_external
    ├── wkp_link_wiki
    ├── wkp_page
    ├── wkp_page_simple
    ├── wkp_redirect
    ├── wkp_table
    ├── wkp_tag
    ├── wkp_template
    ├── wkp_template_param
    └── wkp_text

Each of these wkp_* directories contains *.parquet files.

When I try to read the data in the following way:

val df = spark.read.parquet(
  List(
    "raw/wkp_text",
    "raw/wkp_page"): _*
)
df.printSchema()

only the schema of wkp_page is printed.

Why is that? How can I check whether the data from all of the passed tables has actually been loaded? And how can I refer to the wkp_text table afterwards?

Upvotes: 2

Views: 497

Answers (1)

Nir Hedvat

Reputation: 870

Try

spark.read.option("mergeSchema", "true").parquet(...)

Please notice that the schemas of all the Parquet files being read must be compatible, since mergeSchema combines them into a single unified schema. Without this option, Spark picks the schema from one of the files and does not merge the rest.
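A minimal sketch of this applied to the paths from the question (assuming an existing `spark` session). `input_file_name` is a standard Spark SQL function that can be used to tell rows from the two directories apart after the merged read:

```scala
import org.apache.spark.sql.functions.{col, input_file_name}

// Read both directories with schema merging enabled, so columns that
// exist in only one of the tables are kept (null for rows from the other).
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("raw/wkp_text", "raw/wkp_page")

df.printSchema()  // now shows the union of both schemas

// Tag each row with the file it came from, so rows belonging to
// wkp_text vs. wkp_page can be distinguished.
val tagged = df.withColumn("source_file", input_file_name())
tagged.filter(col("source_file").contains("wkp_text")).show()
```

That said, if wkp_text and wkp_page are genuinely different tables, it is usually clearer to read each directory into its own DataFrame (e.g. `spark.read.parquet("raw/wkp_text")`) and combine them explicitly with a join or union, rather than relying on schema merging.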

Upvotes: 3
