Reputation: 84854
I have the following folder structure:
.
└── raw
    ├── enwiki-20200401-pages-articles-multistream.xml
    ├── enwiki-20200401-pages-articles-multistream.xml.bz2
    ├── stg
    ├── wkp_header
    ├── wkp_link_external
    ├── wkp_link_wiki
    ├── wkp_page
    ├── wkp_page_simple
    ├── wkp_redirect
    ├── wkp_table
    ├── wkp_tag
    ├── wkp_template
    ├── wkp_template_param
    └── wkp_text
All of the wkp_* directories contain *.parquet files.
When I try to read the data in the following way:
val df = spark.read.parquet(
  List(
    "raw/wkp_text",
    "raw/wkp_page"): _*
)
df.printSchema()
I only get the schema for wkp_page printed. Why is that? How can I check whether all the data (from all the tables passed) has been loaded? And how do I refer to the wkp_text table?
Upvotes: 2
Views: 497
Reputation: 870
Try:

spark.read.option("mergeSchema", "true").parquet(...)

Note that all the Parquet files being read must have compatible schemas, since mergeSchema unions them into a single schema.
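Applied to the directories from the question, a minimal sketch could look like this (assuming spark-shell, where spark is already defined; input_file_name() is used here as one way to verify which files contributed rows):

import org.apache.spark.sql.functions.input_file_name

// mergeSchema makes Spark union the schemas of all Parquet files
// instead of inferring the schema from a single file.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("raw/wkp_text", "raw/wkp_page")

// Should now show columns from both wkp_text and wkp_page.
df.printSchema()

// One way to check that data from both directories was loaded:
// count the rows per source file.
df.withColumn("source", input_file_name())
  .groupBy("source")
  .count()
  .show(truncate = false)

Filtering on that source column (for example with a contains("wkp_text") predicate) is also one way to get at the rows that came from a specific table.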
Upvotes: 3