spark_dev

Reputation: 1

Read a specific file from nested sub-folders

I'm reading a single file from a subfolder and it's working fine:

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("SparkAndHive")
      .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse2")
      .enableHiveSupport()
      .getOrCreate()

    GeoSparkSQLRegistrator.registerAll(spark.sqlContext)

    // read one shapefile directory at a time
    val roadRDD = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/IND_rds")
    val railRDD = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/IND_rrd")

    val rawSpatialDf = Adapter.toDf(roadRDD, spark)
    rawSpatialDf.createOrReplaceTempView("rawSpatialDf")

    spark.sql("select * from rawSpatialDf").show()

The problem is that my current folder structure is as below:

.
├── ind
│   ├── IND_rds
│   │   ├── IND_roads.dbf
│   │   ├── IND_roads.prj
│   │   ├── IND_roads.shp
│   │   └── IND_roads.shx
│   └── IND_rrd
│       ├── IND_rails.dbf
│       ├── IND_rails.prj
│       ├── IND_rails.shp
│       └── IND_rails.shx
├── nep
│   ├── NPL_rds
│   │   ├── NPL_roads.dbf
│   │   ├── NPL_roads.prj
│   │   ├── NPL_roads.shp
│   │   └── NPL_roads.shx
│   └── NPL_rrd
│       ├── NPL_rails.dbf
│       ├── NPL_rails.prj
│       ├── NPL_rails.shp
│       └── NPL_rails.shx

Is there any alternative approach to pick up every country and its respective road and rail directories dynamically from the nested folders?

Upvotes: 0

Views: 635

Answers (2)

Shubham B

Reputation: 1

Try using glob/wildcard characters in the input path:

inputpath= "src/main/resources/[A-Za-z]\*/\*_rds"
inputpath= "src/main/resources/[A-Za-z]\*/\*_rrd"

Upvotes: 0

Young

Reputation: 584

You haven't given the complete code, but based on my understanding, you could do this better with a partitioned table and the Spark SQL module.

Loading data into DataFrames is much better than loading it into RDDs.

In your table you could use two partition columns, country and rail, and when you read the data you would just specify the root directory instead of the path root/country_name/rail_name. The schema of the DataFrame you obtain would then be all the columns in your files plus country_name and rail_name.

However, you need to rename your directories first, like this:

.
├── country_name=ind
│   ├── rail_name=IND_rds
│   │   ├── IND_roads.dbf
│   │   ├── IND_roads.prj
│   │   ├── IND_roads.shp
│   │   └── IND_roads.shx
│   └── rail_name=IND_rrd
│       ├── IND_rails.dbf
│       ├── IND_rails.prj
│       ├── IND_rails.shp
│       └── IND_rails.shx
├── country_name=nep
│   ├── rail_name=NPL_rds
│   │   ├── NPL_roads.dbf
│   │   ├── NPL_roads.prj
│   │   ├── NPL_roads.shp
│   │   └── NPL_roads.shx
│   └── rail_name=NPL_rrd
│       ├── NPL_rails.dbf
│       ├── NPL_rails.prj
│       ├── NPL_rails.shp
│       └── NPL_rails.shx

then:

 import org.apache.spark.sql.functions.col

 val df = spark.read.load("the root path")   // partition columns country_name and rail_name are discovered automatically
 val df_ind = df.filter(col("country_name") === "ind")

For more info, you can refer to the partition discovery section in this doc: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Alternatively, if you cannot rename the folders to this format, refer to this link: How to make Spark session read all the files recursively? In short, if you are on Spark 3 or later, you use recursiveFileLookup; otherwise you have to work with the HDFS listFiles API.
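
A minimal sketch of both options, with CSV as a stand-in source for the Spark 3 case (spark.read has no built-in shapefile format) and hypothetical paths that mirror the question's layout:

    // Spark 3+: recursiveFileLookup makes the reader descend into every
    // sub-directory under the root instead of expecting partition folders
    val allFiles = spark.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources")

    // before Spark 3: list the leaf directories yourself with the Hadoop
    // FileSystem API and read them one by one (e.g. with ShapefileReader)
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val leafDirs = fs.listStatus(new Path("src/main/resources"))
      .filter(_.isDirectory)                        // ind, nep, ...
      .flatMap(country => fs.listStatus(country.getPath))
      .filter(_.isDirectory)                        // IND_rds, IND_rrd, ...
      .map(_.getPath.toString)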

Upvotes: 1
