BigDataLearner
BigDataLearner

Reputation: 1468

Filter out spark dataframe rows which do not have certain nested fields in Java

I have below sample row in spark dataframe.

{
    "id":"B07H3MVTSN",
    "mid":4,
    "inner":{
      "type1":[{
          "cid":"B06XVVSLX8"
        },
        {
          "cid":"B06XJ2JZ2Z"
        }
      ],
      "type2":[{
          "cid":"B06XVVSLX1"
        },
        {
          "cid":"B06XJ2JZ22"
        }
      ],
      "type3":[{
          "cid":"B06XVVSLX3"
        },
        {
          "cid":"B06XJ2JZ24"
        }
      ]
    }
  }

I have a list of strings which can be {"type1","type2"}.

I want to filter out(get rid of) rows from the dataframe which do not have either of those two fields. How to add filter() condition which checks existence of nested struct fields from a set of values from a list

e.g below sample row should be removed using filter condition since it does not contain type1 and type2

{
        "id":"B07H3MVTS1",
        "mid":4,
        "inner":{
          "type3":[{
              "cid":"B06XVVSLX3"
            },
            {
              "cid":"B06XJ2JZ24"
            }
          ]
        }
      }

Schema for the above data can be seen in below method getSchema()

private StructType getSchema(){

        ArrayType types = DataTypes.createArrayType(DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("cid", DataTypes.StringType, false)
                }));

        StructType inner = DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("type1", types, true), DataTypes.createStructField("type2", types, true),
                        DataTypes.createStructField("type3", types, true),
                        DataTypes.createStructField("type4", types, true)
                });

        return DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("id", DataTypes.StringType, false),
                        DataTypes.createStructField("mid", DataTypes.IntegerType, false),
                        DataTypes.createStructField("inner", inner, false)
                });
    }

Upvotes: 0

Views: 487

Answers (1)

koiralo
koiralo

Reputation: 23119

You can just filter the list of fields with isNotNull function as below, adjust the filter condition as per your need

val fields = List("type1", "type2")

val filter = fields.map(f => col(s"inner.$f").isNotNull).reduce(_ or _)

df.filter(filter).show(false)

Output:

+----------+------------------------------------------------------------------------------------------+---+
|id        |inner                                                                                     |mid|
+----------+------------------------------------------------------------------------------------------+---+
|B07H3MVTSN|{[{B06XVVSLX8}, {B06XJ2JZ2Z}], [{B06XVVSLX1}, {B06XJ2JZ22}], [{B06XVVSLX3}, {B06XJ2JZ24}]}|4  |
+----------+------------------------------------------------------------------------------------------+---+

Upvotes: 1

Related Questions