Reputation: 1015

How to test if a column exists and is not null in a DataFrame

I have a python RDD:

rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)

I created a dataframe based on this RDD:

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])

I would like to do a test on the two columns 6 and 7:
if column 6 exists in the dataframe and not null I should return dataframe containing this value of the column 6, else I should return a dataframe that contains the value of the column 7. This following my small code:

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

Is the syntax correct and have I the right to write in Python like this ?

Upvotes: 2

Answers (3)

Harvey

Reputation: 339

I think this may be faster

 if 'column_name' not in df.columns:
    do_something 
 if len([x in x for df['column_name'].unique() if x.isna()]) > 0:
    do_something_else

Upvotes: 0

GeorgeLPerkins

Reputation: 1146

if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)

I'm pretty sure you are going to be throwing a KeyError if column6 does not exist.

You could do something like:

if 'column6' in logsDF.columns:
    if logsDF['column6'].notnull().any():
        logsDF.select("column1","column2","column3","column4","column5","column6")
    else:
        logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

Check to see if column6 exists in logsDF columns first. If so, see if any() value is not null.

Column7 is used if column6 does not exist, or if column6 exists but all values are null.

Editing my own comment: Since python will not evaluate the second condition if the first is False, you can actually do:

if 'column6' in logsDF.columns and logsDF['column6'].notnull().any():
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

as long as the 'column6' in logsDF.columns comes first, the logsDF['column6'] will never evaluate and throw the KeyError if column6 doesn't exist.

Upvotes: 2

Akarsh ts

Reputation: 37

if set(['A','C']).issubset(df.columns):
   df['sum'] = df['A'] + df['C']

set([]) can alternatively be constructed with curly braces:

if {'A', 'C'}.issubset(df.columns):

See this question for a discussion of the curly-braces syntax.

Or, you can use a list comprehension, as in:

if all([item in df.columns for item in ['A','C']]):

Upvotes: 2

How to test if a column exists and is not null in a DataFrame

Answers (3)

Related Questions