Reputation: 1015
I have a python RDD:
rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)
I created a dataframe based on this RDD:
logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
I would like to do a test on the two columns 6 and 7
:
if column 6 exists in the dataframe and not null I should return dataframe containing this value of the column 6
, else I should return a dataframe that contains the value of the column 7
.
This following my small code:
logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
logsDF.select("column1","column2","column3","column4","column5","column6")
else:
logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
Is the syntax correct and have I the right to write in Python like this ?
Upvotes: 2
Views: 12306
Reputation: 339
I think this may be faster
if 'column_name' not in df.columns:
do_something
if len([x in x for df['column_name'].unique() if x.isna()]) > 0:
do_something_else
Upvotes: 0
Reputation: 1146
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)
I'm pretty sure you are going to be throwing a KeyError if column6 does not exist.
You could do something like:
if 'column6' in logsDF.columns:
if logsDF['column6'].notnull().any():
logsDF.select("column1","column2","column3","column4","column5","column6")
else:
logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
Check to see if column6 exists in logsDF columns first. If so, see if any() value is not null.
Column7 is used if column6 does not exist, or if column6 exists but all values are null.
Editing my own comment: Since python will not evaluate the second condition if the first is False, you can actually do:
if 'column6' in logsDF.columns and logsDF['column6'].notnull().any():
logsDF.select("column1","column2","column3","column4","column5","column6")
else:
logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
as long as the 'column6' in logsDF.columns comes first, the logsDF['column6'] will never evaluate and throw the KeyError if column6 doesn't exist.
Upvotes: 2
Reputation: 37
if set(['A','C']).issubset(df.columns):
df['sum'] = df['A'] + df['C']
set([])
can alternatively be constructed with curly braces:
if {'A', 'C'}.issubset(df.columns):
See this question for a discussion of the curly-braces syntax.
Or, you can use a list comprehension, as in:
if all([item in df.columns for item in ['A','C']]):
Upvotes: 2