Shweta Kamble

Reputation: 432

Word count NoneType error in PySpark

I am trying to do some text analysis:

import re
import string

from pyspark.sql.functions import udf, explode, split
from pyspark.sql.types import StringType

def cleaning_text(sentence):
    sentence = sentence.lower()
    sentence = re.sub('\'', '', sentence.strip())
    # remove dates and timestamps
    sentence = re.sub('^\d+\/\d+|\s\d+\/\d+|\d+\-\d+\-\d+|\d+\-\w+\-\d+\s\d+\:\d+|\d+\-\w+\-\d+|\d+\/\d+\/\d+\s\d+\:\d+', ' ', sentence.strip())
    sentence = re.sub(r'(.)(\/)(.)', r'\1\3', sentence.strip())
    # strip path-like fragments containing slashes or backslashes
    sentence = re.sub("(.*?\//)|(.*?\\\\)|(.*?\\\)|(.*?\/)", ' ', sentence.strip())
    sentence = re.sub('^\d+', '', sentence.strip())
    # drop punctuation
    sentence = re.sub('[%s]' % re.escape(string.punctuation), '', sentence.strip())
    # keep words of length >= 2 that are not in the small stop list
    cleaned = ' '.join([w for w in sentence.split() if not len(w) < 2 and w not in ('no', 'sc', 'ln')])
    cleaned = cleaned.strip()
    if len(cleaned) <= 1:
        return "NA"
    else:
        return cleaned

org_val = udf(cleaning_text, StringType())
df_new = df.withColumn("cleaned_short_desc", org_val(df["symptom_short_description_"]))
df_new = df_new.withColumn("cleaned_long_desc", org_val(df_new["long_description"]))
longWordsDF = df_new.select(explode(split('cleaned_long_desc', ' ')).alias('word'))
longWordsDF.count()

I get the following error.

File "<stdin>", line 2, in cleaning_text AttributeError: 'NoneType' object has no attribute 'lower'

I want to perform word counts, but any kind of aggregation function gives me this error.

I tried the following things:

sentence=sentence.encode("ascii", "ignore")

I added this statement in the cleaning_text function.

df.dropna()

It's still giving the same error, and I do not know how to resolve this issue.

Upvotes: 0

Views: 274

Answers (1)

Mariusz

Reputation: 13926

It looks like you have null values in some columns. Add a None check at the beginning of the cleaning_text function and the error will disappear:

if sentence is None:
    return "NA"

Upvotes: 2
