Removing elements which do not satisfy specific requirements

Question

I'm trying to clean data stored in strings. More specifically, my dataset looks like this:

Data           Links
link1.com      ['#', 'link1bias', 'bias', 'link12']
href.com.co    ['', 'href1223', 'hreftest']
...

What I would like to have is

Data           Links       Count
link1.com      ['bias']    1
href.com.co    []          2
...

As you can see, I need to clean the lists by removing elements that contain the corresponding word from the Data column (the element should contain at least the whole word), as well as null entries and very short words (I had a length threshold of 5 characters in mind), and then count how many full stops are in the link in Data.

For the count, I would do df['Data'].count('.'), but I feel that I should use apply for that. For the links, I would use join and |. However, I am having some problems removing null entries and short words (based on the threshold of 5). Can you please tell me if this is a valid approach, or if there is another way to get the desired output?

LevB · Accepted Answer

Apply should work well in your case because it gives you full control of the cleanup.

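def clean(row):
    # Split on full stops: 'link1.com' -> ['link1', 'com']
    data_list = row['Data'].split('.')
    # The part before the first full stop is the word to filter on
    lnk = data_list[0]
    # n parts means n - 1 full stops
    row['Count'] = len(data_list) - 1
    # Keep elements that do not contain the word and are longer than 3 characters
    # (this also drops empty strings; adjust the 3 to change the threshold)
    row['Links'] = [el for el in row['Links'] if
        lnk not in el and len(el) > 3]
    return row

# axis=1 calls clean once per row
df = df.apply(clean, axis=1)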

print(df)

Output:

          Data   Links  Count
0    link1.com  [bias]      1
1  href.com.co      []      2
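
For anyone who wants to try the filtering rule without building a DataFrame first, here is a minimal sketch of the same logic as a plain function. The name clean_links and the min_len parameter are made up for illustration; min_len=4 is an assumption chosen to reproduce the expected output above (it keeps 'bias'), not the threshold of 5 mentioned in the question.

```python
def clean_links(data, links, min_len=4):
    """Return (filtered_links, dot_count) for one row.

    clean_links and min_len are hypothetical names: min_len=4 reproduces
    the expected output (it keeps 'bias'); use 5 for a stricter cut-off.
    """
    # Text before the first full stop, e.g. 'link1' for 'link1.com'.
    key = data.split('.')[0]
    kept = [
        el for el in links
        # Drop None/empty entries, short strings, and anything containing key.
        if el and len(el) >= min_len and key not in el
    ]
    # str.count counts literal full stops in the Data value.
    return kept, data.count('.')


# Sample rows copied from the question.
rows = [
    ('link1.com', ['#', 'link1bias', 'bias', 'link12']),
    ('href.com.co', ['', 'href1223', 'hreftest']),
]
results = [clean_links(data, links) for data, links in rows]
print(results)
```

With pandas, the helper could be called once per row, for example with df.apply(lambda r: clean_links(r['Data'], r['Links']), axis=1). If only the count is needed, df['Data'].str.count(r'\.') should do it without apply; the pattern is a regular expression, so the dot has to be escaped.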
