Math

Reputation: 251

Removing elements which do not satisfy specific requirements

I'm trying to clean data that is stored as strings. More specifically, my dataset looks like this:

Data           Links
link1.com      ['#', 'link1bias', 'bias', 'link12']
href.com.co    ['', 'href1223', 'hreftest']
...

What I would like to have is

Data           Links       Count
link1.com      ['bias']    1
href.com.co    []          2
...

As you can see, I need to clean the lists by removing elements that contain the corresponding word from the Data column (the element must contain the whole word), keeping only words that are not null and not shorter than the length threshold (5 characters), and then count how many full stops are in the link in Data.

For the count, I would use df['Data'].count('.'), but I suspect I need apply for that. For the links, I would build a pattern with join and |. However, I am having trouble removing null elements and words that are too short (based on the threshold of 5). Can you please tell me whether this is a valid approach, or whether there is another way to get the desired output?

Upvotes: 0

Views: 47

Answers (2)

LevB

Reputation: 953

Apply should work well in your case because it gives you full control of the cleanup.

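def clean(row):
    # Split the domain on full stops: 'link1.com' -> ['link1', 'com']
    data_list = row['Data'].split('.')
    lnk = data_list[0]
    # The number of full stops is one less than the number of parts
    row['Count'] = len(data_list) - 1
    # Drop elements that contain the domain word or have 3 characters or fewer
    row['Links'] = [el for el in row['Links']
                    if lnk not in el and len(el) > 3]
    return row

df = df.apply(clean, axis=1)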

print(df)

Output:

          Data   Links  Count
0    link1.com  [bias]      1
1  href.com.co      []      2
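
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and gets the same result without a row-wise apply. The helper name keep_links and its min_len parameter are my own placeholders, and min_len=4 is only an assumption that mirrors the len(el) > 3 check above. Note that the dot has to be escaped because Series.str.count treats its argument as a regular expression.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Data": ["link1.com", "href.com.co"],
    "Links": [["#", "link1bias", "bias", "link12"],
              ["", "href1223", "hreftest"]],
})

# Count literal full stops; str.count takes a regex, so escape the dot
df["Count"] = df["Data"].str.count(r"\.")

def keep_links(data, links, min_len=4):
    """Hypothetical helper: drop short elements and those containing the domain word."""
    word = data.split(".")[0]
    return [el for el in links if word not in el and len(el) >= min_len]

# Build the cleaned lists pairwise from the two columns
df["Links"] = [keep_links(d, l) for d, l in zip(df["Data"], df["Links"])]

print(df)
```

If the threshold really should be 5 characters, pass min_len=5, but be aware that 'bias' would then be dropped as well.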

Upvotes: 2

thejahcoop

Reputation: 170

You want to use regex here. It would be simple enough to define that you only want alphabetical characters and of a certain length. I am not a complete expert, but you will want something like re.findall(r"^\w+", string) and use it in a loop. Count can be done in the same loop.
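
A rough sketch of that idea in plain Python, not a definitive solution. The function name clean_links, the pattern [A-Za-z]{4,}, and the minimum length of 4 are all assumptions on my part, chosen so that 'bias' from the question's expected output survives. I use a character class rather than \w because \w also matches digits and underscores.

```python
import re

# Assumed rule: keep purely alphabetical elements of at least 4 characters
WORD = re.compile(r"[A-Za-z]{4,}")

def clean_links(data, links):
    """Hypothetical helper: return (kept links, number of full stops in data)."""
    word = data.split(".")[0]  # e.g. 'link1' from 'link1.com'
    kept = [el for el in links if WORD.fullmatch(el) and word not in el]
    return kept, data.count(".")

# Rows copied from the question
rows = [
    ("link1.com", ["#", "link1bias", "bias", "link12"]),
    ("href.com.co", ["", "href1223", "hreftest"]),
]

# Filter the links and count the dots in the same loop
results = [clean_links(data, links) for data, links in rows]
for (data, _), (kept, dots) in zip(rows, results):
    print(data, kept, dots)
```

Unlike the length-only check, this also throws away anything containing digits or punctuation, which may or may not be what you want.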

Upvotes: 0
