Reputation: 251
I'm trying to clean data stored as strings. More specifically, my dataset consists of
Data          Links
link1.com     ['#', 'link1bias', 'bias', 'link12']
href.com.co   ['', 'href1223', 'hreftest']
...
What I would like to have is
Data          Links     Count
link1.com     ['bias']  1
href.com.co   []        2
...
As you can see, I need to clean each list by removing elements that contain the corresponding word from the Data column (it should contain at least the whole word), dropping empty strings and words below a length threshold of 5 characters, and then counting how many full stops are in the link in Data.
For the count, I would do df['Data'].count('.'), but I feel that I should use apply for that.
For the links I would use join and |. But I am having some problems removing empty strings and words below the length threshold (5).
Can you please tell me if this is a valid approach, or if there is another way to get the desired output?
Upvotes: 0
Views: 47
Reputation: 953
Apply should work well in your case because it gives you full control of the cleanup.
def clean(row):
    # Split the link on '.'; the first piece is the word to filter on
    data_list = row['Data'].split('.')
    lnk = data_list[0]
    # Number of full stops = number of pieces minus one
    row['Count'] = len(data_list) - 1
    # Keep elements that don't contain the filter word and are long enough
    row['Links'] = [el for el in row['Links']
                    if lnk not in el and len(el) > 3]
    return row

df = df.apply(clean, axis=1)
print(df)
Output:
          Data   Links  Count
0    link1.com  [bias]      1
1  href.com.co      []      2
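If you prefer to avoid apply for the counting step, Series.str.count on an escaped dot is vectorized; only the list cleanup needs a per-row pass, since each row's filter word comes from its own Data value. A sketch, with the sample frame reconstructed from the question:

```python
import pandas as pd

# Sample frame reconstructed from the question's data
df = pd.DataFrame({
    'Data': ['link1.com', 'href.com.co'],
    'Links': [['#', 'link1bias', 'bias', 'link12'],
              ['', 'href1223', 'hreftest']],
})

# Vectorized count of full stops (escape the dot, since it's a regex)
df['Count'] = df['Data'].str.count(r'\.')

# Per-row cleanup: drop elements containing the filter word or too short
df['Links'] = [
    [el for el in links if link.split('.')[0] not in el and len(el) > 3]
    for links, link in zip(df['Links'], df['Data'])
]
print(df)
```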
Upvotes: 2
Reputation: 170
You want to use regex here. It would be simple enough to specify that you only want alphabetical characters of a certain length. I am not a complete expert, but you will want something like re.findall(r"^\w+", string) and use it in a loop. The count can be done in the same loop.
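A minimal sketch of that idea, assuming the question's data; the function name and the length threshold are illustrative, and the pattern uses \w (word characters):

```python
import re

def clean_links(data, links, min_len=4):
    # The filter word is the part of the link before the first dot
    word = data.split('.')[0]
    # Keep tokens made of word characters only, long enough,
    # and not containing the filter word
    kept = [el for el in links
            if re.fullmatch(r'\w+', el)
            and len(el) >= min_len
            and word not in el]
    # Count the full stops in the same pass over the row
    return kept, data.count('.')

links, count = clean_links('link1.com', ['#', 'link1bias', 'bias', 'link12'])
```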
Upvotes: 0