Reputation: 3514
Would anyone have any tips for cleaning text data? The data I have is in a list (master_list), and I am trying to create a loop or function that would remove the extra [] symbols as well as every None entry, so that the data in master_list would just be strings separated by a comma.
Any help greatly appreciated.
master_list = [['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.', 'the supply fan is running, the VFD speed output mean value is 94.3.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.', 'the supply fan is running, the VFD speed output mean value is 94.2.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.', 'the supply fan is running, the VFD speed output mean value is 94.1.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.', 'the supply fan is running, the VFD speed output mean value is 94.0.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.', 'the supply fan is running, the VFD speed output mean value is 93.9.'], None]
Upvotes: 2
Views: 363
Reputation: 524
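By "remove extra []" you mean flattening the list. To do this, create a new empty list and add the contents of each sublist to the end of it. In Python, the + operator concatenates lists, and += extends a list in place. Because master_list also contains None entries, which cannot be iterated over, skip them while looping:
new_list = []
for sublist in master_list:
    if sublist is not None:  # list(None) would raise a TypeError
        new_list += list(sublist)  # cast the sublist to a list in case it is not already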
To remove objects you don't want from a list, create a remove_all function that returns a copy of the list without any element equal to a given value:
def remove_all(lst, val):
    return [item for item in lst if item != val]
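Putting the two steps together might look like the sketch below. It assumes the None entries should be dropped before flattening, and `sample` is made-up data standing in for master_list.

```python
def remove_all(lst, val):
    """Return a copy of lst without any element equal to val."""
    return [item for item in lst if item != val]

# Hypothetical sample shaped like master_list: sublists of strings mixed with None.
sample = [["a", "b"], None, ["c", "d"], None]

# Drop the None placeholders first, because None cannot be iterated over.
cleaned = remove_all(sample, None)

# Then flatten by concatenating each remaining sublist onto a new list.
new_list = []
for sublist in cleaned:
    new_list += list(sublist)

print(new_list)  # ['a', 'b', 'c', 'd']
```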
In addition, this Medium article contains more text transformations you might want to make when cleaning your data.
=========================================================================
If there are lists nested inside that list, you'll need to make a recursive flattening function:
def flatten(item):
    if not isinstance(item, list):
        return [item]
    else:
        new_list = []
        for val in item:
            new_list += flatten(val)
        return new_list
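As a quick check of the recursive version, here is a small sketch on made-up nested data (`nested` is hypothetical). Note that flatten keeps None values, so they still have to be filtered out afterwards.

```python
def flatten(item):
    """Recursively flatten nested lists into a single flat list."""
    if not isinstance(item, list):
        return [item]
    new_list = []
    for val in item:
        new_list += flatten(val)
    return new_list

# Hypothetical sample with lists nested two levels deep and some None entries.
nested = [["a", ["b", "c"]], None, [["d"], None]]

flat = flatten(nested)
print(flat)  # ['a', 'b', 'c', None, 'd', None]

# flatten does not remove None, so filter it out in a second pass.
without_none = [item for item in flat if item is not None]
print(without_none)  # ['a', 'b', 'c', 'd']
```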
Upvotes: 0
Reputation: 109716
You want to flatten your list, so [[1, 2], [3, 4]] becomes [1, 2, 3, 4]. One way to do this is via a list comprehension: [x for sublist in my_list for x in sublist].
However, your data also contains None instead of lists, so this needs to be filtered out. In addition, the sublists could also contain None, which would also need to be removed. So [[1, 2], None, [None, 3, ""]] becomes [1, 2, 3].
To do the first part (remove None values where a list is expected), we can effectively replace these Nones with an empty list using the or operator: sublist or []. We can't iterate over None, but we can iterate over an empty list.
To do the second part (remove None values contained in the sublists, together with other "falsy" values such as empty strings or zeroes), we add a condition at the end of the list comprehension: [... if x].
So the final result is:
>>> [x for sublist in master_list for x in sublist or [] if x]
['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.',
'the supply fan is running, the VFD speed output mean value is 94.3.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.',
'the supply fan is running, the VFD speed output mean value is 94.2.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.',
'the supply fan is running, the VFD speed output mean value is 94.1.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.',
'the supply fan is running, the VFD speed output mean value is 94.0.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.',
'the supply fan is running, the VFD speed output mean value is 93.9.']
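The same comprehension can be checked on the small example from above. The second half is a sketch under the assumption that zeroes or empty strings might be meaningful in your data, in which case an explicit None test is safer than relying on truthiness; `with_zero` is made-up data.

```python
my_list = [[1, 2], None, [None, 3, ""]]

# `sublist or []` swaps a None entry for an empty list so it can be iterated;
# the trailing `if x` drops every falsy item (None, "", 0, ...).
result = [x for sublist in my_list for x in sublist or [] if x]
print(result)  # [1, 2, 3]

# Assumption: 0 is a real value here, so only None should be removed.
with_zero = [[0, 1], None, [None, 2]]
keep_falsy = [x for sublist in with_zero for x in sublist or [] if x is not None]
print(keep_falsy)  # [0, 1, 2]
```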
Upvotes: 1
Reputation: 3907
List comprehension for the win.
master_list = [['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.', 'the supply fan is running, the VFD speed output mean value is 94.3.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.', 'the supply fan is running, the VFD speed output mean value is 94.2.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.', 'the supply fan is running, the VFD speed output mean value is 94.1.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.', 'the supply fan is running, the VFD speed output mean value is 94.0.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.', 'the supply fan is running, the VFD speed output mean value is 93.9.'], None]
master_list = [i for x in master_list if x for i in x]
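A small sketch of what the `if x` filter does, using made-up data (`sample` is hypothetical): it skips None entries and empty sublists, but it does not look inside the sublists, so a None nested in a sublist is kept.

```python
# Hypothetical sample: a normal sublist, a None, an empty sublist,
# and a sublist that itself contains a None.
sample = [["a", "b"], None, [], ["c", None]]

# `if x` filters the outer entries only; inner items pass through untouched.
flat = [i for x in sample if x for i in x]
print(flat)  # ['a', 'b', 'c', None]
```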
Upvotes: 0
Reputation: 653
It looks like you are asking for a flattened list instead of a list containing lists. At the same time, you want the None objects removed. The flattening of the list can be done using a method described in this answer. Now, you just have to add an if clause in the middle of the comprehension.
master_list = [x for sublist in master_list if sublist is not None for x in sublist]
Output:
['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.',
'the supply fan is running, the VFD speed output mean value is 94.3.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.',
'the supply fan is running, the VFD speed output mean value is 94.2.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.',
'the supply fan is running, the VFD speed output mean value is 94.1.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.',
'the supply fan is running, the VFD speed output mean value is 94.0.',
'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.',
'the supply fan is running, the VFD speed output mean value is 93.9.']
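If this is needed in more than one place, the comprehension could be wrapped in a small helper. This is only a sketch: the name `flatten_skip_none` and the `sample` data are made up, and the final join assumes a single comma-separated string is wanted rather than a flat list.

```python
def flatten_skip_none(nested):
    """Flatten one level of nesting, skipping entries that are None."""
    return [x for sublist in nested if sublist is not None for x in sublist]

# Hypothetical sample shaped like master_list.
sample = [["a", "b"], None, ["c"], None]

flat = flatten_skip_none(sample)
print(flat)  # ['a', 'b', 'c']

# Optional: join the strings into one comma-separated string.
print(", ".join(flat))  # a, b, c
```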
Upvotes: 0