Reputation: 39
I am reading in a .csv
file of space-delimited data, which contains some unwanted words. I need to find if unwanted words are present in any of columns of a given row, and delete that row.
For example, if unwanted_list = ['one', 'on']
, then for an input .csv
file with columns name class label test;
Input:
ne two 1 five,
on one 2 we.
as we 20 on
cast as none vote
Representative output:
ne two 1 five,
cast as none vote
Upvotes: 0
Views: 3838
Reputation: 450
you may have a look at the csv module documentation, https://docs.python.org/2/library/csv.html
Here is an example code in ipython.
In [1]: import csv
In [2]: f = open('plop.csv')
In [3]: exclude = set(('on', 'one'))
In [4]: reader = csv.reader(f, delimiter=' ')
In [5]: for row in reader:
...: if any(val in exclude for val in row):
...: continue
...: else:
...: print row
...:
['name', 'class', 'label', 'test']
['ne', 'two', '1', 'five,']
['cast', 'as', 'none', 'vote']
Feel free to adapt the script to your needs.
Take care that i did not provide special processinng for the header which can be handled this way. This is not how you should process for very large files since the whole file is read an put into ram.
In [9]: f=open('plop.csv')
In [10]: reader = csv.reader(f.readlines()[1:], delimiter=' ') #skip headers
In [11]: for row in reader:
...: if any(val in exclude for val in row):
...: continue
...: else:
...: print row
...:
['ne', 'two', '1', 'five,']
['cast', 'as', 'none', 'vote']
Upvotes: 0
Reputation: 2085
A simple script using the python set
object should do the trick. This checks that there are no words common to both the set of unwanted words, and the set of words in the line of the input file;
def filter_unwanted_words():
unwanted_words = {'one', 'on'}
with open('input.csv', 'r') as f:
for line in f:
if set(line.split()).isdisjoint(unwanted_words):
yield line
def write_output():
with open('output.csv', 'w') as f:
f.writelines((line for line in filter_unwanted_words()))
if __name__ == '__main__':
write_output()
The output in output.csv
for this is;
ne two 1 five,
cast as none vote
Upvotes: 1