Reputation: 456
I have a CSV file that contains information about people and all sorts of data that takes up more than 100 columns. There are no headers and my main intention is to grab the peoples' names only. Not the other data related to it. How can I do this?
This is my CSV file --- 'data.csv':
John 12 34 23 48 14 44 94 24 ... #extends till 100
Becky 23 40 93 47 84 43 64 31 ... #extends till 100
Lio 63 90 53 77 14 12 69 20 ... #extends till 100
Following this, let's say I have a list in my code that is populated with lots of names:
names = ['Timothy', 'Joshua', 'Rio', 'Catherine', 'Poorva', 'Gome', 'Lachlan', 'John', 'Lio']
I opened my CSV file in Python and used list comprehension in order to read all the names in the first column and store them in a list with a variable 'people_list' assigned.
Now, for all elements in the people_list, if the element is not seen in the 'names' list, I want to delete that element in my CSV file. In this example, I want to delete Becky since she does not appear in the names list. This is what I tried so far...
Demo -- data.py:
names = ['Timothy', 'Joshua', 'Rio', 'Catherine', 'Poorva', 'Gome', 'Lachlan', 'John', 'Lio']
csv_filename = data.csv
with open(csv_filename, 'r') as readfile:
reader = csv.reader(readfile, delimiter=',')
people_list = [row[0] for row in reader]
for person in people_list:
if person not in names:
id = people_list.index(person) #grab the index of the person in people_list who's not found in the names list.
#using pandas
df = pd.read_csv(csv_filename) #read data.csv file
df.drop(df.index[id], in_place = True) #delete the row id for the person who does not exist in names list.
df.to_csv(csv_filename, index = False, sep=',') #close the csv file with no index
else:
print("This person is found in the names list")
Instead of deleting Becky, all the records in my CSV file were deleted (including Becky). Can someone explain how to do this?
Upvotes: 2
Views: 1153
Reputation: 862661
Add parameter header=None
to read_csv
for default columns 0,1,2
...:
df = pd.read_csv(csv_filename, header=None)
names = ['Timothy', 'Joshua', 'Rio', 'Catherine', 'Poorva', 'Gome', 'Lachlan', 'John', 'Lio']
Then select first column by df[0]
and test membership by Series.isin
and filter by boolean indexing
:
df = df[df[0].isin(names)]
print (df)
Last write to file:
df.to_csv(csv_filename1, index = False, header=None)
Upvotes: 1