Reputation: 69
I'm attempting to write a program that enters a directory full of CSV files (all with the same layout but different data), reads the files, and writes all the data in the specific columns to a new CSV file. I would also like it to miss out the entire row of data is there is a blank space in one of the columns (in this case, if there is a gap in the Name column).
The program works fine in writing in specific columns (in this case Name and Location) from the old CSV files to the new one, however, I am unsure as to how I would miss out a line if there was a blank space.
import nltk
import csv
from nltk.corpus import PlaintextCorpusReader
root = '/Users/bennaylor/local documents/humanotics'
incorpus = root + '/chats/input/'
outcorpus =root + '/chats/output.csv'
doccorpus = PlaintextCorpusReader(incorpus, '.*\.csv')
filelist = doccorpus.fileids()
with open(outcorpus, 'w', newline='') as fw:
fieldnames = (['Name','Location'])
writer = csv.DictWriter(fw, fieldnames=fieldnames)
writer.writeheader()
print('Writing Headers!!')
for rawdoc in filelist:
infile = incorpus + rawdoc
with open(infile, encoding='utf-8') as fr:
reader = csv.DictReader(fr)
for row in reader:
rowName = row['Name']
rowLocation = row['Location']
writer.writerow({'Name': rowName, 'Location': rowLocation})
An example CSV input file would look like this:
Name,Age,Location,Birth Month
Steve,14,London,November
,18,Sydney,April
Matt,12,New York,June
Jeff,20,,December
Jonty,19,Greenland,July
With gaps in the Name column on the third row, and Location column on the fifth. In this case, I would like the program to miss out the third row when writing the data to a new CSV as there is a gap in the Name column
Thanks in advance for any help.
Upvotes: 0
Views: 965
Reputation: 866
This is easy to do using pandas:
import pandas as pd
import os
# Create an empty data frame
df = pd.DataFrame()
# Add the data from all the files into the data frame
for filename in filelist:
data = pd.read_csv(os.path.join(incorpus, filename))
df = df.append(data)
# Drop rows with any empty values
df = df.dropna()
# Keep only the needed columns
df = df.reindex(columns=['Name', 'Location'])
# Write the dataframe to a .csv file
df.to_csv(outcorpus)
Upvotes: 2