Reputation: 5420
I have a big csv file that I want to edit. The editing here means deleting the columns that have only one value. So far, I wrote this (since I'm a newbie in Python I'm stuck and not sure if that is the right solution for the problem):
import csv
import collections
import numpy as np
number_of_rows = 2432
interseting_cols = []
col_values = collections.defaultdict(list)
col_values_named = collections.defaultdict(list)
new_row = collections.defaultdict(list)
inputFile = open('input.csv', 'r',newline='');
outputFile= open('output.csv','w')
reader = csv.reader(inputFile)
writer = csv.writer(outputFile)
#skip field names
next(reader)
for row in reader:
for col, value in enumerate(row):
col_values[col].append(value)
#each column is now saved col_values ( without the headers )
for i in range(len(col_values)):
if len(set(col_values[i][:(number_of_rows-1)])) != 1:
interseting_cols.append(i)# saved the index of the columns with valid values
inputFile.seek(0)
# reading the file again now with headers
for row in reader:
for col, value in enumerate(row):
col_values_named[col].append(value)# saving the columns now with header
# generating a new CSV file, only with interessting columns :
for i in range(number_of_rows):
print("i value ",i)
for j in range(len(interseting_cols)): # I'm not sure about this PART !!!!
new_row.append(col_values_named[interseting_cols[j]])
writer.writerow(new_row)
Any idea how to do the last loop? Or is there is a better way to solve this?
UPDATE say the file looks like
---------------------------------------------------
|A|B |C |D |F |G|H |I|J |K |
---------------------------------------------------
1 |1|NULL|444 |201|0.01|A|NULL|4|9.5|NULL|
---------------------------------------------------
2 |2|NULL|NULL|201|0 |A|NULL|4|9.5|NULL|
---------------------------------------------------
3 |4|NULL|444 |201|0 |A|NULL|4|9.5|NULL|
---------------------------------------------------
4 |1|NULL|444 |201|0 |A|NULL|4|9.5|NULL|
in this case the result should only include only three columns A,C and F
Upvotes: 2
Views: 1909
Reputation: 8192
Unless the spreadsheet is truly enormous, just read the whole thing in and then find what you want!
Untested code:
headers = reader.next()
sheet = []
for row in reader:
sheet.append(row)
# assuming all rows are the same length ...
for colno,header in enumerate(headers):
col = [ row[colno] for row in sheet ]
distinct = set( col)
if len(distinct) > 1:
# col contains at least two distinct values, so
# do something with it and its header and/or column number
writer.writerow( [header, colno] + col )
Upvotes: 1
Reputation: 362
Using pandas library, you can reduce all your extra work by its own inbuilt functions. Here is a small implementation of the requirement you posted above. If you are a beginner and in need of little more clear explanations, ping me in comment and am ready to give a little more information. By the way, start playing around with pandas.
import pandas as pd
df = pd.read_csv('input.csv')
for columns in df:
if len(df[columns].unique()) == 1:
df.drop(columns, 1, inplace=True)
df.to_csv('output.csv', index=None)
Upvotes: 2