Reputation: 6018
I have a hypothetical example here with file attached (Excel File Link) where I'm loading in a file from excel and formatting it into something I can work with to either analyse or store more permanently.
In R I would use the following few lines to make it usable:
library(gdata)
tmp = read.xls('~/Desktop/sample.xlsx',1,stringsAsFactors=F)
tmp = tmp[,!sapply(tmp,function(x)all(is.na(x)))]
tmp$date = tmp[1,2]
for(i in 2:ncol(tmp)){ifelse(tmp[2,i]=='',tmp[2,i]<-tmp[2,i-1],tmp[2,i]<-tmp[2,i])}
tmp[2,1]=tmp[1,1]
names(tmp)=paste(tmp[2,],tmp[3,],sep='.')
tmp=tmp[-c(1:3),]
names(tmp)[10]='date'
In python - I've gotten as far as
import xlrd
myf='/home/me/Desktop/sample.xlsx'
sheet = xlrd.open_workbook(myf).sheet_by_index(0)
s1=[]
r = sheet.nrows
c = sheet.ncols
for row in range(0,r):
t1=[]
for col in range(0,c):
t1.append(sheet.cell_value(row,col))
s1.append(t1)
but then I've had very little success getting rid of empty rows and columns. All of the following fail.
>>> s1[0]
['', '', '', '', '', '', '', '', '', '', '', '']
>>> s1[0] == []
False
>>> s1[0] == ''
False
>>> s1[0] == all('')
False
so I'm not even too clear how I check that the entire list is empty.
I can zip the rows 2 & 3 (5 & 6 in python)
>>> zip(s1[5],s1[6])
[('', ''), (u'cat1', u'a'), ('', u'b'), ('', ''), (u'cat2', u'c'), ('', u'd'), ('', ''), (u'cat3', u'e'), ('', u'f'), ('', ''), (u'cat4', u'g'), ('', u'h')]
but I don't know how I'd get that to paste forwards in a for-next loop.
Very n00b question reflective of my understanding of python as is. Any thoughts would be most welcome. Submitted with some trepidation because I recognize the question has a 'homework' feel even though it is actually a personal learning exercise. Thanks
after a bit of messing about I've worked up a rough and ready working example below: Would be grateful for pointers on how to do it more efficiently.
I had a try at pandas but found the learning curve quite steep. If someone could post a working MWE I'd be pleased to mark it answered.
import os
import xlrd
import pandas as pd
import pprint
import re
import csv
'''
Create a few helper functions to save time
finding things, picking empties and selecting items
'''
def nulls(x):
g = lambda r: all(i == '' for i in r)
out = [i for i,j in enumerate(x) if g(j)]
return(out)
def fill(x):
for i in range(1,len(x)):
if x[i] == '':
x[i] = x[i-1]
return(x)
def which(x,y):
out = [i for i,j in enumerate(x) if y(j) ]
return(out)
def T(x):
out = map(list,zip(*x))
return(out)
def rbind(x,y,sep=None):
if sep is None:
sep='.'
out = [str(i[0]) + sep + str(i[1]) for i in zip(x,y)]
return(out)
# def csvit(x):
# tmp = next(key for key,val in globals().items() if val == x and not key.startswith('_'))
# f=open('/home/me/Desktop/csvs/'+tmp+'.csv','wb')
# writer = csv.writer(f,quotechar='"', quoting=csv.QUOTE_ALL,dialect=csv.excel)
# [writer.writerow(i) for i in x]
# f.close()
#
# Load spreadsheet from file and convert to python list
#
sheet = xlrd.open_workbook('/home/me/Downloads/sample.xlsx').sheet_by_index(0)
s = [sheet.row_values(i) for i in range(sheet.nrows)]
#
# Get rid of unnecessary excel formatting and spacing
#
# rows first
s = [j for i,j in enumerate(s) if i not in nulls(s)]
# transpose & then columns (surely there is a more elegant way?)
s = T(s)
s = [j for i,j in enumerate(s) if i not in nulls(s)]
# get title for primary category column
title = s[0][0]
# get date for secondary category column
date = [j[1] for j in s if str(j[0]) == 'date']
#
# combine columns into a single category variable (could also have left them separate)
#
cols=['Category']
s = T(s)
cols.extend(rbind(fill(s[2]),s[3])[1:])
s = s[4:len(s)]
s=T(s)
category = [str(i) for i in s[0]]
s=s[1:len(s)]
c1=[date for i in range(len(s[0]))] #create date column
c2=[title for i in range(len(s[0]))] #create title column
cols.insert(0,'title')
cols.insert(1,'date')
s.insert(0,c2)
s.insert(1,c1)
s = T(s)
Upvotes: 1
Views: 1165
Reputation: 567
This is just a suggestion: if it's possible that you can export your excel worksheet as a csv, you might want to have a look at numpy.genfromtxt
: http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
It seems to have similar capabilities to pandas but without the steep learning curve. has delimiter
, autostrip
, missing_values
, filling_values
, column names
, and the columns are in numpy.array
form.
Upvotes: 1