user1926173

Transform Python list into columnar data

I have a list of strings I've scraped, and I'd like to chunk the strings into groups and then reshape them into columnar data. Not every group contains every field title, however.

My list is called complist and looks like this:

[u'Intake Received Date:',
 u'9/11/2012',
 u'Intake ID:',
 u'CA00325127',
 u'Allegation Category:',
 u'Infection Control',
 u'Investigation Finding:',
 u'Substantiated',
 u'Intake Received Date:',
 u'5/14/2012',
 u'Intake ID:',
 u'CA00310421',
 u'Allegation Category:',
 u'Quality of Care/Treatment',
 u'Investigation Finding:',
 u'Substantiated',
 u'Intake Received Date:',
 u'8/15/2011',
 u'Intake ID:',
 u'CA00279396',
 u'Allegation Category:',
 u'Quality of Care/Treatment',
 u'Sub Categories:',
 u'Screening',
 u'Investigation Finding:',
 u'Unsubstantiated',]

And my goal is to make it look like this:

'Intake Received Date', 'Intake ID', 'Allegation Category', 'Sub Categories', 'Investigation Finding'
'9/11/2012', 'CA00325127', 'Infection Control', '', 'Substantiated'
'5/14/2012', 'CA00310421', 'Quality of Care/Treatment', '', 'Substantiated'
'8/15/2011', 'CA00279396', 'Quality of Care/Treatment', 'Screening', 'Unsubstantiated'

The first thing I did was break the list into chunks based on the starting element, Intake Received Date:

import re
from itertools import groupby

compgroup = []
for k, g in groupby(complist, key=lambda x: re.search(r'Intake Received Date', x)):
    if not k:
        compgroup.append(list(g))

# The title Intake Received Date: was dropped by the grouping, so insert it back at the start of each list:
for c in compgroup:
    c.insert(0, u'Intake Received Date:')

# Create a list of dicts mapping each title to the value that follows it:
dic = []
for c in compgroup:
    dic.append(dict(zip(*[iter(c)]*2)))

The next step would be to convert the list of dicts into columnar data, but at this point I feel my approach is overly complicated and that I must be missing something more elegant. I'd appreciate any guidance.

Upvotes: 2

Views: 136

Answers (1)

dawg

Reputation: 104111

Given:

data=[u'Intake Received Date:',
 u'9/11/2012',
 u'Intake ID:',
 u'CA00325127',
 u'Allegation Category:',
 u'Infection Control',
 u'Investigation Finding:',
 u'Substantiated',
 u'Intake Received Date:',
 u'5/14/2012',
 u'Intake ID:',
 u'CA00310421',
 u'Allegation Category:',
 u'Quality of Care/Treatment',
 u'Investigation Finding:',
 u'Substantiated',
 u'Intake Received Date:',
 u'8/15/2011',
 u'Intake ID:',
 u'CA00279396',
 u'Allegation Category:',
 u'Quality of Care/Treatment',
 u'Sub Categories:',
 u'Screening',
 u'Investigation Finding:',
 u'Unsubstantiated',]

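Your method is actually pretty good; I only edited it a bit. You don't need a regex, because a plain equality test against the separator string works, and you don't need a separate loop to reinsert Intake Received Date: since it can be prepended as each group is built.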
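To see why the equality test is enough, here is a small sketch of what groupby yields with that key. It uses a shortened copy of the list, and the names sample and runs are only for illustration. The runs alternate between the separator on its own (key True) and the items that follow it (key False), which is why only the False groups are kept.

```python
from itertools import groupby

sep = 'Intake Received Date:'
# Shortened copy of the scraped list, for illustration only.
sample = [sep, '9/11/2012', 'Intake ID:', 'CA00325127',
          sep, '5/14/2012', 'Intake ID:', 'CA00310421']

# groupby starts a new group each time the key value changes, so the key
# flips between True (the separator) and False (everything after it).
runs = [(is_sep, list(group))
        for is_sep, group in groupby(sample, key=lambda x: x == sep)]

for is_sep, items in runs:
    print('{} {}'.format(is_sep, items))
```

Because the separator ends up in its own group, it has to be put back at the front of each kept group before pairing titles with values.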

Try:

from itertools import groupby

headers = ['Intake Received Date:', 'Intake ID:', 'Allegation Category:', 'Sub Categories:', 'Investigation Finding:']
sep = 'Intake Received Date:'
compgroup = []
for k, g in groupby(data, key=lambda x: x == sep):
    if not k:
        # Prepend the separator so every group starts with its title.
        compgroup.append([sep] + list(g))

# Header row: strip the trailing colon from each title.
print(', '.join(h.rstrip(':') for h in headers))

for di in [dict(zip(*[iter(c)]*2)) for c in compgroup]:
    # Pull each field in header order; '*' marks a field missing from this group.
    print(', '.join(di.get(h, '*') for h in headers))

Prints:

Intake Received Date, Intake ID, Allegation Category, Sub Categories, Investigation Finding
9/11/2012, CA00325127, Infection Control, *, Substantiated
5/14/2012, CA00310421, Quality of Care/Treatment, *, Substantiated
8/15/2011, CA00279396, Quality of Care/Treatment, Screening, Unsubstantiated   
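If the end goal is a real CSV rather than printed lines, one possible follow-up is to hand the dicts to csv.DictWriter, whose restval argument fills in any field a group lacks. This is only a sketch under a few assumptions: it targets Python 3 (io.StringIO stands in for an output file), it uses a shortened copy of the list, and the names records, fieldnames and csv_text are my own.

```python
import csv
import io
from itertools import groupby

sep = 'Intake Received Date:'
headers = [sep, 'Intake ID:', 'Allegation Category:',
           'Sub Categories:', 'Investigation Finding:']
# Shortened copy of the scraped list, for illustration only.
data = [sep, '5/14/2012', 'Intake ID:', 'CA00310421',
        'Allegation Category:', 'Quality of Care/Treatment',
        'Investigation Finding:', 'Substantiated',
        sep, '8/15/2011', 'Intake ID:', 'CA00279396',
        'Allegation Category:', 'Quality of Care/Treatment',
        'Sub Categories:', 'Screening',
        'Investigation Finding:', 'Unsubstantiated']

records = []
for is_sep, group in groupby(data, key=lambda x: x == sep):
    if not is_sep:
        items = [sep] + list(group)
        # Pair each title (even positions) with its value (odd positions),
        # dropping the trailing colon from the title.
        records.append({title.rstrip(':'): value
                        for title, value in zip(items[::2], items[1::2])})

fieldnames = [h.rstrip(':') for h in headers]
out = io.StringIO()  # stand-in for open('some_file.csv', 'w', newline='')
# restval='' writes an empty cell for any field missing from a record.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval='',
                        lineterminator='\n')
writer.writeheader()
for rec in records:
    writer.writerow(rec)

csv_text = out.getvalue()
print(csv_text)
```

Using restval='' gives the empty cells the question asked for, instead of the * placeholder used above.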

Upvotes: 1
