I have a flat list of strings I've scraped, alternating between field titles and their values. I'd like to chunk the strings into groups (one per record) and then reshape them into columnar data. Not every group contains every field title, however.
My list is called complist and looks like this:
[u'Intake Received Date:',
u'9/11/2012',
u'Intake ID:',
u'CA00325127',
u'Allegation Category:',
u'Infection Control',
u'Investigation Finding:',
u'Substantiated',
u'Intake Received Date:',
u'5/14/2012',
u'Intake ID:',
u'CA00310421',
u'Allegation Category:',
u'Quality of Care/Treatment',
u'Investigation Finding:',
u'Substantiated',
u'Intake Received Date:',
u'8/15/2011',
u'Intake ID:',
u'CA00279396',
u'Allegation Category:',
u'Quality of Care/Treatment',
u'Sub Categories:',
u'Screening',
u'Investigation Finding:',
u'Unsubstantiated',]
And my goal is to make it look like this:
'Intake Received Date', 'Intake ID', 'Allegation Category', 'Sub Categories', 'Investigation Finding'
'9/11/2012', 'CA00325127', 'Infection Control', '', 'Substantiated'
'5/14/2012', 'CA00310421', 'Quality of Care/Treatment', '', 'Substantiated'
'8/15/2011', 'CA00279396', 'Quality of Care/Treatment', 'Screening', 'Unsubstantiated'
The first thing I did was break the list into chunks, using the starting element Intake Received Date as the separator:
import re
from itertools import groupby

compgroup = []
for k, g in groupby(complist, key=lambda x: re.search(r'Intake Received Date', x)):
    if not k:
        compgroup.append(list(g))

# Intake Received Date was removed, so insert it back at the beginning of each list:
for c in compgroup:
    c.insert(0, u'Intake Received Date')

# Create a list of dicts mapping each title to the data element that follows it:
dic = []
for c in compgroup:
    dic.append(dict(zip(*[iter(c)]*2)))
The next step would be to convert the list of dicts into columnar data, but at this point I feel my approach is overly complicated and that I must be missing something more elegant. I'd appreciate any guidance.
Upvotes: 2
Views: 136
Reputation: 104111
Given:
data=[u'Intake Received Date:',
u'9/11/2012',
u'Intake ID:',
u'CA00325127',
u'Allegation Category:',
u'Infection Control',
u'Investigation Finding:',
u'Substantiated',
u'Intake Received Date:',
u'5/14/2012',
u'Intake ID:',
u'CA00310421',
u'Allegation Category:',
u'Quality of Care/Treatment',
u'Investigation Finding:',
u'Substantiated',
u'Intake Received Date:',
u'8/15/2011',
u'Intake ID:',
u'CA00279396',
u'Allegation Category:',
u'Quality of Care/Treatment',
u'Sub Categories:',
u'Screening',
u'Investigation Finding:',
u'Unsubstantiated',]
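Your method is actually pretty good; I only edited it a bit. You don't need a regex, since a plain equality test against the separator string is enough, and you don't need a separate loop to reinsert Intake Received Date, since it can be prepended as each group is collected.
Try: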
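from itertools import groupby

headers = ['Intake Received Date:', 'Intake ID:', 'Allegation Category:', 'Sub Categories:', 'Investigation Finding:']
sep = 'Intake Received Date:'

# The separator forms its own group (k is True) and is skipped,
# so prepend it to each chunk of items that follows it.
compgroup = []
for k, g in groupby(data, key=lambda x: x == sep):
    if not k:
        compgroup.append([sep] + list(g))

# Header row: strip the trailing colon from each title.
print(', '.join(h[:-1] for h in headers))

# Pair up title/value items into one dict per group, then emit one row per dict.
for di in [dict(zip(*[iter(c)]*2)) for c in compgroup]:
    line = []
    for h in headers:
        try:
            line.append(di[h])
        except KeyError:
            line.append('*')  # placeholder for a field missing from this group
    print(', '.join(line))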
Prints:
Intake Received Date, Intake ID, Allegation Category, Sub Categories, Investigation Finding
9/11/2012, CA00325127, Infection Control, *, Substantiated
5/14/2012, CA00310421, Quality of Care/Treatment, *, Substantiated
8/15/2011, CA00279396, Quality of Care/Treatment, Screening, Unsubstantiated
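If you want real CSV output with empty strings for the missing fields (as in the question's target layout), here is a rough alternative sketch that skips groupby altogether. It assumes Python 3 and that the list strictly alternates title, value. The names records_from_pairs and to_csv are made up for illustration, and the sample list is a shortened version of your data, not the full thing.

```python
import csv
import io


def records_from_pairs(items, first_key):
    """Group a flat [title, value, title, value, ...] list into dicts.

    A new record starts whenever `first_key` shows up as a title.
    """
    records = []
    it = iter(items)
    for title, value in zip(it, it):  # walk the list two items at a time
        key = title.rstrip(':')       # drop the trailing colon from the title
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records


def to_csv(records, headers, missing=''):
    """Render the records as CSV text, filling absent fields with `missing`."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=headers, restval=missing,
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Shortened sample (hypothetical subset of the scraped list).
sample = [
    'Intake Received Date:', '5/14/2012',
    'Intake ID:', 'CA00310421',
    'Investigation Finding:', 'Substantiated',
    'Intake Received Date:', '8/15/2011',
    'Intake ID:', 'CA00279396',
    'Sub Categories:', 'Screening',
    'Investigation Finding:', 'Unsubstantiated',
]
headers = ['Intake Received Date', 'Intake ID', 'Sub Categories',
           'Investigation Finding']

records = records_from_pairs(sample, 'Intake Received Date')
print(to_csv(records, headers))
```

The restval argument of csv.DictWriter is what fills in the fields a record lacks, so there is no need for the try/except. Note that DictWriter raises ValueError by default if a record has a key that is not in headers, so list every title you expect to see.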
Upvotes: 1