Reputation: 2983
I am working on a script that parses a text file and normalizes it enough to insert it into a DB. The data represents articles written by one or more authors. Because the number of authors is not fixed, I get a variable number of columns in my output text file, e.g.:
author1, author2, author3, this is the title of the article
author1, author2, this is the title of the article
author1, author2, author3, author4, this is the title of the article
These results give me a maximum of 5 columns. So, for the first 2 articles I will need to add blank columns so that every row has the same number of columns. What would be the best way to do this? My input text is tab-delimited, and I can iterate through the fields fairly easily by splitting on the tab.
Upvotes: 0
Views: 194
Reputation: 3326
Assuming you already know the maximum number of columns and have each row split into a list (with all rows collected in an outer list), you can use list.insert(-1, item) to add empty columns. Inserting at index -1 places the new item just before the last element, so the title stays in the final column:
def columnize(mylists, maxcolumns):
    # Pads each row in place until it has maxcolumns entries.
    for i in mylists:
        while len(i) < maxcolumns:
            i.insert(-1, None)
mylists = [["author1","author2","author3","this is the title of the article"],
           ["author1","author2","this is the title of the article"],
           ["author1","author2","author3","author4","this is the title of the article"]]
columnize(mylists, 5)
print(mylists)
[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
Alternative version that leaves your original lists unmodified, using a list comprehension:
def columnize(mylists, maxcolumns):
    # Authors, then the padding, then the title (the last element).
    return [j[:-1] + [None]*(maxcolumns-len(j)) + j[-1:] for j in mylists]
print(columnize(mylists, 5))
[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
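Both versions above take the maximum column count as an argument. If you would rather not hard-code it, it can be derived from the rows themselves. The following is only a sketch of that idea for Python 3; the name pad_rows and the filler parameter are made up for illustration, and it assumes the title is always the last element of each row.

```python
def pad_rows(rows, filler=None):
    """Return new rows padded to the widest row, keeping the title last."""
    # Derive the target width from the data instead of passing it in.
    width = max((len(row) for row in rows), default=0)
    padded = []
    for row in rows:
        missing = width - len(row)
        # Authors first, then the filler, then the title (last element).
        padded.append(row[:-1] + [filler] * missing + row[-1:])
    return padded


rows = [
    ["author1", "author2", "author3", "this is the title of the article"],
    ["author1", "author2", "this is the title of the article"],
    ["author1", "author2", "author3", "author4", "this is the title of the article"],
]
padded = pad_rows(rows)
```

Like the list comprehension version, this builds new lists and leaves rows untouched. Pass filler="" if your DB import expects empty strings rather than None.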
Upvotes: 2
Reputation: 48720
Forgive me if I've misunderstood, but it sounds like you're approaching the problem in a difficult way. It's quite easy to convert your text file into a dictionary that maps each title to a list of its authors:
>>> lines = ["auth1, auth2, auth3, article1", "auth1, auth2, article2","auth1, article3"]
>>> d = dict((x[-1], x[:-1]) for x in [line.split(', ') for line in lines])
>>> d
{'article2': ['auth1', 'auth2'], 'article3': ['auth1'], 'article1': ['auth1', 'auth2', 'auth3']}
>>> total_articles = len(d)
>>> total_articles
3
>>> max_authors = max(len(val) for val in d.values())
>>> max_authors
3
>>> for k, v in d.items():
...     print(k)
...     print(v + [None]*(max_authors-len(v)))
...
article2
['auth1', 'auth2', None]
article3
['auth1', None, None]
article1
['auth1', 'auth2', 'auth3']
Then, if you really want to, you can output this data using the csv module that's built into Python. Or you could directly output the SQL that you're going to need.
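As a rough illustration of the csv route, here is a minimal Python 3 sketch. The articles dictionary mirrors the d built above, and the io.StringIO buffer is only a stand-in for a real output file (which you would open with newline=""). The column order (authors first, title last) is an assumption taken from the question.

```python
import csv
import io

# Title -> list of authors, mirroring the dictionary built above.
articles = {
    "article1": ["auth1", "auth2", "auth3"],
    "article2": ["auth1", "auth2"],
    "article3": ["auth1"],
}
max_authors = max(len(authors) for authors in articles.values())

buffer = io.StringIO()  # stand-in for a real file opened with newline=""
writer = csv.writer(buffer, lineterminator="\n")
for title, authors in articles.items():
    # csv.writer writes None as an empty field, which gives the blank columns.
    padding = [None] * (max_authors - len(authors))
    writer.writerow(authors + padding + [title])

csv_text = buffer.getvalue()
```

Every row ends up with max_authors + 1 fields, so the result can be loaded into a fixed-width table.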
You are opening the same file many times, and reading it many times, just to get counts that you can derive from the data in memory. Please don't read the file multiple times for these purposes.
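A single-pass version might look something like the sketch below (Python 3). The sample text is invented to stand in for your tab-delimited file, and it assumes the last field on each line is the title. With a real file you would pass the open file object to csv.reader instead of the StringIO.

```python
import csv
import io

# Hypothetical sample standing in for the real tab-delimited input file.
sample = io.StringIO(
    "auth1\tauth2\tauth3\tarticle1\n"
    "auth1\tauth2\tarticle2\n"
    "auth1\tarticle3\n"
)

# Read the input exactly once, skipping blank lines.
rows = [row for row in csv.reader(sample, delimiter="\t") if row]

# Every count is derived from the rows already in memory.
total_articles = len(rows)
max_columns = max(len(row) for row in rows)
max_authors = max_columns - 1  # the last column is the title
```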
Upvotes: 1