Reputation: 3
I'm learning Python, but I don't have lot of experience in programming. I'm trying to build a routine to import a CSV file and iterate each line in which there's an specific key and concatenate those lines in one single line.
Example:
CSV file:
'0001','key1','name'
'0002','key1','age'
'0001','key2','name'
'0002','key2','age'
The resulting file should be:
['0001','key1','name','0002','key1','age']
['0001','key2','name','0002','key2','age']
How can I achieve this?
Upvotes: 0
Views: 119
Reputation: 402
Assuming that your csv file does not contain the single quotes (and those are only for presentation here) this should work:
import pandas as pd
Data = pd.read_csv('Test.csv',header=None,dtype=str)
Result = Data.groupby(1).apply(lambda y: ','.join([s1 for s2 in y.values for s1 in s2]))
f = open('Result.csv','w')
for r in Result:
f.write(r+'\n')
f.close()
the output is stored in Result.csv
Upvotes: 0
Reputation: 2677
Read the CSV:
import csv
with open('my_csv.txt', 'rb') as f:
my_list = list(csv.reader(f))
At this point, my_list presumably resembles a list of lists, such as the following::
[['0001', 'key1', 'name'], ['0002', 'key1', 'age'], ['0001', 'key2', 'name'], ['0002', 'key2', 'age']]
Create a dict, with each key[number] from the lists corresponding to a key in the dict, and with each value in the dict corresponding to the concatenated lists for a particular key:
dict_of_lists = {}
for item in my_list:
_, key, _ = item
if key in dict_of_lists.keys():
dict_of_lists[key] = dict_of_lists[key] + item
else:
dict_of_lists[key] = item
If you don't care about the order of the list items:
dict_of_lists.values()
Output:
[['0001', 'key2', 'name', '0002', 'key2', 'age'], ['0001', 'key1', 'name', '0002', 'key1', 'age']]
If you care about the order:
from collections import OrderedDict
OrderedDict(sorted(dict_of_lists.items())).values()
Output:
[['0001', 'key1', 'name', '0002', 'key1', 'age'], ['0001', 'key2', 'name', '0002', 'key2', 'age']]
Upvotes: 3
Reputation: 51
If you can afford to store all entries in RAM, using a defaultdict to create 'buckets' of entries by key would be a way to do this (this assuming a file called 'file.csv'):
from collections import defaultdict
#this defaultdict acts as a Python dictionary, but creates an empty list
# automatically in case the key doesn't exist
entriesByKey = defaultdict(list)
with open("file.csv") as f:
for line in f.readlines():
#strips trailing whitespace and splits the line into a list
# using "," as a separator
entry = line.rstrip().split(",")
#the key is the second field in each entry
key = entry[1]
#concatenate entry to its respective key 'bucket'
entriesByKey[key] += entry
#Now, we create a list of concatenated lines by key, sorting them
# so that the keys appear in order
out = [entriesByKey[key] for key in sorted(entriesByKey.keys())]
#pretty-print the output :-)
import pprint
pprint.pprint(out)
The output for this program for your input would be:
[["'0001'", "'key1'", "'name'", "'0002'", "'key1'", "'age'"],
["'0001'", "'key2'", "'name'", "'0002'", "'key2'", "'age'"]]
All that's missing would be to strip the single quotes for each entry (and maybe format the output to your liking instead of just using pprint()
). If you can guarantee that your input is well-formed and that fields consistently have single-quotes (or, more precisely, that the first and last characters for each field in an entry are never relevant) you may do so by adding the following above the key = entry[1]
line:
entry = [field[1:-1] for field in entry]
This will strip the first and last characters of each field.
Upvotes: 1