CaeBrida
CaeBrida

Reputation: 3

Concatenate lines with similar key

I'm learning Python, but I don't have lot of experience in programming. I'm trying to build a routine to import a CSV file and iterate each line in which there's an specific key and concatenate those lines in one single line.

Example:

CSV file:

'0001','key1','name'
'0002','key1','age'
'0001','key2','name'
'0002','key2','age'

The resulting file should be:

['0001','key1','name','0002','key1','age']
['0001','key2','name','0002','key2','age']

How can I achieve this?

Upvotes: 0

Views: 119

Answers (3)

JavNoor
JavNoor

Reputation: 402

Assuming that your csv file does not contain the single quotes (and those are only for presentation here) this should work:

import pandas as pd
Data = pd.read_csv('Test.csv',header=None,dtype=str)
Result = Data.groupby(1).apply(lambda y: ','.join([s1 for s2 in y.values for s1 in s2]))
f = open('Result.csv','w')
for r in Result:
   f.write(r+'\n') 
f.close() 

the output is stored in Result.csv

Upvotes: 0

Boa
Boa

Reputation: 2677

Read the CSV:

import csv

with open('my_csv.txt', 'rb') as f:
    my_list = list(csv.reader(f))

At this point, my_list presumably resembles a list of lists, such as the following::

[['0001', 'key1', 'name'], ['0002', 'key1', 'age'], ['0001', 'key2', 'name'], ['0002', 'key2', 'age']]

Create a dict, with each key[number] from the lists corresponding to a key in the dict, and with each value in the dict corresponding to the concatenated lists for a particular key:

dict_of_lists = {}

for item in my_list:
    _, key, _ = item
    if key in dict_of_lists.keys():
        dict_of_lists[key] = dict_of_lists[key] + item
    else:
        dict_of_lists[key] = item

If you don't care about the order of the list items:

dict_of_lists.values()

Output:

[['0001', 'key2', 'name', '0002', 'key2', 'age'], ['0001', 'key1', 'name', '0002', 'key1', 'age']]

If you care about the order:

from collections import OrderedDict
OrderedDict(sorted(dict_of_lists.items())).values()

Output:

[['0001', 'key1', 'name', '0002', 'key1', 'age'], ['0001', 'key2', 'name', '0002', 'key2', 'age']]

Upvotes: 3

Marco Kawajiri
Marco Kawajiri

Reputation: 51

If you can afford to store all entries in RAM, using a defaultdict to create 'buckets' of entries by key would be a way to do this (this assuming a file called 'file.csv'):

from collections import defaultdict

#this defaultdict acts as a Python dictionary, but creates an empty list
# automatically in case the key doesn't exist
entriesByKey = defaultdict(list)

with open("file.csv") as f:
    for line in f.readlines():
        #strips trailing whitespace and splits the line into a list
        # using "," as a separator
        entry = line.rstrip().split(",")
        #the key is the second field in each entry
        key = entry[1]
        #concatenate entry to its respective key 'bucket'
        entriesByKey[key] += entry

#Now, we create a list of concatenated lines by key, sorting them
# so that the keys appear in order
out = [entriesByKey[key] for key in sorted(entriesByKey.keys())]

#pretty-print the output :-)
import pprint
pprint.pprint(out)

The output for this program for your input would be:

[["'0001'", "'key1'", "'name'", "'0002'", "'key1'", "'age'"],
 ["'0001'", "'key2'", "'name'", "'0002'", "'key2'", "'age'"]]

All that's missing would be to strip the single quotes for each entry (and maybe format the output to your liking instead of just using pprint()). If you can guarantee that your input is well-formed and that fields consistently have single-quotes (or, more precisely, that the first and last characters for each field in an entry are never relevant) you may do so by adding the following above the key = entry[1] line:

entry = [field[1:-1] for field in entry]

This will strip the first and last characters of each field.

Upvotes: 1

Related Questions