Reputation: 23
I have a csv file which has a set of records. One of the columns is a multiline string. I need to split these lines and duplicate the main list. For example, if the file contains the following records.
Row 1 - John, 2501, Admin/Partner/HR, TX
Row 2 - Max, 3489, Accountant/Admin/HR, MN
where Admin/Partner/HR
and Accountant/Admin/HR
are multiline strings.
I need the output as below:
Row 1 - John, 2501, Admin, TX
Row 2 - John, 2501, Partner, TX
Row 3 - John, 2501, HR, TX
Row 4 - Max, 3489, Accountant, MN
Row 5 - Max, 3489, Admin, MN
Row 6 - Max, 3489, HR, MN
I tried the following function
def splitrows(list_old,col):
count = 0;
array_new = []
for row in list_old:
splitArray = []
splitarray = row[col].split('\n\n')
if(len(splitarray)>0):
for eachLine in splitarray:
row[col] = eachLine
array_new.append(row)
return array_new
If I use append, only the last row gets repeated. If I use array_new+=row, then each column is split and added to the list but I don't want this to happen.
Thanks in advance!
Upvotes: 1
Views: 207
Reputation: 6458
@Reut Sharabani's answer is better as far as using the csv library is much cleaner than trying to roll your own. With that said, I'd also like to explain why you're seeing what you're seeing.
First off, you have a bogus line that had me concerned you had a case issue... splitArray = []
creates an empty array named splitArray
; your next line creates a new array named splitarray
and in python, that's a completely different variable.
More importantly, when you do row[col] = eachLine
it changes the row value and then when you append it, it really only appends a reference to row
and not a copy, such that when you change row
, you change it for each version you appended to your array_new
. The easiest way to fix this (which is also what Reut's answer is doing) is to do:
def splitrows(list_old,col):
count = 0;
array_new = []
for row in list_old:
splitarray = row[col].split('\n\n')
if(len(splitarray)>0):
for eachLine in splitarray:
newrow = row[:] # create a copy of the row
newrow[col] = eachLine
array_new.append(newrow)
return array_new
Upvotes: 0
Reputation: 31339
Since what you're parsing is (almost) a csv. You could use the csv module
:
import csv
with open("test.csv") as f:
# create a reader
reader = csv.reader(f)
for row in reader:
# create a new row for each item in column 2
for item in row[2].split('/'):
# align items
row[2] = ' ' + item.lstrip()
print ','.join(row)
Output:
John, 2501, Admin, TX
John, 2501, Partner, TX
John, 2501, HR, TX
Max, 3489, Accountant, MN
Max, 3489, Admin, MN
Max, 3489, HR, MN
Upvotes: 2
Reputation: 10213
Algo:
csv
module to read and write csv file./
i.e. multiple values the do step 5,6,7
and If not then do step 8
.split()
method.for
loop.Input: input.csv
John, 2501, Admin/Partner/HR, TX
Max, 3489, Accountant/Admin/HR, MN
Code:
import csv
input_file = "input.csv"
output_file = "output.csv"
#- Open File and read by csv module.
with open(input_file, ) as fp:
root = csv.reader(fp)
#- Open File in write mode and write by csv module.
with open(output_file, 'wb') as fp2:
writer_root = csv.writer(fp2, delimiter=',')
#- Iterate every row form the Input file.
for row in root:
#- check if second item from row contains / i.e. multiple values.
if "/" in row[2]:
#- Split by /
tmp = row[2].split("/")
for i in tmp:
new_row = [row[0], row[1], i, row[3]]
print "Debug 1: ", new_row
#- Write Row in New file.
writer_root.writerow(new_row)
else:
writer_root.writerow(row)
Output:
Debug statements:
$ python 5.py
Debug 1: ['John', ' 2501', ' Admin', ' TX']
Debug 1: ['John', ' 2501', 'Partner', ' TX']
Debug 1: ['John', ' 2501', 'HR', ' TX']
Debug 1: ['Max', ' 3489', ' Accountant', ' MN']
Debug 1: ['Max', ' 3489', 'Admin', ' MN']
Debug 1: ['Max', ' 3489', 'HR', ' MN']
Output File: output.csv
John, 2501, Admin, TX
John, 2501,Partner, TX
John, 2501,HR, TX
Max, 3489, Accountant, MN
Max, 3489,Admin, MN
Max, 3489,HR, MN
Upvotes: 0