Split multiline string and append to a new list

I have a csv file which has a set of records. One of the columns is a multiline string. I need to split these lines and duplicate the main list. For example, if the file contains the following records.

Row 1 - John, 2501, Admin/Partner/HR, TX
Row 2 - Max, 3489, Accountant/Admin/HR, MN

where Admin/Partner/HR and Accountant/Admin/HR are multiline strings. I need the output as below:

Row 1 - John, 2501, Admin, TX
Row 2 - John, 2501, Partner, TX
Row 3 - John, 2501, HR, TX
Row 4 - Max, 3489, Accountant, MN
Row 5 - Max, 3489, Admin, MN
Row 6 - Max, 3489, HR, MN

I tried the following function

def splitrows(list_old,col):
  count = 0;
  array_new = []
  for row in list_old:
    splitArray = []
    splitarray = row[col].split('\n\n')
    if(len(splitarray)>0):
      for eachLine in splitarray:
        row[col] = eachLine
        array_new.append(row)


  return array_new

If I use append, only the last row gets repeated. If I use array_new+=row, then each column is split and added to the list but I don't want this to happen.

Thanks in advance!

Upvotes: 1

Views: 207

Answers (3)

Foon
Foon

Reputation: 6458

@Reut Sharabani's answer is better as far as using the csv library is much cleaner than trying to roll your own. With that said, I'd also like to explain why you're seeing what you're seeing.

First off, you have a bogus line that had me concerned you had a case issue... splitArray = [] creates an empty array named splitArray; your next line creates a new array named splitarray and in python, that's a completely different variable.

More importantly, when you do row[col] = eachLine it changes the row value and then when you append it, it really only appends a reference to row and not a copy, such that when you change row, you change it for each version you appended to your array_new. The easiest way to fix this (which is also what Reut's answer is doing) is to do:

def splitrows(list_old,col):
  count = 0;
  array_new = []
  for row in list_old:
    splitarray = row[col].split('\n\n')
    if(len(splitarray)>0):
      for eachLine in splitarray:
        newrow = row[:] # create a copy of the row
        newrow[col] = eachLine
        array_new.append(newrow)


  return array_new

Upvotes: 0

Reut Sharabani
Reut Sharabani

Reputation: 31339

Since what you're parsing is (almost) a csv. You could use the csv module:

import csv

with open("test.csv") as f:
    # create a reader
    reader = csv.reader(f)
    for row in reader:
        # create a new row for each item in column 2
        for item in row[2].split('/'):
            # align items
            row[2] = ' ' + item.lstrip()
            print ','.join(row)

Output:

John, 2501, Admin, TX
John, 2501, Partner, TX
John, 2501, HR, TX
Max, 3489, Accountant, MN
Max, 3489, Admin, MN
Max, 3489, HR, MN

Upvotes: 2

Vivek Sable
Vivek Sable

Reputation: 10213

Algo:

  1. Use csv module to read and write csv file.
  2. Read input file and Open output file in write mode.
  3. Iterate every row from the Input file.
  4. If second item from row contains / i.e. multiple values the do step 5,6,7 and If not then do step 8.
  5. Split string by split() method.
  6. Iterate every split value by for loop.
  7. Create new row according to split value and write into output file.
  8. Write row into output file.

Input: input.csv

John, 2501, Admin/Partner/HR, TX
Max, 3489, Accountant/Admin/HR, MN

Code:

import csv

input_file = "input.csv"
output_file = "output.csv"

#- Open File and read by csv module.
with open(input_file, ) as fp:
    root = csv.reader(fp)
    #- Open File in write mode and write by csv module.
    with open(output_file, 'wb') as fp2:
        writer_root = csv.writer(fp2, delimiter=',')

        #- Iterate every row form the Input file.
        for row in root:
            #- check if second item from row contains / i.e. multiple values.
            if "/" in row[2]:
                #- Split by /
                tmp = row[2].split("/")
                for i in tmp:
                    new_row = [row[0], row[1], i, row[3]]
                    print "Debug 1: ", new_row
                    #- Write Row in New file.
                    writer_root.writerow(new_row)
            else:
                    writer_root.writerow(row)

Output:

Debug statements:

$ python 5.py 
Debug 1:  ['John', ' 2501', ' Admin', ' TX']
Debug 1:  ['John', ' 2501', 'Partner', ' TX']
Debug 1:  ['John', ' 2501', 'HR', ' TX']
Debug 1:  ['Max', ' 3489', ' Accountant', ' MN']
Debug 1:  ['Max', ' 3489', 'Admin', ' MN']
Debug 1:  ['Max', ' 3489', 'HR', ' MN']

Output File: output.csv

John, 2501, Admin, TX
John, 2501,Partner, TX
John, 2501,HR, TX
Max, 3489, Accountant, MN
Max, 3489,Admin, MN
Max, 3489,HR, MN

Upvotes: 0

Related Questions