Aspire

Reputation: 417

Parsing through a raw data file

I'm learning about local I/O and how to read and write files. I'm currently working on an assignment where I have to parse a semicolon-separated file, convert the semicolons to commas, and replace any commas inside values with semicolons. To give you a better idea, here's a piece of the raw data I'm working with.


String;Categorical;Categorical;Int;Int;Int;Int;Float;Float;Int;Int;Int;Int;Float;Float;Float
100% Bran;N;C;70;4;1;130;10;5;6;280;25;3;1;0.33;68.402973
100% Natural Bran;Q;C;120;3;5;15;2;8;8;135;0;3;1;1;33.983679
All-Bran;K;C;70;4;1;260;9;7;5;320;25;3;1;0.33;59.425505
All-Bran with Extra Fiber;K;C;50;4;0;140;14;8;0;330;25;3;1;0.5;93.704912
Almond Delight;R;C;110;2;2;200;1;14;8;-1;25;3;1;0.75;34.384843
Apple Cinnamon Cheerios;G;C;110;2;2;180;1.5;10.5;10;70;25;1;1;0.75;29.509541
Froot Loops;K;C;110;2;1;125;1;11;13;30;25;2;1;1;32.207582
Frosted Flakes;K;C;110;1;0;200;1;14;11;25;25;1;1;0.75;31.435973
Frosted Mini-Wheats;K;C;100;3;0;0;3;14;7;100;25;2;1;0.8;58.345141
Fruit & Fibre Dates, Walnuts, and Oats;P;C;120;3;2;160;5;12;10;200;25;3;1.25;0.67;40.917047

The goal is to separate the values with commas. For any values that have a comma in them, such as the name in the last row, "Fruit & Fibre Dates, Walnuts, and Oats", I want to replace those commas with semicolons. I cannot import any helper libraries, such as csv or pandas. I'm not sure how to do this assignment, but here is the code I have so far:

def convert_table(filename_in, filename_out):
  with open('cereal.scsv', 'r') as filename_in:
    for line in filename_in:
      print(line, end='\n')
    with open('cereal.scsv', 'w') as filename_out:
      for line in filename_in:
        newLine = line.replace(";", ",")
        filename_out.write(newLine)
  return True

Any advice or tips are much appreciated!

Upvotes: 0

Views: 1109

Answers (3)

Neda Peyrone

Reputation: 340

You can split on the semicolons with pandas, if importing it is allowed. Please try this.

Python code:

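import pandas as pd

# Assumed paths: the question reads 'cereal.scsv'; the output name is a placeholder
input_file = 'cereal.scsv'
output_file = 'cereal.csv'

def replace(x):
    # Commas inside a value become semicolons
    return str(x).replace(",", ";")

# Read every column as text so the numbers are written back unchanged
df = pd.read_csv(input_file, sep=';', encoding='utf-8', header=None, dtype=str).fillna('')
# Only the first column (the name) contains commas in the sample data
df[0] = df[0].apply(replace)
print(df)
df.to_csv(output_file, sep=',', encoding='utf-8', index=False, header=False)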

Output:

String,Categorical,Categorical,Int,Int,Int,Int,Float,Float,Int,Int,Int,Int,Float,Float,Float
100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679
All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843
Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1,0.75,29.509541
Froot Loops,K,C,110,2,1,125,1,11,13,30,25,2,1,1,32.207582
Frosted Flakes,K,C,110,1,0,200,1,14,11,25,25,1,1,0.75,31.435973
Frosted Mini-Wheats,K,C,100,3,0,0,3,14,7,100,25,2,1,0.8,58.345141
Fruit & Fibre Dates; Walnuts; and Oats,P,C,120,3,2,160,5,12,10,200,25,3,1.25,0.67,40.917047

Upvotes: 1

RightmireM

Reputation: 2492

# Open the output file first, so you don't keep reopening and reclosing it at every line
# You shouldn't try to read and write the same file in the same loop
with open('cereal.scsv.out', 'w') as filename_out:  # Outfile name changed
    with open('cereal.scsv', 'r') as filename_in:
        for sentence in filename_in:
            print("----------------")
            print("Orig sentence =", sentence)
            # Split the sentence into a list, broken at the ";"
            wordlist = sentence.split(";")
            # Now cycle through each word/phrase in the wordlist, and replace the commas
            # Add them one by one to a new wordlist
            newwordlist = []
            for word in wordlist: 
                newword = word.replace(",", ";")
                newwordlist.append(newword)
            # And rejoin all the words/phrases, using a comma as the joiner
            newsentence = ','.join(newwordlist)
            print("newsentence =", newsentence)
            filename_out.write(newsentence)

OUTPUT:

----------------
Orig sentence = String;Categorical;Categorical;Int;Int;Int;Int;Float;Float;Int;Int;Int;Int;Float;Float;Float
newsentence = String,Categorical,Categorical,Int,Int,Int,Int,Float,Float,Int,Int,Int,Int,Float,Float,Float
----------------
Orig sentence = 100% Bran;N;C;70;4;1;130;10;5;6;280;25;3;1;0.33;68.402973
newsentence = 100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973
----------------
Orig sentence = 100% Natural Bran;Q;C;120;3;5;15;2;8;8;135;0;3;1;1;33.983679
newsentence = 100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679
----------------
Orig sentence = All-Bran;K;C;70;4;1;260;9;7;5;320;25;3;1;0.33;59.425505
newsentence = All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505
----------------
Orig sentence = All-Bran with Extra Fiber;K;C;50;4;0;140;14;8;0;330;25;3;1;0.5;93.704912
newsentence = All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912
----------------
Orig sentence = Almond Delight;R;C;110;2;2;200;1;14;8;-1;25;3;1;0.75;34.384843
newsentence = Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843
----------------
Orig sentence = Apple Cinnamon Cheerios;G;C;110;2;2;180;1.5;10.5;10;70;25;1;1;0.75;29.509541
newsentence = Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1,0.75,29.509541
----------------
Orig sentence = Froot Loops;K;C;110;2;1;125;1;11;13;30;25;2;1;1;32.207582
newsentence = Froot Loops,K,C,110,2,1,125,1,11,13,30,25,2,1,1,32.207582
----------------
Orig sentence = Frosted Flakes;K;C;110;1;0;200;1;14;11;25;25;1;1;0.75;31.435973
newsentence = Frosted Flakes,K,C,110,1,0,200,1,14,11,25,25,1,1,0.75,31.435973
----------------
Orig sentence = Frosted Mini-Wheats;K;C;100;3;0;0;3;14;7;100;25;2;1;0.8;58.345141
newsentence = Frosted Mini-Wheats,K,C,100,3,0,0,3,14,7,100,25,2,1,0.8,58.345141
----------------
Orig sentence = Fruit & Fibre Dates, Walnuts, and Oats;P;C;120;3;2;160;5;12;10;200;25;3;1.25;0.67;40.917047
newsentence = Fruit & Fibre Dates; Walnuts; and Oats,P,C,120,3,2,160,5,12,10,200,25,3,1.25,0.67,40.917047

If you want to get fancy and impress your teacher, you can replace the inner loop with a single line (a list comprehension), like...

# newwordlist = []
# for word in wordlist: 
#     newword = word.replace(",", ";")
#     newwordlist.append(newword)

newwordlist = [ word.replace(",", ";") for word in wordlist ]
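Putting this together in the shape the question asks for, here is a minimal sketch of `convert_table` that treats its two parameters as file paths instead of hardcoding them. Returning `True` mirrors the question's code. Stripping the newline before splitting and the names `file_in`, `file_out`, and `fields` are my own choices, not part of the assignment.

```python
def convert_table(filename_in, filename_out):
    """Convert a semicolon-separated file to a comma-separated one."""
    # Open both files once; the input and output paths must be different
    with open(filename_in, 'r') as file_in, open(filename_out, 'w') as file_out:
        for line in file_in:
            # Drop the trailing newline so it is not carried along in the last field
            fields = line.rstrip('\n').split(';')
            # Commas inside a value become semicolons
            fields = [field.replace(',', ';') for field in fields]
            # Rejoin with commas and put the newline back
            file_out.write(','.join(fields) + '\n')
    return True
```

You would call it as, for example, `convert_table('cereal.scsv', 'cereal.csv')`, where the output name is only a placeholder.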

Upvotes: 2

hoodakaushal

Reputation: 1293

You can't just replace semicolons with commas directly, because afterwards you can't tell which commas were originally inside a value and need to become semicolons, and which commas used to be semicolons and should stay commas.

What you need to do is split the line on the semicolons, replace each comma with a semicolon in every string of the resulting list, and then join the list again, this time using a comma.
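As a rough illustration of the difference, here is a small sketch that runs both approaches on a shortened version of the last row from the question. The helper name `convert_line` is hypothetical, and the row is trimmed to four fields for readability.

```python
def convert_line(line):
    # Split on the original delimiter first, while it is still unambiguous
    fields = line.split(';')
    # Inside a field, any comma belongs to the value, so swap it for a semicolon
    fields = [field.replace(',', ';') for field in fields]
    # Rejoin using the new delimiter
    return ','.join(fields)

row = 'Fruit & Fibre Dates, Walnuts, and Oats;P;C;120'

# Naive replacement: the original commas and the new ones are now indistinguishable
naive = row.replace(';', ',')
print(naive)              # Fruit & Fibre Dates, Walnuts, and Oats,P,C,120

# Split, replace, join: the name stays a single field
print(convert_line(row))  # Fruit & Fibre Dates; Walnuts; and Oats,P,C,120
```

The naive result would be read back as six fields instead of four, which is why the split has to happen before any replacement.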

Upvotes: 1
