Mako212
Mako212

Reputation: 7302

Python: How can I optimize this CSV parsing loop?

I wrote this loop to parse a 1 million row .csv file. It works, but can only process about 7k lines/minute. Is there a reasonable way I can get this running faster?

The loop is currently transforming block of data into a row, and stripping out the extra characters, and writing each row to a new .csv file.

pattern = re.compile(r",{2,}")

with open("OceanData.csv") as infile, open("OceanParsed.csv","w", newline="") as fout:
    outfile = csv.writer(fout)
    data =[]
    for line in infile:
        if line.startswith("#--------------------------------------------------------------------------------"):
            outfile.writerow(data)
            continue
        for ch in ["[","]","'"," ","\n"]:
            if ch in line:
                line = line.replace(ch,"")
        for i in line:
            line =re.sub(pattern,",", line)
            continue

        if not line: continue
        data.append(line)

Sample data: http://www.sharecsv.com/s/674dc42035c29eb4f250b5c2365c8dc6/OceanParseTest.csv

Upvotes: 1

Views: 118

Answers (1)

glegoux
glegoux

Reputation: 3603

Don't reinvent the wheel to read csv file.

You can use pandas.

import pandas as pd

df = pd.read_csv('file.csv')

Or use csv standard library also.

To read a big csv file, if these methods above don't work. You can split your file into small files, create a process to read each file.

Your data sample.

I think your format file isn't a csv file. Then suppose that you have one section like this:

#--------------------------------------------------------------------------------,,,,,,
CAST                        ,,9285001,WOD Unique Cast Number,WOD code,,
NODC Cruise ID              ,,US-10209       ,,,,
Originators Station ID      ,,82,,,integer,
Originators Cruise ID       ,,               ,,,,
Latitude                    ,,-76.477,decimal degrees,,,
Longitude                   ,,166.3137,decimal degrees,,,
Year                        ,,1997,,,,
Month                       ,,1,,,,
Day                         ,,1,,,,
Time                        ,,3.9931,decimal hours (UT),,,
METADATA,,,,,,
Country                     ,,             US,NODC code,UNITED STATES,,
Accession Number            ,,520,NODC code,,,
Project                     ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPON
SE IN ROSS SEA,,
Platform                    ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c
.s.KUS1475;IMO900725,,
Institute                   ,,431,NODC code,US DOC NOAA NESDIS,,
Cast/Tow Number             ,,1,,,,
High resolution CTD - Bottle,,9182488,,,,
probe_type                  ,,7,OCL_code,bottle/rossette/net,,
scale            ,Temperature,103,WOD code,Temperature: ITS-90,,
Instrument       ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.),
VARIABLES ,Depth     ,F,O,Temperatur ,F,O
UNITS     ,m         , , ,degrees C ,, 
Prof-Flag ,          ,0, ,          ,0, 
1,0,0, ,-1.591,0, 
2,5,0, ,-1.668,0, 
3,10,0, ,-1.702,0, 
4,15,0, ,-1.733,0, 
5,20,0, ,-1.746,0, 
6,25,0, ,-1.76,0, 
7,30,0, ,-1.773,0, 
8,35,0, ,-1.785,0, 
9,40,0, ,-1.796,0, 
10,45,0, ,-1.805,0, 
11,50,0, ,-1.813,0, 
12,55,0, ,-1.823,0, 
13,60,0, ,-1.832,0, 
14,65,0, ,-1.84,0, 
15,70,0, ,-1.848,0, 
16,75,0, ,-1.855,0, 
17,80,0, ,-1.861,0, 
18,85,0, ,-1.867,0, 
19,90,0, ,-1.873,0, 
20,95,0, ,-1.878,0, 
21,100,0, ,-1.882,0, 
22,125,0, ,-1.892,0, 
23,150,0, ,   ---0---,0, 
24,175,0, ,   ---0---,0, 
25,200,0, ,   ---0---,0, 
26,225,0, ,   ---0---,0, 
27,250,0, ,   ---0---,0, 
28,275,0, ,   ---0---,0, 
29,300,0, ,   ---0---,0, 
30,325,0, ,   ---0---,0, 
31,350,0, ,   ---0---,0, 
32,375,0, ,   ---0---,0, 
33,400,0, ,   ---0---,0, 
34,425,0, ,   ---0---,0, 
35,450,0, ,   ---0---,0, 
36,475,0, ,   ---0---,0, 
37,500,0, ,   ---0---,0, 
38,550,0, ,-1.898,0, 
END OF VARIABLES SECTION,,,,,,

Clean this section with :

format.sh:

#!/usr/bin/env bash
# use : bash format.sh pathname    

cat "$1" | \
    grep -v '^#\|^END' | \
    sed 's/,/ /g' | tr -s " " | sed 's/ /,/' 

To get :

CAST,9285001 WOD Unique Cast Number WOD code 
NODC,Cruise ID US-10209 
Originators,Station ID 82 integer 
Originators,Cruise ID 
Latitude,-76.477 decimal degrees 
Longitude,166.3137 decimal degrees 
Year,1997 
Month,1 
Day,1 
Time,3.9931 decimal hours (UT) 
METADATA,
Country,US NODC code UNITED STATES 
Accession,Number 520 NODC code 
Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA 
Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725 
Institute,431 NODC code US DOC NOAA NESDIS 
Cast/Tow,Number 1 
High,resolution CTD - Bottle 9182488 
probe_type,7 OCL_code bottle/rossette/net 
scale,Temperature 103 WOD code Temperature: ITS-90 
Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.) 
VARIABLES,Depth F O Temperatur F O
UNITS,m degrees C 
Prof-Flag,0 0 
1,0 0 -1.591 0 
2,5 0 -1.668 0 
3,10 0 -1.702 0 
4,15 0 -1.733 0 
5,20 0 -1.746 0 
6,25 0 -1.76 0 
7,30 0 -1.773 0 
8,35 0 -1.785 0 
9,40 0 -1.796 0 
10,45 0 -1.805 0 
11,50 0 -1.813 0 
12,55 0 -1.823 0 
13,60 0 -1.832 0 
14,65 0 -1.84 0 
15,70 0 -1.848 0 
16,75 0 -1.855 0 
17,80 0 -1.861 0 
18,85 0 -1.867 0 
19,90 0 -1.873 0 
20,95 0 -1.878 0 
21,100 0 -1.882 0 
22,125 0 -1.892 0 
23,150 0 ---0--- 0 
24,175 0 ---0--- 0 
25,200 0 ---0--- 0 
26,225 0 ---0--- 0 
27,250 0 ---0--- 0 
28,275 0 ---0--- 0 
29,300 0 ---0--- 0 
30,325 0 ---0--- 0 
31,350 0 ---0--- 0 
32,375 0 ---0--- 0 
33,400 0 ---0--- 0 
34,425 0 ---0--- 0 
35,450 0 ---0--- 0 
36,475 0 ---0--- 0 
37,500 0 ---0--- 0 
38,550 0 -1.898 0 

If you are 1M of lines I suppose that you have around 15 000 sections.

I get that with :

for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done

Check:

grep -n ^# data.txt | cut -d : -f1 | wc -l
wc -l data.txt
ls -sh data.txt   

Give well 15 000 sections, 960000 lines, and 34MB.

....

Upvotes: 3

Related Questions