Reputation: 7302
I wrote this loop to parse a 1 million row .csv file. It works, but can only process about 7k lines/minute. Is there a reasonable way I can get this running faster?
The loop is currently transforming block of data into a row, and stripping out the extra characters, and writing each row to a new .csv file.
pattern = re.compile(r",{2,}")
with open("OceanData.csv") as infile, open("OceanParsed.csv","w", newline="") as fout:
outfile = csv.writer(fout)
data =[]
for line in infile:
if line.startswith("#--------------------------------------------------------------------------------"):
outfile.writerow(data)
continue
for ch in ["[","]","'"," ","\n"]:
if ch in line:
line = line.replace(ch,"")
for i in line:
line =re.sub(pattern,",", line)
continue
if not line: continue
data.append(line)
Sample data: http://www.sharecsv.com/s/674dc42035c29eb4f250b5c2365c8dc6/OceanParseTest.csv
Upvotes: 1
Views: 118
Reputation: 3603
Don't reinvent the wheel to read csv file.
You can use pandas.
import pandas as pd
df = pd.read_csv('file.csv')
Or use csv standard library also.
To read a big csv file, if these methods above don't work. You can split your file into small files, create a process to read each file.
Your data sample.
I think your format file isn't a csv file. Then suppose that you have one section like this:
#--------------------------------------------------------------------------------,,,,,,
CAST ,,9285001,WOD Unique Cast Number,WOD code,,
NODC Cruise ID ,,US-10209 ,,,,
Originators Station ID ,,82,,,integer,
Originators Cruise ID ,, ,,,,
Latitude ,,-76.477,decimal degrees,,,
Longitude ,,166.3137,decimal degrees,,,
Year ,,1997,,,,
Month ,,1,,,,
Day ,,1,,,,
Time ,,3.9931,decimal hours (UT),,,
METADATA,,,,,,
Country ,, US,NODC code,UNITED STATES,,
Accession Number ,,520,NODC code,,,
Project ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPON
SE IN ROSS SEA,,
Platform ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c
.s.KUS1475;IMO900725,,
Institute ,,431,NODC code,US DOC NOAA NESDIS,,
Cast/Tow Number ,,1,,,,
High resolution CTD - Bottle,,9182488,,,,
probe_type ,,7,OCL_code,bottle/rossette/net,,
scale ,Temperature,103,WOD code,Temperature: ITS-90,,
Instrument ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.),
VARIABLES ,Depth ,F,O,Temperatur ,F,O
UNITS ,m , , ,degrees C ,,
Prof-Flag , ,0, , ,0,
1,0,0, ,-1.591,0,
2,5,0, ,-1.668,0,
3,10,0, ,-1.702,0,
4,15,0, ,-1.733,0,
5,20,0, ,-1.746,0,
6,25,0, ,-1.76,0,
7,30,0, ,-1.773,0,
8,35,0, ,-1.785,0,
9,40,0, ,-1.796,0,
10,45,0, ,-1.805,0,
11,50,0, ,-1.813,0,
12,55,0, ,-1.823,0,
13,60,0, ,-1.832,0,
14,65,0, ,-1.84,0,
15,70,0, ,-1.848,0,
16,75,0, ,-1.855,0,
17,80,0, ,-1.861,0,
18,85,0, ,-1.867,0,
19,90,0, ,-1.873,0,
20,95,0, ,-1.878,0,
21,100,0, ,-1.882,0,
22,125,0, ,-1.892,0,
23,150,0, , ---0---,0,
24,175,0, , ---0---,0,
25,200,0, , ---0---,0,
26,225,0, , ---0---,0,
27,250,0, , ---0---,0,
28,275,0, , ---0---,0,
29,300,0, , ---0---,0,
30,325,0, , ---0---,0,
31,350,0, , ---0---,0,
32,375,0, , ---0---,0,
33,400,0, , ---0---,0,
34,425,0, , ---0---,0,
35,450,0, , ---0---,0,
36,475,0, , ---0---,0,
37,500,0, , ---0---,0,
38,550,0, ,-1.898,0,
END OF VARIABLES SECTION,,,,,,
Clean this section with :
format.sh:
#!/usr/bin/env bash
# use : bash format.sh pathname
cat "$1" | \
grep -v '^#\|^END' | \
sed 's/,/ /g' | tr -s " " | sed 's/ /,/'
To get :
CAST,9285001 WOD Unique Cast Number WOD code
NODC,Cruise ID US-10209
Originators,Station ID 82 integer
Originators,Cruise ID
Latitude,-76.477 decimal degrees
Longitude,166.3137 decimal degrees
Year,1997
Month,1
Day,1
Time,3.9931 decimal hours (UT)
METADATA,
Country,US NODC code UNITED STATES
Accession,Number 520 NODC code
Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA
Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725
Institute,431 NODC code US DOC NOAA NESDIS
Cast/Tow,Number 1
High,resolution CTD - Bottle 9182488
probe_type,7 OCL_code bottle/rossette/net
scale,Temperature 103 WOD code Temperature: ITS-90
Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.)
VARIABLES,Depth F O Temperatur F O
UNITS,m degrees C
Prof-Flag,0 0
1,0 0 -1.591 0
2,5 0 -1.668 0
3,10 0 -1.702 0
4,15 0 -1.733 0
5,20 0 -1.746 0
6,25 0 -1.76 0
7,30 0 -1.773 0
8,35 0 -1.785 0
9,40 0 -1.796 0
10,45 0 -1.805 0
11,50 0 -1.813 0
12,55 0 -1.823 0
13,60 0 -1.832 0
14,65 0 -1.84 0
15,70 0 -1.848 0
16,75 0 -1.855 0
17,80 0 -1.861 0
18,85 0 -1.867 0
19,90 0 -1.873 0
20,95 0 -1.878 0
21,100 0 -1.882 0
22,125 0 -1.892 0
23,150 0 ---0--- 0
24,175 0 ---0--- 0
25,200 0 ---0--- 0
26,225 0 ---0--- 0
27,250 0 ---0--- 0
28,275 0 ---0--- 0
29,300 0 ---0--- 0
30,325 0 ---0--- 0
31,350 0 ---0--- 0
32,375 0 ---0--- 0
33,400 0 ---0--- 0
34,425 0 ---0--- 0
35,450 0 ---0--- 0
36,475 0 ---0--- 0
37,500 0 ---0--- 0
38,550 0 -1.898 0
If you are 1M of lines I suppose that you have around 15 000 sections.
I get that with :
for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done
Check:
grep -n ^# data.txt | cut -d : -f1 | wc -l
wc -l data.txt
ls -sh data.txt
Give well 15 000 sections, 960000 lines, and 34MB.
....
Upvotes: 3