Reputation: 61
I've read through some of the previous questions on speed up processing of large CSV data. I've implement some of the ideas and i got some improvement on processing time. However i still need to further cut down the processing time hopefully someone can help me.
I think my code is too long, I'll try to simplify. Here is what my code suppose to do:
1. Read through a csv file.
2. Group the data by first column; calculate total sum of each column and return the result.
Example (Raw Data):
A B C
1 2 3
1 2 3
2 4 4
2 4 4
Result:
A B C
1 4 6
2 8 8
Note: My actual data will be 100MB file with 630 columns and 29000 rows, total 18.27M records.
Here is how i achieve it:
Method 1:
1. Read a csv file through Filestream.
2. Use Split to split the returned string and process line by line, field by field.
3. Storing the result in an array and save the result in a text file.
Note on Method1: Time to process the data using this method takes ~1 min 20 secs.
Method 2:
1. Read a csv file through Filestream.
2. Feed the data into different threads before start process. (For now i feed 100 lines of data into different thread, fix 5 threads for now due to CPU resource constraint)
3. Use Split to split the returned string and process line by line, field by field in each thread.
4. Join all result from every threads and store in an array. Save the result in text file.
Note on Method 2: Time to process the data using this method takes ~50 secs.
So i got ~30secs improvement migrating from Method 1 to Method 2. I was wondering whether what i can do to further improve the process time. I've tried to cut down the data into smaller section like 100 lines x 100 columns and process it but the time to process the data become longer instead.
Hopefully some one can help me on this.
Thank you in advance.
Edit:
Here is my code for Method 2 (I'll skip Method 1 as i'm not using it already), I have a subroutine that manage the assignment of threads for every 100 lines read from filestream, execute each threads and return the result, finally update the all the results into single array before write the result into file. I tried to make the code as simple as possible. Hopefully this will give more idea to you all on how i process my data.
'Subroutine that assign smaller section of raw data into different threads
Sub process_control(byval filename as string)
Dim sread As New FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read)
Dim read As New StreamReader(sread)
Dim t1 As System.Threading.Thread
Dim value, data1(), data2(), data3(), data4(), data5(), threadid(), result1(0), result2(0), result3(0), result4(0), result5(0) As String
Dim row as integer
Dim rowlimit as integer = 99
Dim check1 as boolean = true
row = 0
check = false
ReDim data1(rowlimit), data2(rowlimit), data3(rowlimit), data4(rowlimit), data5(rowlimit), threadid(4)
do value = read.ReadLine If row < rowlimit + 1 then If data1(rowlimit) = "" Then data1(row) = value ElseIf data2(rowlimit) = "" Then data2(row) = value ElseIf data3(rowlimit) = "" Then data3(row) = value ElseIf data4(rowlimit) = "" Then data4(row) = value ElseIf data5(rowlimit) = "" Then data5(row) = value End If Else If data1(rowlimit) <> "" And data2(rowlimit) = "" And data3(rowlimit) = "" And data4(rowlimit) = "" And data5(rowlimit) = "" Then threadid(0) = "" t1 = New Threading.Thread(Sub() result1 = process(data1).Clone threadid(0) = System.Threading.Thread.CurrentThread.ManagedThreadId End Sub) t1.Start() row = 0 data2(row) = value ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) = "" And data4(rowlimit) = "" And data5(rowlimit) = "" Then threadid(1) = "" t1 = New Threading.Thread(Sub() result2 = process(data2).Clone threadid(1) = System.Threading.Thread.CurrentThread.ManagedThreadId End Sub) t1.Start() row = 0 data3(row) = value ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) <> "" And data4(rowlimit) = "" And data5(rowlimit) = "" Then threadid(2) = "" t1 = New Threading.Thread(Sub() result3 = process(data3).Clone threadid(2) = System.Threading.Thread.CurrentThread.ManagedThreadId End Sub) t1.Start() row = 0 data4(row) = value ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) <> "" And data4(rowlimit) <> "" And data5(rowlimit) = "" Then threadid(3) = "" t1 = New Threading.Thread(Sub() result4 = process(data4).Clone threadid(3) = System.Threading.Thread.CurrentThread.ManagedThreadId End Sub) t1.Start() row = 0 data5(row) = value ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) <> "" And data4(rowlimit) <> "" And data5(rowlimit) <> "" Then threadid(4) = "" t1 = New Threading.Thread(Sub() result5 = process(data5).Clone threadid(4) = System.Threading.Thread.CurrentThread.ManagedThreadId End Sub) t1.Start() row = 0 check1 = True End If row += 1 End If
If check1 = True Then Do System.Threading.Thread.Sleep(100) Loop Until threadid(0) <> "" And threadid(1) <> "" And threadid(2) <> "" And threadid(3) <> "" And threadid(4) <> "" row = 0 ReDim data1(rowlimit) data1(row) = value row += 1 result1_update(result1) ' consolidate result into a single array result2_update(result2) ' consolidate result into a single array result3_update(result3) ' consolidate result into a single array result4_update(result4) ' consolidate result into a single array result5_update(result5) ' consolidate result into a single array check1 = False ReDim data2(rowlimit), data3(rowlimit), data4(rowlimit), data5(rowlimit) End If
loop until read.endofstream
end sub
' Function that calculate the sum of each row and columns
Function process(ByVal data() As String) As String() Dim line(), line1(), result() As String Dim check As Boolean redim result(0)
For n = 0 To (data.Count - 1) if result(0) = "" and result.count = 1 then result(result.count-1) = data(n) else check = true line1 = Split(data(n), ",", -1, CompareMethod.Text) For m = 0 to (result.count-1) line = split(result(m),",",-1, CompareMethod.Text) if line1(0) = line(0) then check = false for o = 1 to (line1.count-1) line(o) = val(line1(o)) + val(line(o)) next o result(m) = join(line,",") exit for end if Next m
if check = true then redim preserve result(result.count) result(result.count-1) = join(line1,",") end if end if Next n
redim preserve result(result.count-2) process = result.clone End Function
Upvotes: 0
Views: 1203
Reputation: 6948
Looking at your code I noticed a couple of things:
you're using Val
which is very easy to use but has a high overhead. Integer.Parse
would work much more efficiently.
You're converting from string to number back to string way more than you should have to. Since your summary will only be a fraction of the size of your complete data, you shouldn't have any trouble storing the results in memory. A Dictionary(Of Integer, Integer())
would work well for this.
Consider this code which will read the data, summarize it and put the data in a format easy to write to a file all in less than 10 secs. using random integers up to 1000:
Function SummarizeData(filename As String, delimiter As Char) As Dictionary(Of Integer, Integer())
Dim limit As Integer = 0
SummarizeData = New Dictionary(Of Integer, Integer())
Using sr As New IO.StreamReader(filename)
'Since we don't need the first line for the summary we can read it get _
'the upper bound for the array, and discard the line.
If Not sr.EndOfStream Then
limit = sr.ReadLine.Split(delimiter).Length - 1
Else : Throw New Exception("Empty File")
End If
Do Until sr.EndOfStream
'This creates an array of integers representing the data in one line.
Dim line = sr.ReadLine.Split(" "c).Select(Function(x) Integer.Parse(x)).ToArray
'If the key is already in the dictionary we increment the values
If SummarizeData.ContainsKey(line(0)) Then
For I = 1 To limit
SummarizeData.Item(line(0))(I) += line(I)
Next
Else
'If not we create a new element using the line as the initial values
SummarizeData.Add(line(0), New Integer(limit) {})
SummarizeData.Item(line(0)) = line
End If
Loop
End Using
End Function
To use the function and write the data, this would work:
Dim results = SummarizeData("data.txt", ","c)
'If you don't need the results sorted you can gain a few fractions of a second by _
'removing the Order By clause
IO.File.WriteAllLines("results.txt", (From kvp In results
Order By kvp.Key
Select String.Join(",", kvp.Value)).ToArray)
Upvotes: 0