Reputation: 4821
I am trying to verify that there is the correct number of tabs in a large text file (8,000,000+ rows) before importing it to SQL Server.
I think I need to do something like this:
int count = 0;
char tab = "\t";
foreach(char c in tab)
{
    if(char.IsTab(c))
    {
        count++;
    }
}
However, this is not correct. I need to do this to verify that the file is in the proper format.
Upvotes: 3
Views: 2777
Reputation: 5189
With such a large amount of data, you want to avoid loading the whole file into memory at once. Here is a solution that loads only one line of the file into memory at a time and counts the tabs in that line. The results are saved to an int[] where each item in the array contains the count of tabs on the corresponding line.
int[] counts = File.ReadLines("myfile.txt")
    .Select(l => l.Count(c => c == '\t'))
    .ToArray();
UPDATE
If you simply want the total count of all tabs in the file, then you could do:
int sum = File.ReadLines("myfile.txt")
.Select(l => l.Count(c => c == '\t'))
.Sum();
Upvotes: 2
Reputation: 9041
With LINQ you can get your bad rows like this:
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
public static void Main()
{
int expectedNumberOfTabs = 5;
List<string> rows = new List<string>
{
"col1 \t col2 \t col3 \t col4 \t col5 \t col6",
"col1 \t col2 \t col3 \t col4 \t col5 \t col6",
"col1 \t col2 \t col3 \t col4",
"col1 \t col2 \t col3 \t col4 \t col5 \t col6 \t col7",
"col1 \t col2 \t col3 \t col4 \t col5 \t col6",
"col1 \t col2 \t col3 \t col4 \t col5",
"col1 \t col2 \t col3 \t col4 \t col5 \t col6",
};
var badRows = rows.Where(row => row.Count(c => c == '\t') != expectedNumberOfTabs);
foreach (var badRow in badRows)
{
// Fix the bad rows
Console.WriteLine(badRow);
}
}
}
Results:
col1 col2 col3 col4
col1 col2 col3 col4 col5 col6 col7
col1 col2 col3 col4 col5
Now I don't expect you to read all 8,000,000+ rows into memory at once. I think you'd read them in one row at a time and deal with them one at a time, so the line from this snippet that you're really interested in is:
row.Count(c => c == '\t') != expectedNumberOfTabs
Which will identify a "bad" row for you to fix.
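If you only need to report the bad rows (with their line numbers) rather than fix them, the same check can be streamed over the file with File.ReadLines, which enumerates lazily. A minimal sketch, where the file name is just a placeholder:

```csharp
using System;
using System.IO;
using System.Linq;

public class Program
{
    public static void Main()
    {
        int expectedNumberOfTabs = 5;

        // File.ReadLines enumerates the file lazily, one line at a time,
        // so the whole 8,000,000+ row file never sits in memory.
        var badLines = File.ReadLines("MyFile.txt")   // placeholder file name
            .Select((line, index) => new { line, LineNumber = index + 1 })
            .Where(x => x.line.Count(c => c == '\t') != expectedNumberOfTabs);

        foreach (var bad in badLines)
        {
            Console.WriteLine($"Line {bad.LineNumber}: {bad.line}");
        }
    }
}
```

Because the query is lazy, nothing is read until the foreach runs, and only one line is held in memory at a time.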
Because you're dealing with a massive amount of data you may want to try copying the lines from the file to a new file, fixing bad lines as you run across them. Once you have your new "fixed" file, delete the original file, and then rename the "fixed" file back to your original file and import it into your database.
using System.IO;
using System.Linq;
public class Program
{
public static void Main()
{
int expectedNumberOfTabs = 5;
string originalFile = "MyFile.txt";
string originalFileFixed = "MyFileFixed.txt";
using (StreamReader sr = new StreamReader(originalFile))
using (StreamWriter sw = new StreamWriter(originalFileFixed))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (line.Count(c => c == '\t') != expectedNumberOfTabs)
        {
            // line = ...Fix the line
        }
        sw.WriteLine(line);
    }
}
// Delete original file
File.Delete(originalFile);
// Rename the fixed file back to the original file
File.Move(originalFileFixed, originalFile);
// Import the file
}
}
Upvotes: 3
Reputation: 3342
This seems to work for me:
int count = 0;
string tab = "te\tst\t\t\t";
foreach (char c in tab.ToCharArray())
{
    if (c == '\t') // there is no char.IsTab() method
    {
        count++;
    }
}
Console.WriteLine(count);
Gives me this result:
4
Your original code was not working because you declared tab as a char, which can't hold the string "\t" and can't be iterated. I changed it to a string and iterated over the string as an array of chars.
This may not be the optimal way, but it is a working way based on your original code.
Upvotes: 1
Reputation: 755
If you must do it before uploading, you could use a StreamReader so you don't load the entire file into a single string. But I guess it could be slow. Maybe you could divide the file into equal chunks and have separate threads process them.
Here's a sequential solution:
int count = 0;
using (StreamReader sr = new StreamReader(@"c:\temp\file.txt"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        count += line.Count(f => f == '\t'); // requires using System.Linq;
    }
}
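As for the multiple-threads idea, you don't necessarily have to split the file into chunks yourself. A sketch using PLINQ, which fans the per-line counting out across worker threads while File.ReadLines still streams the file (the path is a placeholder):

```csharp
using System;
using System.IO;
using System.Linq;

public class Program
{
    public static void Main()
    {
        // AsParallel() hands the per-line tab counting to PLINQ worker
        // threads; ReadLines still streams the file one line at a time.
        int totalTabs = File.ReadLines(@"c:\temp\file.txt")   // placeholder path
            .AsParallel()
            .Sum(line => line.Count(c => c == '\t'));

        Console.WriteLine(totalTabs);
    }
}
```

Whether this actually beats the sequential loop depends on how much time is spent in I/O rather than counting, so it's worth measuring both before committing to it.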
Upvotes: 0