Reputation: 13
I am trying to write some simple code to index some Wikipedia XML pages. The idea was to get the byte offset of each character by reading one character at a time with StreamReader, then saving the position of the underlying byte stream so I could seek back to that position later.
I am using a short test file that contains just "感\na\nb" (8 bytes), with a newline after each character. I then tried this code:
using System;
using System.IO;

namespace indexer
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            StreamReader sr = new StreamReader(@"/home/chris/Documents/len.txt");
            Console.Out.WriteLine(" length of file is " + sr.BaseStream.Length + " bytes ");
            sr.Read(); // read the first character
            Console.Out.WriteLine(" current position is " + sr.BaseStream.Position);
            sr.Close();
        }
    }
}
This gives the output:
length of file is 8 bytes
current position is 8
I expected the position to be 3, since only the first character (3 bytes) should have been read. If I call sr.Read() again, I do get the next character correctly, but the position remains 8.
Am I misunderstanding how this should work, or have I discovered a bug of some sort?
Thank you.
Upvotes: 1
Views: 436
Reputation: 2781
No, it is not a bug. StreamReader uses an internal 1 KB buffer, which is filled when you call StreamReader.Read(). Your test file is only 8 bytes, so the first call reads the whole file into that buffer, and BaseStream.Position reports 8.
To track the byte offset yourself, call Encoding.GetByteCount() to get the number of bytes in each character or string you read. The current encoding is available from StreamReader.CurrentEncoding.
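Here is a minimal sketch of that approach. It assumes the input is UTF-8 without a byte order mark (StreamReader skips a BOM, which would shift every offset by its length). The class name OffsetDemo and the MemoryStream standing in for len.txt are illustrative only and are not from the question. The sketch keeps its own running byte offset instead of relying on BaseStream.Position:

```csharp
using System;
using System.IO;
using System.Text;

namespace indexer
{
    class OffsetDemo // hypothetical name, for illustration only
    {
        public static void Main(string[] args)
        {
            // In-memory stand-in for the 8-byte test file (assumption: UTF-8, no BOM).
            Encoding utf8NoBom = new UTF8Encoding(false);
            byte[] data = utf8NoBom.GetBytes("感\na\nb\n");

            using (StreamReader sr = new StreamReader(new MemoryStream(data), utf8NoBom))
            {
                long offset = 0; // byte offset of the character about to be reported
                int c;
                while ((c = sr.Read()) != -1)
                {
                    string s = ((char)c).ToString();

                    // A character outside the BMP arrives as two UTF-16 chars;
                    // count the pair together so the byte count is correct.
                    if (char.IsHighSurrogate((char)c) && sr.Peek() != -1)
                    {
                        s += (char)sr.Read();
                    }

                    Console.Out.WriteLine("byte offset " + offset + ": U+" + c.ToString("X4"));

                    // Advance by the encoded size of what was just read.
                    offset += sr.CurrentEncoding.GetByteCount(s);
                }
                Console.Out.WriteLine("total bytes counted: " + offset);
            }
        }
    }
}
```

For the test data this should report offsets 0, 3, 4, 5, 6 and 7, and a total of 8 bytes, which matches the file length. To use it on a real file, replace the MemoryStream with a FileStream opened on the file's path.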
Upvotes: 1