Reputation: 887
I have a very big text file. I want to determine the number of bytes of each line and save it in another file.
Upvotes: 1
Views: 7171
Reputation: 74750
If you have a definition of what constitutes a "line" in your big file, you can simply iterate over the file byte by byte and record the current index at each line end or line start.
For example, if you have a Unix text file (i.e. \n as the line delimiter), this may look like this:
/**
 * A simple class encapsulating information about a line in a file.
 */
public static class LineInfo {
    LineInfo(long number, long start, long end) {
        this.lineNumber = number;
        this.startPos = start;
        this.endPos = end;
        this.length = endPos - startPos;
    }
    /** the line number of the line. */
    public final long lineNumber;
    /** the index of the first byte of this line. */
    public final long startPos;
    /** the index after the last byte of this line. */
    public final long endPos;
    /** the length of this line (not including the line separators surrounding it). */
    public final long length;
}
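/**
 * Creates an index of a file by lines.
 * A "line" is defined as a group of bytes between '\n'
 * bytes (or the start/end of the file).
 *
 * For each line, a LineInfo element is created and put into the list.
 * The list is sorted by line number, start position and end position.
 *
 * Needs imports for java.io.BufferedInputStream, java.io.File,
 * java.io.FileInputStream, java.io.IOException, java.io.InputStream,
 * java.util.ArrayList and java.util.List.
 */
public static List<LineInfo> indexFileByLines(File f)
    throws IOException
{
    List<LineInfo> infos = new ArrayList<LineInfo>();
    long index = 0, lastStart = 0, lineNumber = 0;
    // try-with-resources closes the stream even if reading fails
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        int b;
        while ((b = in.read()) >= 0) {
            if (b == '\n') {
                infos.add(new LineInfo(lineNumber, lastStart, index));
                lastStart = index + 1;
                lineNumber++;
            }
            index++;
        }
    }
    // a final line that is not terminated by '\n' would otherwise be dropped
    if (index > lastStart) {
        infos.add(new LineInfo(lineNumber, lastStart, index));
    }
    return infos;
}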
This avoids any conversion of bytes to chars, and thus any encoding issues, as long as the encoding is ASCII-compatible (in UTF-16, for example, the byte 0x0A can also occur inside other characters). It still depends on the line separator being \n, but the separator could be passed to the method as a parameter.
(For DOS/Windows files with \r\n as the separator, the condition is a bit more complicated, as we would either have to store the previous byte or look ahead to the next one.)
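The "store the previous byte" variant could be sketched as follows. This is only an illustration under a few assumptions: a line ends at \n, a \r directly before that \n counts as part of the separator, and a lone \r is ordinary content. The names CrlfLineLengths and lineLengths are made up for this example and are not part of the code above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class CrlfLineLengths {

    /**
     * Returns the byte length of each line, not counting the separator.
     * Assumption: lines end with "\n" or "\r\n"; a lone '\r' is content.
     */
    static List<Long> lineLengths(InputStream raw) throws IOException {
        List<Long> lengths = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(raw)) {
            long length = 0;   // bytes seen in the current line so far
            int previous = -1; // the byte read before the current one
            int b;
            while ((b = in.read()) >= 0) {
                if (b == '\n') {
                    // a '\r' directly before '\n' belongs to the separator
                    lengths.add(previous == '\r' ? length - 1 : length);
                    length = 0;
                } else {
                    length++;
                }
                previous = b;
            }
            if (length > 0) {
                // final line without a trailing separator
                lengths.add(length);
            }
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ab\r\nc\n\nlast".getBytes(StandardCharsets.US_ASCII);
        // prints [2, 1, 0, 4]
        System.out.println(lineLengths(new ByteArrayInputStream(data)));
    }
}
```

Counting the \r first and subtracting it again when a \n follows keeps the loop to a single pass without any lookahead.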
For easier use, a pair (or triple) of SortedMap<Long, LineInfo> might be better than a list.
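One way the map idea might look, as a rough sketch: a TreeMap (which is a SortedMap) keyed by start position lets you find the line that contains a given byte offset. The class LineIndexByStart, its nested Line class (a simplified stand-in for LineInfo) and the method lineAt are hypothetical names chosen for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class LineIndexByStart {

    /** Simplified stand-in for LineInfo: line number plus byte range [start, end). */
    static final class Line {
        final long number;
        final long start;
        final long end;

        Line(long number, long start, long end) {
            this.number = number;
            this.start = start;
            this.end = end;
        }
    }

    // keyed by the index of the first byte of each line
    private final TreeMap<Long, Line> byStart = new TreeMap<>();

    void add(Line line) {
        byStart.put(line.start, line);
    }

    /**
     * Returns the line whose content contains the given byte offset,
     * or null if the offset is a separator byte or outside the index.
     */
    Line lineAt(long offset) {
        // greatest start position that is <= offset
        Map.Entry<Long, Line> entry = byStart.floorEntry(offset);
        if (entry == null) {
            return null;
        }
        Line line = entry.getValue();
        return offset < line.end ? line : null;
    }
}
```

A second map keyed by line number would cover lookups in the other direction; whether the extra memory is worth it depends on how the index is used.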
Upvotes: 0
Reputation: 12575
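The following code reads the file line by line and prints the number of bytes in each line (UTF-8, line terminators not counted):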
// path and filePath are assumed to be defined elsewhere
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(path + "/" + filePath), "UTF-8"))) {
    String eachLine;
    while ((eachLine = in.readLine()) != null) {
        // readLine() strips the line terminator, so it is not counted
        byte[] chunks = eachLine.getBytes("UTF-8");
        System.out.println(chunks.length);
    }
}
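Since the question asks for the counts to be saved in another file, the same loop could write them out instead of printing them. The following is a sketch only: the class LineByteCountsToFile and the method writeLineByteCounts are invented for this example, and the input is assumed to be well-formed in the given charset (Files.newBufferedReader throws on malformed bytes instead of replacing them).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LineByteCountsToFile {

    /**
     * Writes one decimal number per input line to the output file:
     * the byte length of that line in the given charset, without terminator.
     * Returns the number of lines processed.
     */
    static long writeLineByteCounts(Path input, Path output, Charset charset)
            throws IOException {
        long lines = 0;
        try (BufferedReader in = Files.newBufferedReader(input, charset);
             BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.US_ASCII)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(Integer.toString(line.getBytes(charset).length));
                out.newLine();
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LineByteCountsToFile <input> <output>");
            return;
        }
        long n = writeLineByteCounts(Paths.get(args[0]), Paths.get(args[1]),
                StandardCharsets.UTF_8);
        System.out.println(n + " lines processed");
    }
}
```

Only one line is held in memory at a time, so this should also work for very large files as long as no single line is huge.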
Upvotes: 2
Reputation: 10329
Using java.io.BufferedReader, you can easily read each line as a separate String. The number of bytes used by a line depends on the encoding used. For a simple ASCII encoding, you can simply use the length of the String, since each character takes up one byte. For multi-byte encodings like UTF-8, you would need a more complicated approach.
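To illustrate how the byte count depends on the encoding, here is a small sketch that encodes the same string in several charsets. The class EncodedLength and the method byteLength are illustrative names; the approach simply re-encodes the line, which is one possible way to handle multi-byte encodings, not necessarily the most efficient one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLength {

    /** Number of bytes the line occupies when encoded in the given charset. */
    static int byteLength(String line, Charset charset) {
        return line.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String line = "na\u00efve"; // "naive" with a diaeresis on the i
        System.out.println(line.length());                                // 5 chars
        System.out.println(byteLength(line, StandardCharsets.ISO_8859_1)); // 5 bytes
        System.out.println(byteLength(line, StandardCharsets.UTF_8));      // 6 bytes
        System.out.println(byteLength(line, StandardCharsets.UTF_16LE));   // 10 bytes
    }
}
```

String.length() matches the byte count only for single-byte encodings such as ASCII or ISO-8859-1.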
Upvotes: 2
Reputation: 23629
Create a loop that:
Upvotes: 1