iHaveQuestion

Reputation: 1

GZIPInputStream: Can't read online .gz file correctly line by line

I want to read GZIP files (each containing thousands of numeric IDs), which I access over HTTP, line by line. Sometimes the file is read correctly, but often it is not. This is my current approach:

BufferedReader br = null;
List<Long> list = Lists.newArrayList();
URL url = new URL("xx.gz");
try{
   br = new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));
   String line = null;
   while ((line = br.readLine()) != null){
       if (NumberUtils.isDigits(line)) {
           try {
               list.add(Long.valueOf(line));
           }catch (Exception e){
               logger.error("parse line:{} error:",line,e);
               continue;
           }
           if (list.size() == 20) {
             //batch handle
             list = Lists.newArrayList();
           }
       }
   }
}catch (Exception e){
   logger.error("handle file error:",e);
}finally {
   if(br != null){
      br.close();
   }
}

I see many "parse line" errors in the log because the line is larger than Long.MAX_VALUE (e.g. 10352194518417219194627517808180732445615956450138943). When I download the GZIP file and inspect it, there is no line larger than Long.MAX_VALUE.

Java version: 1.8

OS version: CentOS release 6.9 (Final)

The first few hundred lines are read correctly. After that, the log shows out-of-order numbers that do not exist in the file.

After some tests and some searching on the Internet, my guess is that the server pushes the file into memory quickly while the client reads and processes it slowly, which may lead to a memory leak. The server then closes the connection on its own initiative while the TCP connection is still alive. At that point, the client may receive corrupted data until all the data in memory has been read.

Finally, this is my first question on Stack Overflow, so sorry if it is not presented in the standard way.

Upvotes: 0

Views: 537

Answers (1)

Stephen C

Reputation: 718986

I see many "parse line" errors in the log because the line is larger than Long.MAX_VALUE (e.g. 10352194518417219194627517808180732445615956450138943). When I download the GZIP file and inspect it, there is no line larger than Long.MAX_VALUE.

The error is coming from these lines:

   if (NumberUtils.isDigits(line)) {
       try {
           list.add(Long.valueOf(line));
       }catch (Exception e){
           logger.error("parse line:{} error:",line,e);
           continue;
       }
       ...
   }

First of all, your diagnosis is not correct. The problem is not caused by lines that are larger (longer) than Long.MAX_VALUE. The problem is that the lines represent numbers that are larger than Long.MAX_VALUE. That causes Long.valueOf to fail with a NumberFormatException.
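To illustrate, here is a minimal sketch of that failure. The class and method names are made up for this example, and a plain digit check stands in for NumberUtils.isDigits so that the snippet has no dependencies. It shows that a line can consist only of digits and still be rejected by Long.valueOf:

```java
final class LongRangeDemo {

    // Hypothetical stand-in for NumberUtils.isDigits: non-empty and digits only.
    static boolean isDigits(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    // True if the digit string can be represented as a long.
    static boolean fitsInLong(String s) {
        try {
            Long.valueOf(s);
            return true;
        } catch (NumberFormatException e) {
            // Thrown when the value is outside the range of long.
            return false;
        }
    }

    public static void main(String[] args) {
        // The example value from the question.
        String line = "10352194518417219194627517808180732445615956450138943";
        System.out.println("digits only:  " + isDigits(line));
        System.out.println("fits in long: " + fitsInLong(line));
        System.out.println("Long.MAX_VALUE = " + Long.MAX_VALUE);
    }
}
```

So the isDigits check passing tells you nothing about whether the value is in range for a long.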

So, this is not a problem with the GZIPInputStream or HTTP or downloading or the encoding of the file or some of the other things that commenters have speculated about. And it is not (really) a problem with the lines being too long. (That line is only about 55 characters long.)

The problem could be that you have chosen the wrong way to represent the (apparently valid) numbers that you are reading from the file.

So what representation should you use?

It depends what these numbers mean:

  • If they are really integers, use BigInteger.
  • If they are actually multiple integers "smooshed together" in some way, you could try parsing them. This assumes that you understand the format and/or the "smooshing" process.
  • If they are actually identifiers of some kind, use String.
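As a rough sketch of the first option, assuming the lines really are plain non-negative integers (the class name, method name and the ASCII digit check are illustrative choices, not part of your code), the parsing step could collect BigInteger values instead of Long:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BigIdParser {

    // Parses every digits-only line as a BigInteger and skips everything else.
    static List<BigInteger> parseAll(List<String> lines) {
        List<BigInteger> result = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            boolean digitsOnly = !line.isEmpty()
                    && line.chars().allMatch(c -> c >= '0' && c <= '9');
            if (digitsOnly) {
                // BigInteger has no upper bound, so large values parse fine.
                result.add(new BigInteger(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "42",
                "10352194518417219194627517808180732445615956450138943",
                "not-a-number");
        for (BigInteger id : parseAll(lines)) {
            // bitLength() <= 63 means the value would also fit in a long.
            System.out.println(id + " fits in long: " + (id.bitLength() <= 63));
        }
    }
}
```

If the values are identifiers rather than numbers, the same loop could simply add the trimmed line to a List<String> and skip the conversion entirely.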

Alternatively, it could be that there is a bug in the software that is generating the file you are reading. For example, it might be (incorrectly) joining two or more numbers together into a single line.

But you would be in a better position to know that than us. We have no idea what your file is supposed to mean ... and whether really large numbers are actually valid data.


Alternatively, it could be that you have misunderstood the description, specification, examples, conversation, or whatever ... where the file format was explained.

Upvotes: 1
