Reputation: 599
For example, I have a 1 GB file in HDFS like:
2018-10-10 12:30
EVENT INFORMATION
2018-10-10 12:35
ANOTHER EVENT INFORMATION
...
So I need to use NLineInputFormat
(N = 2), right?
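For context, this is roughly how I would wire up the job driver (just a sketch; the paths come from args and the mapper is left out, since that part is clear to me):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two-line-events");
        job.setJarByClass(NLineDriver.class);

        // Each split handed to a mapper covers 2 lines: timestamp + event info.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2);
        NLineInputFormat.addInputPath(job, new Path(args[0]));

        // job.setMapperClass(EventMapper.class); // plug in your own mapper here
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}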
The question is about more conceptual principles. When and where is this 1 GB file transformed into InputSplits? How does Hadoop handle different splitting logic? Does it need to parse the whole file to create the splits (since we would have to go through the file and count lines one by one)?
This file is divided into 8 blocks in HDFS (1024 / 128). So when I submit a job, Hadoop starts a mapper on each node that holds a block of this file (with the default split size).
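As far as I understand, for a plain FileInputFormat the split size comes out of this formula (my own sketch of the default rule, not the actual Hadoop source):

// splitSize = max(minSize, min(maxSize, blockSize))
long blockSize = 128L << 20;     // 128 MB HDFS block size
long minSize   = 1L;             // mapreduce.input.fileinputformat.split.minsize
long maxSize   = Long.MAX_VALUE; // mapreduce.input.fileinputformat.split.maxsize
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // = 128 MB

long fileSize  = 1024L << 20;    // the 1 GB file above
long splits    = (fileSize + splitSize - 1) / splitSize; // = 8 splits -> 8 mappers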
What will happen if my file isn't divided neatly? Like this:
<block1>
...
2018-10-10 12:
</block1>
<block2>
40
EVENT INFORMATION
...
</block2>
How will the first mapper know about the remaining part that is located on another datanode?
What happens if the split size is 1/2 of the block size? Or 4/5 of the block size? How does the application master know which node it should pick to run a split?
Can you please clarify this and give me some links so I can understand it more deeply?
Upvotes: 2
Views: 558
Reputation: 1882
The division of a file into blocks is a physical division of the data.
A split and HDFS blocks are in a one-to-many relationship:
the HDFS block is the physical representation of the data, while the split is a logical representation of the data in those blocks.
Even when a split is data-local, the program may still read a small amount of data from a remote node, because a record (a row) can be cut across two blocks.
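To make that concrete, here is a simplified sketch of the rule a line-based reader (like LineRecordReader) follows at split boundaries; the stream interface and the byte accounting are made up for illustration, and the real implementation is more involved:

import java.io.IOException;

// Hypothetical minimal stream interface, just to illustrate the rule.
interface SeekableLineStream {
    long skipToNextLine(long from) throws IOException; // bytes skipped
    String readLine() throws IOException;              // null at end of file
}

class BoundarySketch {
    // Rule: a split owns every line that *starts* inside [start, end].
    static void readSplit(long start, long end, SeekableLineStream in)
            throws IOException {
        long pos = start;
        if (start != 0) {
            // Not the first split: the first (possibly partial) line
            // belongs to the previous split, so skip past it.
            pos += in.skipToNextLine(start);
        }
        // Read whole lines; the last one may run past `end`, which means
        // pulling a few bytes from the next block, possibly from a
        // remote datanode over the network.
        while (pos <= end) {
            String line = in.readLine();
            if (line == null) break;   // end of file
            pos += line.length() + 1;  // assumes '\n' line endings, 1 byte/char
            System.out.println(line);  // stand-in for record handling
        }
    }
}

The key invariant is that every line belongs to exactly one split: the split in which it starts.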
When you read a file, it works like this:
The client opens the file by calling open() on the FileSystem object (for HDFS this is a DistributedFileSystem) (step 1). DistributedFileSystem then calls the NameNode via RPC (Remote Procedure Call) to get the locations of the first few blocks of the file (step 2). For each block, the namenode returns the addresses of all the datanodes that hold a replica of that block, sorted by distance from the client in the cluster's network topology.
If the client itself runs on a datanode (for example, when the client is a map task) and that datanode holds the required block, the client reads the block locally.
After these steps, DistributedFileSystem returns an FSDataInputStream (which supports seeking within the file), and the client can read data from it. FSDataInputStream wraps a DFSInputStream, which manages the I/O with the namenode and the datanodes.
The client then calls the read() method (step 3), and the DFSInputStream (which already holds the locations of the first few blocks of the file) connects to the first, i.e. the closest, datanode to fetch the data. By repeatedly calling read() (steps 4 and 5), the data is streamed from the datanode to the client. When the end of a block is reached, DFSInputStream closes the connection to that datanode, looks up the location of the next block, and keeps calling read() to stream that block in turn.
This process is transparent to the user; from the user's point of view it looks like one uninterrupted stream of the entire file.
When the entire file has been read, the client calls close() on the FSDataInputStream to close the file input stream (step 6).
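In client code, all of these steps hide behind a few calls. A minimal sketch (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // open() triggers the RPC to the namenode for block locations (steps 1-2).
        FSDataInputStream in = fs.open(new Path("/data/events.log")); // placeholder path
        try {
            // read() streams block after block from the nearest datanodes (steps 3-5).
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close(); // step 6
        }
    }
}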
Upvotes: 1