Reputation: 853
Well, I know that my question requires more guidance than technicalities, but I hope that SO members will not mind a newbie to TPL Dataflow asking some very basic questions.
I have a simple Windows Forms application which is responsible for extracting data from Excel files on my system and saving it to the database. The process takes too long, and I want to make it asynchronous and parallel. Below is a brief outline of my scenario.
Call a function to open a connection to the database at the start.
Call a function to update the database with the time of the operation.
The application needs to process, say, 100 Excel files in incremental order. For this I use a FileNumber which is incremented with each call.
Call a function to update the UI (FileNumber is passed), e.g. "File 1 processing".
Call a function to read the Excel file (FileNumber is passed).
Call a function to process the Excel file data (the Excel data and FileNumber are passed).
Call a function to save the values to the database (the Excel data and FileNumber are passed).
Call a function to update the UI (FileNumber is passed), e.g. "File 1 processed".
What I have achieved so far is making this process asynchronous using Tasks. I have used async and await for all long-running operations and converted my functions to Tasks.
Now I want to make some tasks run in parallel. Not every task will be parallel; for example, opening the database connection will just be asynchronous. But I want to create a single task or function which uses dataflow blocks for every task/function in my application, from updating the UI to reading the Excel files and saving them to the database.
I started using an ActionBlock to try this, but there are so many different blocks that I know nothing about. Kindly guide me on which blocks should be used in this situation. And if someone could provide pseudo code for this scenario, that would be really great; I would have something to start from.
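Roughly, my current sequential version looks like the sketch below (the method names are placeholders for my real functions, not my actual code):

```csharp
// Current approach: every step is awaited one after the other,
// so the files are processed strictly sequentially.
// (Method names below are placeholders.)
public async Task ProcessAllFilesAsync()
{
    await OpenDatabaseConnectionAsync();   // open the connection at the start
    await UpdateOperationTimeAsync();      // record the time of the operation

    for (int fileNumber = 1; fileNumber <= 100; fileNumber++)
    {
        UpdateUI($"File {fileNumber} processing");
        var excelData = await ReadExcelFileAsync(fileNumber);
        var processed = ProcessExcelData(excelData, fileNumber);
        await SaveToDatabaseAsync(processed, fileNumber);
        UpdateUI($"File {fileNumber} processed");
    }
}
```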
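For reference, here is roughly what my ActionBlock attempt looks like (the helper name is the same kind of placeholder as above):

```csharp
// The only Dataflow code I have so far: a single ActionBlock
// that I post file numbers into. (ReadExcelFileAsync is a placeholder.)
var readBlock = new ActionBlock<int>(async fileNumber =>
{
    var excelData = await ReadExcelFileAsync(fileNumber);
    // ... not sure how to connect the remaining steps from here
});

for (int fileNumber = 1; fileNumber <= 100; fileNumber++)
    readBlock.Post(fileNumber);
```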
Upvotes: 3
Views: 532
Reputation: 853
After studying TPL Dataflow I managed to gain a basic understanding of it and its blocks. I am sharing my understanding below in case someone else needs a head start.
TPL Dataflow is built on the TPL (Task Parallel Library), and its primary purpose is to implement producer/receiver (actor/agent) designs.
TPL Dataflow is made up of blocks, also known as dataflow blocks. The purpose of these dataflow blocks is to buffer, process and propagate data. Each block can be a receiver, a producer, or both.
Every block implements IDataflowBlock regardless of its purpose (receiver, producer). The purpose of this interface is to mark a class as a dataflow block. It also enables any block to shut down, either by completing successfully or by faulting, and it exposes a System.Threading.Tasks.Task (the Completion property) which represents the completion of the block asynchronously.
Furthermore, there are other interfaces which are used according to the block's purpose, i.e. receiver, producer or propagator. Producer (source) blocks implement ISourceBlock&lt;TOutput&gt;, receiver (target) blocks implement ITargetBlock&lt;TInput&gt;, and propagator blocks implement IPropagatorBlock&lt;TInput, TOutput&gt;.
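A minimal sketch of how those members are typically used (shown here with an ActionBlock, but every block exposes the same Complete(), Fault() and Completion members):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // requires the TPL Dataflow NuGet package

class CompletionDemo
{
    static async Task Main()
    {
        // Any block will do; ActionBlock is the simplest receiver.
        var printer = new ActionBlock<int>(n => Console.WriteLine($"Processing {n}"));

        for (int i = 1; i <= 5; i++)
            printer.Post(i);

        printer.Complete();       // signal that no more data is coming
        await printer.Completion; // Task that completes when the block has shut down
    }
}
```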
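For example (a small illustrative sketch, not tied to any particular scenario), a TransformBlock is a propagator, so it can be used through both the target and the source interface:

```csharp
using System.Threading.Tasks.Dataflow;

class InterfaceDemo
{
    static void Main()
    {
        // TransformBlock<int, string> implements IPropagatorBlock<int, string>:
        // it is a target for int and a source of string at the same time.
        IPropagatorBlock<int, string> block =
            new TransformBlock<int, string>(n => $"File {n}");

        ITargetBlock<int> target = block;     // the receiving side
        ISourceBlock<string> source = block;  // the producing side

        target.Post(42);                      // Post works on any ITargetBlock<T>
        source.LinkTo(DataflowBlock.NullTarget<string>()); // the source side links onward
        block.Complete();
    }
}
```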
Blocks can also be divided into categories (a small sketch using one block from each category follows the list):
>> Execution blocks: ActionBlock, TransformBlock, TransformManyBlock
>> Buffering blocks: BufferBlock, BroadcastBlock, WriteOnceBlock
>> Joining blocks: BatchBlock, JoinBlock, BatchedJoinBlock
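To make the categories concrete, here is a small illustrative sketch (not tied to any particular application) that links one buffering block, two execution blocks and one joining block into a pipeline:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class CategoryDemo
{
    static async Task Main()
    {
        // Buffering block: queues incoming work items.
        var buffer = new BufferBlock<int>();

        // Execution block: runs a delegate for each item, here up to 4 items in parallel.
        var process = new TransformBlock<int, string>(
            n => $"Item {n} processed",
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Joining block: groups results into batches of 10.
        var batch = new BatchBlock<string>(10);

        // Execution block acting as the terminal consumer.
        var save = new ActionBlock<string[]>(
            results => Console.WriteLine($"Consuming a batch of {results.Length} results"));

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        buffer.LinkTo(process, linkOptions);
        process.LinkTo(batch, linkOptions);
        batch.LinkTo(save, linkOptions);

        for (int i = 1; i <= 100; i++)
            buffer.Post(i);

        buffer.Complete();     // completion propagates down the chain
        await save.Completion; // wait for the last block to finish
    }
}
```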
Apart from these built-in blocks, custom blocks can also be written, but in most cases the built-in blocks serve the purpose. I could also describe the purpose of each block, but that would turn this answer into an article. This is my basic understanding, and I am still learning and exploring TPL Dataflow.
If someone wants an understanding of TPL Dataflow specifically for a data scraper, here is a sample dataflow block diagram to help understand the process better.
Source: https://petermeinl.wordpress.com/2012/10/13/a-webcrawler-demonstrating-the-beauty-of-tpl-dataflow/
Upvotes: 1