ShayanKM
ShayanKM

Reputation: 571

Tips to optimize this .NET crawler algorithm

I'm writing something like a web crawler that it's engine follows these steps:

  1. Reading Rss Link(Argument)
  2. Defining a list(of) Rss Items
  3. Checking each link existence in database(SQL SERVER) by a separate query
  4. If the link was new one it will insert the fields to DB by a separate query

    Public Sub MyTickHandler()
        Dim NewItems As New List(Of Structures.RSSItem)
        Dim founded As Boolean = False
    
        NewItems = RssReader.ParseRssFile(RssURL)
    
        Dim connString = Configs.NewsDBConnection
        Dim myConnection As SqlConnection = New SqlConnection("Server=localhost;Database=db;Integrated Security=SSPI;;Connection Timeout=45;Max Pool Size= 300")
        myConnection.Open()
    
        For Each item In NewItems
            Dim cmdString As String = "SELECT id FROM posts  with (nolock) WHERE link LIKE '" & item.link.Trim.ToLower & "'"
            Dim TheCommand As SqlCommand = New SqlCommand(cmdString, myConnection)
            Dim result = TheCommand.ExecuteScalar()
            If result Is Nothing Then
                TheCommand = New SqlCommand("INSERT INTO posts (link) VALUES ('" & item.link.ToLower.Trim & "')")
                TheCommand.Connection = myConnection
                TheCommand.ExecuteNonQuery()
    
                TheCommand = New SqlCommand("INSERT INTO queue (link,descrip,site,title,category) VALUES ('" & item.link.ToLower.Trim & "','" & StringToBase64(item.description) & "','" & RssSite & "','" & StringToBase64(item.title) & "','" & RssCategory & "')")
                TheCommand.Connection = myConnection
                TheCommand.ExecuteNonQuery()
            End If
            TheCommand.Dispose()
        Next
    
        myConnection.Close()
        myConnection.Dispose()
        SqlConnection.ClearPool(myConnection)
    
     End Sub
    

This works perfect for single calling.
but I have something about 150 Rss links and I should Check each of them every 2 minutes by threading, so by increasing the mount of SQL Queries, this process and also sql server won't response and application crashes!!

I tried some tips like increasing sql server response timeout, but it didn't help at all.

Any better way or tips for this process?
Thanks

Upvotes: 1

Views: 117

Answers (2)

Dan Guzman
Dan Guzman

Reputation: 46233

I suggest you pass a table-valued parameter to a stored procedure for this task. That will allow the entire list to be inserted in a single call. Below is an example you can tweak for your actual column lengths. It is important to have an index on the link column of the posts table. I assume link is unique in this example.

T-SQL to create table type and proc:

CREATE TYPE dbo.linkInfo AS TABLE(
     link varchar(255) NOT NULL PRIMARY KEY
    ,descrip varchar(255)
    ,title varchar(255)
    );
GO

ALTER PROC dbo.usp_InsertRssItems
     @site varchar(255)
    ,@category varchar(255)
    ,@linkInfo dbo.linkInfo READONLY
AS

SET NOCOUNT ON;

DECLARE @InsertedPosts TABLE(link varchar(255));

INSERT INTO dbo.posts(link)
OUTPUT inserted.link INTO @InsertedPosts
SELECT link
FROM @linkInfo AS li
WHERE NOT EXISTS(
    SELECT *
    FROM dbo.posts AS p
    WHERE p.link = li.link
    );

INSERT INTO dbo.queue(link,descrip,site,title,category)
SELECT li.link, li.descrip, @site,li. title, @category
FROM @linkInfo AS li
WHERE EXISTS(
    SELECT *
    FROM @InsertedPosts AS ip
    WHERE ip.link = li.link
    );
GO

Sample VB.NET code:

Sub MyTickHandler()

    Dim NewItems As New List(Of Structures.RssItem)
    Dim founded As Boolean = False

    NewItems = RssReader.ParseRssFile(RssURL)

    Dim dt = getNewRssItemDataTable(NewItems)

    Dim connString = Configs.NewsDBConnection
    Dim myConnection As SqlConnection = New SqlConnection("Server=localhost;Database=db;Integrated Security=SSPI;;Connection Timeout=45;Max Pool Size= 300")
    Dim TheCommand As SqlCommand = New SqlCommand("dbo.usp_InsertRssItems", myConnection)
    TheCommand.Parameters.Add(New SqlParameter("@site", SqlDbType.VarChar, 255)).Value = "z"
    TheCommand.Parameters.Add(New SqlParameter("@category", SqlDbType.VarChar, 255)).Value = "z"
    TheCommand.Parameters.Add(New SqlParameter("@linkInfo", SqlDbType.Structured)).Value = dt
    TheCommand.CommandType = CommandType.StoredProcedure

    myConnection.Open()
    TheCommand.ExecuteNonQuery()

    myConnection.Close()
    myConnection.Dispose()

End Sub

Private Function getNewRssItemDataTable(NewRssItems As List(Of Structures.RssItem)) As DataTable

    Dim dt As New DataTable
    dt.Columns.Add("link", GetType(String)).MaxLength = 255
    dt.Columns.Add("descrip", GetType(String)).MaxLength = 255
    dt.Columns.Add("title", GetType(String)).MaxLength = 255

    For Each NewRssItem In NewRssItems
        Dim row = dt.NewRow
        dt.Rows.Add(row)
        row(0) = NewRssItem.link
        row(1) = NewRssItem.description
        row(2) = NewRssItem.title

    Next NewRssItem

    Return dt

End Function

EDIT:

I see you mentioned you would like a SqlBulkCopy example. If inserts are unconditional, you can use this technique:

Sub executeBulkInsert(connectionString As String, site As String, category As String, NewRssItems As List(Of Structures.RssItem))

    Dim dt As New DataTable

    dt.Columns.Add("link", GetType(String)).MaxLength = 255
    dt.Columns.Add("descrip", GetType(String)).MaxLength = 255
    dt.Columns.Add("site", GetType(String)).MaxLength = 255
    dt.Columns.Add("title", GetType(String)).MaxLength = 255
    dt.Columns.Add("category", GetType(String)).MaxLength = 255

    For Each NewRssItem In NewRssItems
        Dim row = dt.NewRow
        dt.Rows.Add(row)
        row(0) = site
        row(1) = category
        row(2) = NewRssItem.link
        row(3) = NewRssItem.description
        row(4) = NewRssItem.title

    Next NewRssItem

    Dim bcp = New SqlBulkCopy(connectionString)
    bcp.DestinationTableName = "dbo.queue"

    bcp.WriteToServer(dt)

End Sub

Upvotes: 1

tofi9
tofi9

Reputation: 5853

  • Only do one single fetch, outside the for-each-loop:

SELECT id, link FROM posts with (nolock) WHERE link in (@listOfLowerCaseLinks)

Dim myListOfLinks As New List(Of String)
...
TheCommand.Parameters.AddWithValue("@listOfLowerCaseLinks", myListOfLinks)
  • Wrap the whole action of inserts (the whole for-each-loop) into a sql transaction. That way, the database won't have to commit in-between.

Upvotes: 1

Related Questions