Web Spider Code

棋子 posted on 2007-7-6 04:14:13
This is the last part of our tutorial, in which we explain the key parts of the web spider program's code.


"C_3OXO.vb" Class

This is the main class of the web spider application. It contains all the data and code needed to run the web crawling process. At the top of the class file, add the following Imports statements.

Imports mshtml
Imports System.Net
Imports System.IO
Imports System.Data.OleDb
Imports System.Data


Then add the following declarations.

Public Class C_3OXO

   Dim DBFilePN As String
   Dim DBC As c_3OXO_DB
   Dim UrlsT As DataTable
   Public MLevel As Integer
   Public TimOut As Integer
   Public F1 As Form1

DBFilePN : a private string that holds the path name of the newly created database file.
DBC : a private variable of type C_3OXO_DB, which is the database class.
UrlsT : a private DataTable that keeps the list of URLs in memory (the crawler frontier). As the code further down shows, it has three columns: ID, Href, and Status.
MLevel : a public integer that gives the deepest tree level the spider is allowed to reach. A value of -1 means there is no limit.
TimOut : a public integer that gives the maximum time, in seconds, the web browser control may spend loading the current URL before the spider gives up on it.
F1 : a public variable of type Form1. It is used to refer to the Form1 object.

Go Spider

  
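   Public Sub GoSpider(ByVal URL As String, ByVal DBF As String)

      DBC = New c_3OXO_DB
      DBFilePN = DBF

      ' Copy the template database to the output path, then open it
      CreateDBOutputFile(DBFilePN)
      DBC.Initial(DBFilePN)

      ' Crawl the web site and store its pages
      GetWebSite(URL)

   End Sub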

This is the main public subroutine of the class, and it is the one we call to start the whole process. It receives two parameters: the web site URL, for example "http://www.google.com", and the name of the database file in which the user wants to store the web site data. Its first line creates an instance of the database class and assigns it to the "DBC" variable. It then assigns the input parameter "DBF" to the class variable "DBFilePN" so that the path is available to all class methods. Next it creates the output database file by calling "CreateDBOutputFile" with the "DBFilePN" class variable, and initializes the database class instance by calling its "Initial" method. Finally it calls the "GetWebSite" method, which does the actual crawling work, as we will see later. Note that the Boolean result of "CreateDBOutputFile" is not checked here.

Create Output Database File

   Private Function CreateDBOutputFile(ByVal odbfpn As String) As Boolean

      Dim s As String()
      Dim TS As String

      Try
         ' s(0) is the file name of the running executable
         s = Environment.GetCommandLineArgs()
         ' Keep the folder part (up to the last "\") and append the template file name
         TS = s(0).Substring(0, s(0).LastIndexOf("\") + 1) & DBC.DBTemplateFPN
         IO.File.Copy(TS, odbfpn)
      Catch ex As Exception
         Return False
      End Try

      Return True

   End Function

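The main purpose of this private function is to find the folder of the application EXE file and copy the template database file stored there to the newly specified database path name. It relies on the "Environment.GetCommandLineArgs" method, which returns an array of strings whose first element is the executable file name (normally including its path). The function cuts that string after the last "\" to get the folder, appends the template file name, and copies the file. It returns False if anything goes wrong, for example if the template file is missing or the output file already exists.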
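The same lookup can also be written without manual string slicing. The following is only a minimal sketch under stated assumptions: the module name "TemplateLocator" and the helpers "BuildTemplatePath" and "CopyTemplate" are hypothetical and are not part of the original program. It uses "AppDomain.CurrentDomain.BaseDirectory" for the application folder and "Path.Combine" to join the folder and the file name.

```vb
Imports System
Imports System.IO

Module TemplateLocator

    ' Hypothetical helper: joins a folder and a template file name.
    Public Function BuildTemplatePath(ByVal exeFolder As String, ByVal templateName As String) As String
        Return Path.Combine(exeFolder, templateName)
    End Function

    ' Hypothetical helper: copies the template stored in the application
    ' folder to targetPath. Returns False instead of throwing, which
    ' mirrors the behavior of CreateDBOutputFile.
    Public Function CopyTemplate(ByVal templateName As String, ByVal targetPath As String) As Boolean
        Try
            Dim source As String = BuildTemplatePath(AppDomain.CurrentDomain.BaseDirectory, templateName)
            File.Copy(source, targetPath)   ' throws if the target already exists
            Return True
        Catch ex As Exception
            Return False
        End Try
    End Function

End Module
```

Whether this is preferable depends on how the program is launched; the tutorial's own approach works as long as the first command-line element carries the full path.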

Get Web Site

   Private Sub GetWebSite(ByVal URL As String)

      F1.SendMessage("Initialization ...")
      InitializeSiteTable(URL)

      F1.SendMessage("Gets all URLs ...")
      GetWebSiteAllURLs()

      F1.SendMessage("Saves URLs to database ...")
      SaveWebSite()

   End Sub

You can think of this method as the conductor of the whole crawling operation. It tells the user interface which state the crawl is in, initializes the in-memory table of site URLs (as we will see later), starts the process that collects all the links in the given web site, and finally starts saving the collected pages to the database file.

Initialize in-memory URLs Table

   Private Sub InitializeSiteTable(ByVal URL As String)

      UrlsT = New DataTable

      ' ID: auto-incrementing key of the row
      Dim ID As New DataColumn
      ID.AllowDBNull = False
      ID.ColumnName = "ID"
      ID.DataType = GetType(System.Int32)
      ID.Unique = True
      ID.AutoIncrement = True
      UrlsT.Columns.Add(ID)

      ' Href: the URL itself; Unique rejects duplicates
      Dim Href As New DataColumn
      Href.AllowDBNull = False
      Href.ColumnName = "Href"
      Href.DataType = GetType(System.String)
      Href.Unique = True
      UrlsT.Columns.Add(Href)

      ' Status: True once the URL has been visited
      Dim Status As New DataColumn
      Status.ColumnName = "Status"
      Status.DataType = GetType(System.Boolean)
      UrlsT.Columns.Add(Status)

      ' One-element array: in VB the number is the upper bound, not the length
      Dim PKeys(0) As DataColumn
      PKeys(0) = ID
      UrlsT.PrimaryKey = PKeys

      ' Seed row: the web site URL, not visited yet
      Dim TRow As DataRow
      TRow = UrlsT.NewRow
      TRow.Item(1) = URL
      TRow.Item(2) = False
      UrlsT.Rows.Add(TRow)

   End Sub

This subroutine creates a new DataTable instance and configures it as follows. It defines three columns: ID, Href, and Status. The "ID" column is the auto-incrementing primary key of the table. The "Href" column holds the URL of a link and is marked unique. The "Status" column is a Boolean that indicates whether the link has been visited. Finally, a new row containing the web site URL is added to the table. You can think of this table as the crawler's URL list: the first row, which contains the web site URL, is the seed of the crawler, and the URLs added later make up its frontier. We use this in-memory structure rather than, for example, a simple array or list in order to benefit from the uniqueness check that the table provides.
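To make the uniqueness check concrete, here is a small stand-alone sketch. It builds a table with the same three columns and shows that adding a URL twice raises a "ConstraintException". The module name "FrontierTable" and the helper "TryAddUrl" are hypothetical names chosen for this illustration and do not appear in the original program.

```vb
Imports System
Imports System.Data

Module FrontierTable

    ' Builds a table shaped like UrlsT: ID (auto-increment key), Href (unique), Status.
    Public Function CreateTable() As DataTable
        Dim t As New DataTable()

        Dim id As DataColumn = t.Columns.Add("ID", GetType(Integer))
        id.AutoIncrement = True
        id.AllowDBNull = False

        Dim href As DataColumn = t.Columns.Add("Href", GetType(String))
        href.AllowDBNull = False
        href.Unique = True

        t.Columns.Add("Status", GetType(Boolean))
        t.PrimaryKey = New DataColumn() {id}
        Return t
    End Function

    ' Hypothetical helper: returns True if the URL was new,
    ' False if the unique constraint on Href rejected it.
    Public Function TryAddUrl(ByVal t As DataTable, ByVal url As String) As Boolean
        Dim row As DataRow = t.NewRow()
        row("Href") = url
        row("Status") = False
        Try
            t.Rows.Add(row)
            Return True
        Catch ex As ConstraintException
            Return False
        End Try
    End Function

End Module
```

The spider gets the same effect with a general Try/Catch around "UrlsT.Rows.Add"; catching only "ConstraintException", as assumed here, would let other errors surface.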

Get All URLs

   Private Sub GetWebSiteAllURLs()

      Dim i As Integer
      Dim TS As TimeSpan = TimeSpan.FromSeconds(TimOut)
      Dim Rows() As DataRow
      'The current level of the web site tree
      Dim CLevel As Integer = 0

      Do
         'With no depth limit (MLevel = -1) keep CLevel below MLevel,
         'otherwise increment the current level value by one
         If MLevel = -1 Then CLevel = MLevel - 1 Else CLevel += 1
         'All rows that are not visited yet
         Rows = UrlsT.Select("status = false")
         For i = 0 To Rows.Length - 1
            Try
               F1.AdvanceProgressbar()
               F1.SendMessage("Gets URLs: " + Rows(i).Item(1) + "  ...")
               GetWebPageURLs(Rows(i).Item(1), TS)
               ' set the status of the row to true
               UrlsT.Rows.Find(Rows(i).Item(0)).Item(2) = True
            Catch ex As Exception
               'Errors on a single page are ignored
            End Try
         Next
      Loop While Rows.Length <> 0 And CLevel < MLevel

   End Sub

The algorithm behind this subroutine uses a variable "CLevel" that holds the current working level in the web site tree. It starts at zero, which stands for the top level of the tree, that is, the web site address (URL) itself. The subroutine then enters a loop that does the following:
1. Set CLevel. If MLevel = -1, the spider has no depth limit and traverses the web site until it finds no new URLs to visit, so CLevel is set to MLevel - 1, which keeps it below MLevel. Otherwise CLevel is incremented by one.
2. Extract all the rows in the URLs table that have a status of false (not visited yet).
3. For each row extracted in step 2, get the URLs in the page that the row represents, then change the status of the row to true.
The loop goes back to step 1 until there are no unvisited rows left or the current level reaches the maximum allowed level.
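The loop above can be tried without a browser by replacing the page fetch with a lookup in an in-memory link map. This is a sketch under that assumption only: the module "LevelCrawl", the function "Discover", and the "links" dictionary are hypothetical stand-ins, while the level handling follows the steps described above.

```vb
Imports System
Imports System.Collections.Generic

Module LevelCrawl

    ' links maps a page URL to the URLs found on it (a stand-in for a real fetch).
    ' maxLevel = -1 means "no limit", as MLevel does in the tutorial.
    Public Function Discover(ByVal seed As String, _
                             ByVal links As Dictionary(Of String, List(Of String)), _
                             ByVal maxLevel As Integer) As List(Of String)
        Dim known As New List(Of String)()        ' every URL seen so far, in order
        Dim visited As New HashSet(Of String)()   ' URLs whose links were extracted
        known.Add(seed)

        Dim level As Integer = 0
        Dim pending As List(Of String)
        Do
            If maxLevel <> -1 Then level += 1
            ' Snapshot of the URLs not visited yet: one tree level per pass
            pending = known.FindAll(Function(u) Not visited.Contains(u))
            For Each page As String In pending
                visited.Add(page)
                Dim found As List(Of String) = Nothing
                If links.TryGetValue(page, found) Then
                    For Each target As String In found
                        ' Skip duplicates, as the unique Href column does
                        If Not known.Contains(target) Then known.Add(target)
                    Next
                End If
            Next
        Loop While pending.Count <> 0 AndAlso (maxLevel = -1 OrElse level < maxLevel)
        Return known
    End Function

End Module
```

As in the tutorial's code, the URLs found on the last processed level are still recorded even though they are not visited for further links.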

Get Web Page URLs

   Private Sub GetWebPageURLs(ByVal url As String, ByVal TS As TimeSpan)

      Dim Doc As mshtml.HTMLDocument

      Doc = Navigate2WebPage(url, TS)
      If Doc Is Nothing Then Return

      ' Get all URLs in the current doc
      Dim AnchorsArr As IHTMLElementCollection = Doc.links
      Dim Anchor As IHTMLAnchorElement

      'Add each anchor to the URLS table
      For Each Anchor In AnchorsArr
         Dim NRow As DataRow
         NRow = UrlsT.NewRow
         Try
            NRow.Item(1) = Anchor.href
            NRow.Item(2) = False
            UrlsT.Rows.Add(NRow)
         Catch ex As Exception
         End Try
      Next

   End Sub

This subroutine takes a URL and a timeout interval. It navigates the web browser control to the URL and assigns the returned document to an HTMLDocument variable; if no document is returned, it exits. It then reads the links collection of the document, walks through it, and adds each link to the in-memory URLs table with a status of false. Because the "Href" column is unique, adding a URL that is already in the table raises an exception, which the empty Catch block ignores. This is how duplicate links are skipped.

Navigate to a Web Page

   Private Function Navigate2WebPage(ByVal URL As String, ByVal TimeoutInterv _
     As TimeSpan) As HTMLDocument

      Dim T1, T2 As Date
      Dim Interv As TimeSpan

      Try

         F1.AxWebBrowser1.Navigate2(URL)
         T1 = Now()

         ' Poll until the page is completely loaded or the timeout is exceeded
         Do While (F1.AxWebBrowser1.ReadyState <> SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE)
            Application.DoEvents()
            T2 = Now
            Interv = T2.Subtract(T1)
            If TimeSpan.Compare(Interv, TimeoutInterv) = 1 Then Return Nothing
         Loop

      Catch ex As Exception
         Return Nothing
      End Try

      Return F1.AxWebBrowser1.Document

   End Function

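This function navigates the web browser control to the given URL and waits until the document is completely loaded, which it detects by testing the ready state of the control. While waiting, it calls "Application.DoEvents" so that pending window messages are processed, and it measures the elapsed time. If that time exceeds the timeout interval, or if an exception occurs, the function returns Nothing. Otherwise it returns the web browser document.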
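The wait-with-timeout pattern can be isolated from the browser control. The sketch below is an illustration only: "Polling" and "WaitUntil" are hypothetical names, the readiness test is passed in as a delegate, and a "Stopwatch" replaces the two "Now" readings so that changes to the system clock do not affect the measurement.

```vb
Imports System
Imports System.Diagnostics
Imports System.Threading

Module Polling

    ' Hypothetical helper: polls isReady until it returns True or the
    ' timeout elapses. Returns True on success and False on timeout.
    Public Function WaitUntil(ByVal isReady As Func(Of Boolean), ByVal timeout As TimeSpan) As Boolean
        Dim clock As Stopwatch = Stopwatch.StartNew()
        Do While Not isReady()
            If clock.Elapsed > timeout Then Return False
            ' In the Windows Forms spider this is the place where
            ' Application.DoEvents() runs instead of a sleep.
            Thread.Sleep(10)
        Loop
        Return True
    End Function

End Module
```

In the spider itself the condition would be the ready-state test of the browser control, and the sleep would have to stay a call to "Application.DoEvents", because the control loads the page on the same thread that is waiting.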

At this stage the program has done everything needed to collect the URLs in the given web site and store them in the in-memory URLs table. The following methods take this table and visit each URL in it in turn, to get the HTML text of the web page and store it in the database.

Get Web Page

   Private Function GetWebPage(ByVal URL As String) As String

      Dim myWebRequest As WebRequest
      Dim myWebResponse As WebResponse

      Try
         ' Create a new 'WebRequest' object to the mentioned URL.
         myWebRequest = WebRequest.Create(URL)
         ' The response object of 'WebRequest' is assigned to a 'WebResponse' variable.
         myWebResponse = myWebRequest.GetResponse()

      Catch ex As Exception
         Return "ERROR!"
      End Try

      Dim RString As String

      Try
         ' Read the whole response body as text
         Dim streamResponse As Stream = myWebResponse.GetResponseStream()
         Dim SReader As New StreamReader(streamResponse)
         RString = SReader.ReadToEnd

         streamResponse.Close()
         SReader.Close()
         myWebResponse.Close()

      Catch ex As Exception
         Return "ERROR!"
      End Try

      Return RString

   End Function

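This function takes a URL as input and returns a string containing the HTML text of that page. It does so with the "WebRequest" and "WebResponse" classes: it creates a request for the URL, gets the response, and reads the response stream to its end. If the request or the read fails, it returns an error marker string instead of the page text.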
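One thing to note is that the response and the reader are closed only when no exception occurs. A possible variation, shown here as a sketch and not as the tutorial's method, wraps them in "Using" blocks so that they are released on every path. The module "PageReader" and the functions "ReadAllText" and "Download" are hypothetical names, and returning Nothing on failure is an assumption of this sketch that differs from the marker string used above.

```vb
Imports System
Imports System.IO
Imports System.Net

Module PageReader

    ' Hypothetical helper: reads a whole stream as text and always releases it.
    Public Function ReadAllText(ByVal source As Stream) As String
        Using reader As New StreamReader(source)
            Return reader.ReadToEnd()
        End Using   ' disposing the reader also closes the underlying stream
    End Function

    ' Hypothetical helper: returns the page text, or Nothing on any failure.
    Public Function Download(ByVal url As String) As String
        Try
            Using response As WebResponse = WebRequest.Create(url).GetResponse()
                Return ReadAllText(response.GetResponseStream())
            End Using
        Catch ex As Exception
            Return Nothing
        End Try
    End Function

End Module
```

With this variation the caller has to test the result for Nothing before saving it.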

Save Web Site

   Private Sub SaveWebSite()

      Dim i As Integer
      Dim str As String

      For i = 0 To UrlsT.Rows.Count - 1
         F1.AdvanceProgressbar()
         str = UrlsT.Rows(i).Item(1)
         F1.SendMessage("Saves to database: " + str + " ...")
         SaveWebPage(str, GetWebPage(str))
      Next

   End Sub

This subroutine traverses the URLs table. For each row it gets the URL, downloads the HTML text of that page, and saves both the URL and the page text to the database file.

Save Web Page

   Private Function SaveWebPage(ByVal URL As String, ByVal Page As String) As Integer

      Return DBC.Insert(URL, Page)

   End Function

This function saves the given URL and page string to the database using the "Insert" method of the database class.

"Form1.vb" Class

The button click handler method

In the button click event handler method, some checks are first carried out on the URL and the database file name typed by the user.

      Dim cls As New C_3OXO

      cls.F1 = Me
      cls.TimOut = Integer.Parse(Me.Tb_TimeOut.Text)

      If Me.CB_MLevel.Checked Then
         cls.MLevel = Integer.Parse(Me.Tb_MLevel.Text)
      Else
         cls.MLevel = -1
      End If

Then a new instance of the C_3OXO class is created, and its public variables are assigned as shown in the above code. If the maximum level check box is not checked, MLevel is set to -1, which means there is no depth limit.

        cls.GoSpider(Me.TB_URL.Text, Me.TB_DBFile.Text)

Then the "GoSpider" method is called to start the whole crawling process.
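The input checks themselves are not listed in this tutorial. As a rough idea of what such checks might look like, the sketch below validates a URL with "Uri.TryCreate" and a numeric text box value with "Integer.TryParse", which, unlike "Integer.Parse", does not throw on bad input. The module "InputChecks" and both function names are hypothetical and are not taken from the original program.

```vb
Imports System

Module InputChecks

    ' Hypothetical check: accepts only absolute http or https URLs.
    Public Function IsCrawlableUrl(ByVal text As String) As Boolean
        Dim parsed As Uri = Nothing
        If Not Uri.TryCreate(text, UriKind.Absolute, parsed) Then Return False
        Return parsed.Scheme = Uri.UriSchemeHttp OrElse parsed.Scheme = Uri.UriSchemeHttps
    End Function

    ' Hypothetical check: parses a non-negative integer without throwing.
    Public Function TryReadNonNegative(ByVal text As String, ByRef value As Integer) As Boolean
        Return Integer.TryParse(text, value) AndAlso value >= 0
    End Function

End Module
```

A handler could call these on the text box values before creating the C_3OXO instance and show a message instead of starting the crawl when either check fails.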

That is all.

To download the complete program, just click here.