Web Crawler On Client Machine


which can be used for later search. The collecting agent includes a simple HTML parser, which can read any HTML file and fetch useful information such as the title, the pure text content without HTML tags, and the sub-links.

1. HTML parser

We have designed an HTML parser that scans web pages and fetches interesting items such as title, content, and links. Other functionality, such as discarding unnecessary items and restoring relative hyperlinks (partial-name links) to absolute hyperlinks (full-path links), is also handled by the HTML parser. During parsing, URLs are detected and added to a list passed to the downloader program. At this point exact duplicates are detected based on page contents, and links from pages found to be duplicates are ignored to preserve bandwidth. The parser does not remove all HTML tags: it cleans superfluous tags and leaves only the document structure, discarding information about colors, backgrounds, and fonts. The resulting files are typically 30% of the original size and retain most of the information needed for indexing.
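The parsing step above can be sketched with Python's standard library. The class and function names here are our own, and the SHA-1 fingerprint is one illustrative way to detect exact duplicates by page content; the paper does not specify the authors' implementation.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects title, pure text, and absolute sub-links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # absolute hyperlinks found on the page
        self.text_parts = []   # text content, without HTML tags
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Restore relative hyperlinks to absolute ones.
                    self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())

def parse_page(base_url, html):
    """Return (title, text, links, content fingerprint) for one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    text = " ".join(parser.text_parts)
    # Fingerprint of the text content; pages with an already-seen
    # fingerprint are exact duplicates and their links can be ignored.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return parser.title, text, parser.links, digest
```

A caller would keep a set of fingerprints already seen and skip enqueueing the links of any page whose fingerprint is in the set.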
2. Creating efficient multiple HTTP connections

Multiple concurrent HTTP connections are used to improve crawler performance. Each HTTP connection is independent of the others, so each connection can be used to download a page. A downloader is a high-performance asynchronous HTTP client capable of downloading hundreds of web pages in parallel. We use both multi-threaded and asynchronous downloaders. The asynchronous downloader is used when there is no congestion in the traffic, mainly in Internet-enabled applications and ActiveX controls, to provide a responsive user interface during file transfers. We have created multiple asynchronous downloaders, wherein each downloader works in parallel and downloads a page. The scheduler has been programmed to use multiple threads when the number of downloader objects exceeds a count of X.

Figure 1: High-level architecture of a standard Web crawler.
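One way to picture many independent downloaders running in parallel is with Python's asyncio. This is a sketch, not the paper's implementation: `fetch` here is a simulated stand-in for a real asynchronous HTTP request, and the `MAX_PARALLEL` cap is our own illustrative parameter.

```python
import asyncio

MAX_PARALLEL = 100  # illustrative cap on simultaneous connections

async def fetch(url):
    # Stand-in for a real asynchronous HTTP request; each connection
    # is independent of the others, so many can be in flight at once.
    await asyncio.sleep(0.01)
    return f"<html>page at {url}</html>"

async def download_all(urls):
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def bounded_fetch(url):
        async with sem:  # bound the number of open connections
            return url, await fetch(url)

    # All downloaders run concurrently; gather waits for every page.
    return dict(await asyncio.gather(*(bounded_fetch(u) for u in urls)))

pages = asyncio.run(
    download_all([f"http://example.com/{i}" for i in range(10)]))
```

Because each `bounded_fetch` yields while its "connection" is waiting, the event loop keeps all downloads progressing at once instead of serially.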
3. Scheduling algorithm

As we are using multiple downloaders, we propose a scheduling algorithm to use them efficiently. The design of the downloader scheduler is crucial: too many downloader objects will exhaust resources and make the system slow, while too few downloaders will degrade performance.

The scheduler algorithm is as follows:

1.  System allocates a pre-defined number of downloader objects.
2.  while Q is not empty do
3.      Dequeue u ∈ Q.
4.      Fetch the contents of URL u asynchronously.
5.      I = I ∪ {u}  {assign an index to the visited page; indexed pages are considered visited}
6.      Parse the downloaded HTML page for text and the links {u1, u2, u3, ...} it contains.
7.      for each ui ∈ {u1, u2, u3, ...} do
8.          if ui ∉ I and ui ∉ Q then
9.              Q = Q ∪ {ui}
10.         end if
11.     end for
12. end while

The downloader objects themselves are managed as follows:

4.  After a downloader object downloads the contents of a web page, set its status to free.
5.  If any downloader object runs longer than an upper time limit, abort it and set its status to free.
6.  If there are more than a predefined number of downloaders, or if all downloader objects are busy, allocate new threads and distribute the downloaders among them.
7.  Continue allocating new threads and free threads to the downloaders until the number of downloaders falls below the threshold value, provided the number of threads in use is kept under a limit.
8.  Goto 3.
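The crawl loop of the scheduler (steps 2-12) can be sketched in Python over an in-memory link graph. The function and variable names follow the pseudocode; `fetch_links(u)` is a hypothetical stand-in for "download page u and parse its links", and the toy graph is our own example data.

```python
from collections import deque

def crawl(seed, fetch_links):
    """Steps 2-12 of the scheduler: breadth-first crawl of the frontier.

    fetch_links(u) stands in for downloading u and parsing out its links.
    """
    Q = deque([seed])  # frontier queue of URLs to visit
    I = set()          # index of pages visited so far
    while Q:                        # 2.  while Q is not empty
        u = Q.popleft()             # 3.  dequeue u
        I.add(u)                    # 5.  I = I ∪ {u}: u is now visited
        for link in fetch_links(u):  # 6.  parse the page for links
            if link not in I and link not in Q:  # 8. unseen URL
                Q.append(link)      # 9.  Q = Q ∪ {link}
    return I

# Toy link graph standing in for the Web.
graph = {"a": ["b", "c"], "b": ["c", "a"], "c": [], "d": ["a"]}
visited = crawl("a", lambda u: graph.get(u, []))  # → {"a", "b", "c"}
```

Checking membership in both I and Q before enqueueing (step 8) is what keeps each page from being fetched more than once; page "d" is never visited because no crawled page links to it.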