Web Crawler On Client Machine
The collecting agent includes a simple HTML parser, which can read any HTML file and fetch useful information such as the title, the pure text content without HTML tags, and the sub-links; this parsed output can be used for later search. We have designed an HTML parser that scans web pages and fetches interesting items such as title, content and links. Other functionality, such as discarding unnecessary items and restoring relative hyperlinks (partial-name links) to absolute hyperlinks (full-path links), is also handled by the HTML parser. During parsing, URLs are detected and added to a list that is passed to the downloader program. At this point exact duplicates are detected based on page contents, and links from pages found to be duplicates are ignored to preserve bandwidth. The parser does not remove all HTML tags: it cleans superfluous tags and leaves only the document structure, while information about colors, backgrounds and fonts is discarded. The resulting files are typically 30% of the original size and retain most of the information needed for indexing. (A code sketch of such a parser is given at the end of this section.)

Figure 1: High-level architecture of a standard Web crawler.

2. Creating an efficient multiple HTTP connection

Multiple concurrent HTTP connections are used to improve crawler performance. Each HTTP connection is independent of the others, so every connection can be used to download a page. A downloader is a high-performance asynchronous HTTP client capable of downloading hundreds of web pages in parallel. We use both multi-threaded and asynchronous downloaders. The asynchronous downloader is used when there is no congestion in the traffic; it is employed mainly in Internet-enabled applications and ActiveX controls to provide a responsive user interface during file transfers. We have created multiple asynchronous downloaders, where each downloader works in parallel and downloads one page. The scheduler has been programmed to use multiple threads when the number of downloader objects exceeds a count of X.

3. Scheduling algorithm

As we are using multiple downloaders, we propose a scheduling algorithm to use them efficiently. The design of the downloader scheduler is crucial: too many downloader objects will exhaust resources and make the system slow, while too few downloaders will degrade performance. The scheduler algorithm is as follows:

1. The system allocates a pre-defined number of downloader objects.
4. After a downloader object downloads the contents of a web page, set its status to free.
5. If any downloader object runs longer than an upper time limit, abort it and set its status to free.
6. If there are more downloaders than the predefined number, or if all the other downloader objects are busy, then allocate new threads and distribute the downloaders among them.
7. Continue allocating new threads and free threads to the downloaders until the number of downloaders becomes less than the threshold value, provided the number of threads in use is kept under a limit.
8. Goto 3.

The basic crawling loop executed by the system is:

1. Q = {starting URLs}
2. while not empty Q do
3.   Dequeue u ∈ Q
4.   Fetch the contents of the URL u asynchronously
5.   I = I ∪ {u}   {assign an index to the page visited; pages indexed are considered visited}
6.   Parse the downloaded HTML page for text and the links {u1, u2, u3, ...} present
7.   for each ui ∈ {u1, u2, u3, ...} do
8.     if ui ∉ I and ui ∉ Q then
9.       Q = Q ∪ {ui}
10.      end if
11.    end for
12. end while
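As a concrete illustration of the parsing step, the sketch below uses Python's standard library to extract the title, the tag-free text and the sub-links of a page, restore relative hyperlinks to absolute ones, and compute a fingerprint of the text for exact-duplicate detection. The names PageParser and parse_page are ours, not the paper's; this is a minimal sketch of the idea, not the implementation described above.

import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Extracts the title, visible text, and absolute links from an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Restore relative hyperlinks to absolute (full-path) ones.
                    self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

def parse_page(base_url, html):
    parser = PageParser(base_url)
    parser.feed(html)
    text = " ".join(parser.text_parts)
    # Fingerprint of the tag-free text, usable to detect exact duplicates.
    fingerprint = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return parser.title.strip(), text, parser.links, fingerprint

Hashing only the tag-free text means two pages that differ solely in markup (colors, fonts, backgrounds) are still recognized as duplicates, which matches the parser's policy of discarding presentational information before indexing.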
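The multiple-HTTP-connection downloader can be approximated with a thread pool, substituting threads for the ActiveX-style asynchronous clients of the original design; this is a sketch under that substitution, not the paper's code. MAX_DOWNLOADERS plays the role of the predefined downloader count and TIME_LIMIT the upper time limit of scheduler step 5; both values are assumptions.

import concurrent.futures
import urllib.request

TIME_LIMIT = 30        # upper time limit per download, in seconds (assumed value)
MAX_DOWNLOADERS = 50   # predefined number of downloader objects (assumed value)

def download(url):
    """One downloader: fetches a single page over its own HTTP connection."""
    # The timeout aborts a stalled downloader, approximating scheduler step 5.
    with urllib.request.urlopen(url, timeout=TIME_LIMIT) as response:
        return url, response.read().decode("utf-8", errors="replace")

def download_all(urls):
    """Downloads many pages in parallel, one thread per busy downloader."""
    pages = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_DOWNLOADERS) as pool:
        futures = {pool.submit(download, u): u for u in urls}
        for future in concurrent.futures.as_completed(futures):
            try:
                url, body = future.result()
                pages[url] = body         # downloader finished: its slot is free again
            except Exception:
                pass                      # a timed-out or failed downloader is abandoned
    return pages

The executor reuses a bounded set of worker threads across downloads, so the number of live threads stays under a limit even when the number of queued downloads is large, which is the intent of scheduler steps 6 and 7.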
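The crawling loop pseudocode translates almost line-for-line into Python. The only liberty taken in this sketch is a seen set recording every URL ever enqueued, which implements the test "ui ∉ I and ui ∉ Q" without a linear scan of the queue:

from collections import deque

def crawl(starting_urls, fetch, parse):
    """Basic crawling loop: Q is the URL queue, I the set of indexed
    (visited) pages. fetch(u) returns the page body; parse(u, body)
    returns the links found in it."""
    Q = deque(starting_urls)     # 1. Q = {starting URLs}
    I = set()                    # pages indexed are considered visited
    seen = set(starting_urls)    # every URL ever placed in Q
    while Q:                     # 2. while not empty Q do
        u = Q.popleft()          # 3. dequeue u from Q
        body = fetch(u)          # 4. fetch the contents of the URL
        I.add(u)                 # 5. I = I ∪ {u}
        for ui in parse(u, body):    # 6-7. for each link ui on the page
            if ui not in seen:       # 8. equivalent to: ui ∉ I and ui ∉ Q
                Q.append(ui)         # 9. Q = Q ∪ {ui}
                seen.add(ui)
    return I

Using download from the previous sketch as fetch, and a wrapper around parse_page as parse, gives a minimal single-threaded crawler; in the parallel version, the scheduler steps above govern how many downloader threads serve the fetch calls.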