The web crawler has long been an interesting topic of discussion for students of Computer Science and IT. Whenever you run a query on Google, along with the results you also get a statistic just below the search bar, something like "132,000 results in 0.23 seconds". Ever wondered how Google can query its search database so fast? The answer is the web crawler. Google, like any other search engine, uses a web crawler to traverse the World Wide Web in an orderly fashion and store a copy of each web page in its database, organized so that the database can be queried at astonishing speed when results are needed.
Get the user's input: the starting URL and the desired
file type. Add the URL to the currently empty list of
URLs to search. While the list of URLs to search is
not empty,
{
Get the first URL in the list.
Move the URL to the list of URLs already searched.
Check the URL to make sure its protocol is HTTP
(if not, break out of the loop, back to "While").
See whether there's a robots.txt file at this site
that includes a "Disallow" statement.
(If so, break out of the loop, back to "While".)
Try to "open" the URL (that is, retrieve
that document from the Web).
If it's not an HTML file, break out of the loop,
back to "While."
Step through the HTML file. While the HTML text
contains another link,
{
Validate the link's URL and make sure robots are
allowed (just as in the outer loop).
If it's an HTML file,
If the URL isn't present in either the to-search
list or the already-searched list, add it to
the to-search list.
Else if it's the type of the file the user
requested,
Add it to the list of files found.
}
}
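The robots.txt check in the steps above can be sketched as a small helper. This is a deliberate simplification, not a full robots.txt parser: it only looks for the blanket "Disallow: /" rule mentioned in the pseudocode, and the class and method names (RobotsCheck, isDisallowed) are illustrative, not part of any library. A real crawler would group rules by User-agent and match path prefixes.

```java
public class RobotsCheck {

    // Sketch: given the raw text of a site's robots.txt file, decide
    // whether crawling the whole site is disallowed. Handles only the
    // blanket "Disallow: /" case from the pseudocode above.
    static boolean isDisallowed(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            if (line.trim().equalsIgnoreCase("Disallow: /")) return true;
        }
        return false;
    }
}
```

Taking the file contents as a String keeps the logic testable without any network access; the caller would fetch http://host/robots.txt first and skip the site if the fetch succeeds and this check returns true.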
/******************************************************************************
 *  Compilation:  javac WebCrawler.java In.java
 *  Execution:    java WebCrawler url
 *
 *  Downloads the web page and prints out all urls on the web page.
 *  Gives an idea of how Google's spider crawls the web. Instead of
 *  looking for hyperlinks, we just look for patterns of the form:
 *  http:// followed by an alternating sequence of alphanumeric
 *  characters and dots, ending with a sequence of alphanumeric
 *  characters.
 *
 *  % java WebCrawler http://www.slashdot.org
 *  http://www.slashdot.org
 *  http://www.osdn.com
 *  http://sf.net
 *  http://thinkgeek.com
 *  http://freshmeat.net
 *  http://newsletters.osdn.com
 *  http://slashdot.org
 *  http://osdn.com
 *  http://ads.osdn.com
 *  http://sourceforge.net
 *  http://www.msnbc.msn.com
 *  http://www.rhythmbox.org
 *  http://www.apple.com
 *
 *  % java WebCrawler http://www.cs.princeton.edu
 *  http://www.cs.princeton.edu
 *  http://www.w3.org
 *  http://maps.yahoo.com
 *  http://www.princeton.edu
 *  http://www.Princeton.EDU
 *  http://ncstrl.cs.Princeton.EDU
 *  http://www.genomics.princeton.edu
 *  http://www.math.princeton.edu
 *  http://libweb.Princeton.EDU
 *  http://libweb2.princeton.edu
 *  http://www.acm.org
 *
 *  Instead of setting the system property in the code, you could do it
 *  from the command line:
 *  % java -Dsun.net.client.defaultConnectTimeout=250 WebCrawler http://www.cs.princeton.edu
 ******************************************************************************/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Note: In, Queue, and SET are the Princeton stdlib/algs4 classes
// (hence "javac WebCrawler.java In.java" above); they are not part
// of the standard Java library.
public class WebCrawler {

    public static void main(String[] args) {

        // timeout connection after 500 milliseconds
        System.setProperty("sun.net.client.defaultConnectTimeout", "500");
        System.setProperty("sun.net.client.defaultReadTimeout",    "1000");

        // initial web page
        String s = args[0];

        // list of web pages to be examined
        Queue<String> q = new Queue<String>();
        q.enqueue(s);

        // existence symbol table of examined web pages
        SET<String> set = new SET<String>();
        set.add(s);

        // breadth first search crawl of web
        while (!q.isEmpty()) {
            String v = q.dequeue();
            System.out.println(v);
            In in = new In(v);

            // only needed in case website does not respond
            if (!in.exists()) continue;
            String input = in.readAll();

            /******************************************************************
             *  Find links of the form: http://xxx.yyy.zzz
             *  \\w+ for one or more alpha-numeric characters
             *  \\.  for dot
             *  could take first two statements out of loop
             ******************************************************************/
            String regexp = "http://(\\w+\\.)*(\\w+)";
            Pattern pattern = Pattern.compile(regexp);
            Matcher matcher = pattern.matcher(input);

            // find and print all matches
            while (matcher.find()) {
                String w = matcher.group();
                if (!set.contains(w)) {
                    q.enqueue(w);
                    set.add(w);
                }
            }
        }
    }
}
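To see what the crawler's regular expression actually captures, the matching step can be isolated into a small, self-contained demo. The pattern is the same one the program uses; the LinkExtractor class name is purely illustrative. Note the pattern's limitation: it matches only "http://" followed by a dot-separated host name, so paths, ports, and https:// links are cut off or missed entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Same pattern as the crawler: "http://" followed by dot-separated
    // runs of alphanumeric characters. Matching stops at the first
    // character that is neither a word character nor a dot, so a URL
    // like http://sf.net/projects yields only http://sf.net.
    static final Pattern URL_PATTERN =
        Pattern.compile("http://(\\w+\\.)*(\\w+)");

    // Collect every match found in the given page text.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<String>();
        Matcher matcher = URL_PATTERN.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }
}
```

For example, running extract on the text "see http://www.slashdot.org and http://sf.net/projects" returns the two host-level URLs http://www.slashdot.org and http://sf.net, which is exactly what gets enqueued for the breadth-first crawl.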