
Thread: Crawler is getting slow

  1. #1
    Junior Member (joined Oct 2010, 4 posts)

    Crawler is getting slow

    This is a problem I have been facing for a long time. I use a multi-threaded, stand-alone Java crawler program to crawl the web. Currently I crawl nearly 40 websites with it. It works fine until it has crawled around 20 websites; after that it gradually slows down and eventually takes a very long time to finish. I also checked the JVM memory, and 70 to 90 percent of it is always free.

    Recently I began to suspect the threading and tried crawling just 5 websites. I was shocked to see that 3 of them took nearly 45 minutes, and the remaining 2 took nearly 3 hours.

    Because of this problem I can't even increase the number of websites to crawl.

    Can anyone please suggest a solution?

    Thanks in advance.


  2. #2
    mmm.. coffee JavaPF's Avatar
    Join Date
    May 2008
    Location
    United Kingdom
    Posts
    3,336
    My Mood
    Mellow
    Thanks
    258
    Thanked 287 Times in 225 Posts
    Blog Entries
    4

    Default Re: Crawler is getting slow

    Take a look at this similar thread

    Web Crawling is too Slow!

    Can you post your code for us to look at? Factors other than the code itself could also be contributing to the slowness.

  3. #3
    Junior Member (joined Oct 2010, 4 posts)

    Re: Crawler is getting slow

    Thanks for the response.

    Yes, I already had a look at that thread, though I couldn't work out my problem from it. Here I am posting my code, which has two segments: threading and crawling.

    In the code below I have omitted the SQL queries as much as possible.

    Threading

    import java.util.ArrayList;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class FewThreadList {

        public static void main(String[] args) {
            // outletArray is populated from the database in pairs of
            // (outlet_ID, url); the SQL code is omitted here.
            ArrayList<String> outletArray = new ArrayList<String>();
            ExecutorService es = Executors.newFixedThreadPool(2);
            for (int i = 0; i < outletArray.size(); i += 2) {
                String outlet_ID = outletArray.get(i);
                // url is the website's home page; outlet_ID is its unique id
                String url = outletArray.get(i + 1);
                es.execute(new WebCrawlerBatch(url, Integer.parseInt(outlet_ID)));
            }
            es.shutdown();
        }
    }
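    One thing worth flagging in the block above: `newFixedThreadPool(2)` means that at most two sites are ever crawled concurrently, so with 40 sites one slow website stalls half the pool, and `shutdown()` alone does not wait for the submitted tasks to finish. A minimal sketch of a larger pool plus `awaitTermination` (class and method names are hypothetical, and the actual crawl work is stubbed out):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    // Run one task per site on a pool sized for I/O-bound work,
    // then wait for all of them to finish before reporting.
    static int crawlAll(List<String> urls, int poolSize) throws InterruptedException {
        ExecutorService es = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger done = new AtomicInteger();
        for (final String url : urls) {
            es.execute(new Runnable() {
                public void run() {
                    // the real per-site crawl(url) work would go here
                    done.incrementAndGet();
                }
            });
        }
        es.shutdown();                          // stop accepting new tasks
        es.awaitTermination(1, TimeUnit.HOURS); // block until all complete
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sites = Arrays.asList("http://a", "http://b", "http://c");
        System.out.println("crawled " + crawlAll(sites, 8) + " sites");
    }
}
```

    Because crawling is network-bound rather than CPU-bound, a pool noticeably larger than the core count is usually reasonable; the threads spend most of their time waiting on sockets.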

    Crawling ==> WebCrawlerBatch.java:

    // Requires java.io.*, java.net.*, java.util.*, java.util.regex.*;
    // StringUtils is org.apache.commons.lang.StringUtils.
    class WebCrawlerBatch implements Runnable { // was: extends Thread

        // Note: the original declared startUrl/StartUrl and
        // Outlet_ID/Outlets_ID twice with different capitalisation;
        // the duplicates are consolidated here.
        String startUrl;
        int Outlet_ID;
        StringBuilder pageBuffer = null;
        BufferedReader reader = null;
        int respCode = 0;
        String line = "";
        ArrayList linkList = null;
        SBSearchEngine sbSearch = null;
        private boolean crawling;

        // Constructor for the search web crawler.
        public WebCrawlerBatch(String startUrl, int Outlet_ID) {
            this.startUrl = startUrl;
            this.Outlet_ID = Outlet_ID;
        }

        public void run() {
            actionSearch(startUrl, Outlet_ID);
        }

        private void actionSearch(String startUrl, int Outlet_ID) {
            if (crawling) {
                crawling = false;
                return;
            }
            int maxUrls = 1000000000; // maximum value of int = 2,147,483,647
            String searchString = " ";
            startUrl = removeWwwFromUrl(startUrl);
            search(startUrl, maxUrls, searchString, Outlet_ID);
        }
        public String check(String htmltext, String strStart, String strEnd, String url) {
            int maxLen = 1000;
            String searchdata = StringUtils.substringBetween(htmltext, strStart, strEnd);
            if (searchdata != null) {
                if (searchdata.length() > maxLen)
                    searchdata = searchdata.substring(0, maxLen - 1);
                return searchdata.trim();
            } else {
                return "NIL";
            }
        }
     
     
        // Run the search crawler.
        private void search(final String startUrl, final int maxUrls,
                final String searchString, final int Outlet_ID) {
            crawling = true;
            crawl(startUrl, maxUrls, true, searchString, true, Outlet_ID);
            crawling = false;
        }

        // Verify URL format.
        private URL verifyUrl(String url) {
            // Only allow HTTP URLs.
            if (!url.toLowerCase().startsWith("http://"))
                return null;
            try {
                return new URL(url);
            } catch (Exception e) {
                return null;
            }
        }
     
        private String downloadPage(URL pageUrl, int Outlet_ID) {
            try {
                URLConnection uconn = pageUrl.openConnection();
                Calendar c1 = Calendar.getInstance();
                c1.set(2010, 4, 23); // Calendar months are 0-based, so 4 = May
                uconn.setIfModifiedSince(c1.getTimeInMillis());
                if (!(uconn instanceof HttpURLConnection)) {
                    throw new IllegalArgumentException("URL protocol must be HTTP.");
                }
                final HttpURLConnection conn1 = (HttpURLConnection) uconn;
                // Timeouts must be set before the connection is made; the
                // original set them after getResponseCode(), which is too late.
                conn1.setConnectTimeout(50000);
                conn1.setReadTimeout(50000);
                respCode = conn1.getResponseCode();
                if (respCode != 404 && respCode != 403 && respCode != 400
                        && respCode != 504 && respCode != 502) {
                    reader = new BufferedReader(
                            new InputStreamReader(conn1.getInputStream()));
                    pageBuffer = new StringBuilder();
                    while ((line = reader.readLine()) != null) {
                        pageBuffer.append(line);
                    }
                    reader.close();
                }
                try {
                    conn1.disconnect();
                } catch (Exception ee) {
                }
                // Guard added: on an error response pageBuffer stays null
                // (or holds the previous page), so don't call toString() blindly.
                return (pageBuffer == null) ? null : pageBuffer.toString();
            } catch (java.net.SocketTimeoutException timeout) {
                System.out.println("Timed out");
            } catch (java.io.IOException ioe) {
                // ignored
            } catch (Exception e) {
                // ignored
            }
            pageBuffer = null;
            return null;
        }
     
     
        // Remove leading "www" from a URL's host if present.
        private String removeWwwFromUrl(String url) {
            // Note: the trailing dot must be escaped; unescaped, "." matches
            // any character, not just a literal dot.
            Pattern p = Pattern.compile("://www\\d*\\.", Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(url);
            if (m.find()) {
                String wwwcut = m.group().trim();
                int c1 = url.indexOf(wwwcut);
                int c = wwwcut.lastIndexOf('.') + 1;
                return url.substring(0, c1 + 3) + url.substring(c1 + c);
            }
            return url;
        }
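    For reference, the same www-stripping idea can be written more directly with the match offsets from `Matcher.start()`/`Matcher.end()`, with the dot escaped so it only matches a literal dot (class and method names here are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WwwStripper {
    // Strip a leading "www", "www2", etc. from the host part of a URL.
    static String removeWww(String url) {
        Pattern p = Pattern.compile("://www\\d*\\.", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(url);
        if (m.find()) {
            // Keep "://", drop everything up to and including the dot.
            return url.substring(0, m.start() + 3) + url.substring(m.end());
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(WwwStripper.removeWww("http://www.example.com/a"));
    }
}
```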
     
        private ArrayList retrieveLinks(URL pageUrl, String pageContents,
                HashSet crawledList, boolean limitHost, String pgstart,
                LinkedHashSet LandPage, LinkedHashSet finalCrawlList) {

            pgstart = removeWwwFromUrl(pgstart);
            // Compile link matching pattern.
            Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(pageContents);
            // Create list of link matches.
            linkList = new ArrayList();
            String link = "";
            String file = "";
            String path = "";
            int index;
            URL verifiedLink = null;
            while (m.find()) {
                link = m.group(1).trim();
                // Skip empty links.
                if (link.length() < 1) {
                    continue;
                }
                // Skip links that are just page anchors.
                if (link.charAt(0) == '#') {
                    continue;
                }
                // Skip mailto links.
                if (link.indexOf("mailto:") != -1) {
                    continue;
                }
                // Skip JavaScript links.
                if (link.toLowerCase().indexOf("javascript") != -1) {
                    continue;
                }
                // Prefix absolute and relative URLs if necessary.
                if (link.indexOf("://") == -1) {
                    if (link.charAt(0) == '/') {
                        // Handle absolute URLs.
                        link = "http://" + pageUrl.getHost() + link;
                    } else {
                        // Handle relative URLs.
                        file = pageUrl.getFile();
                        if (file.indexOf('/') == -1) {
                            link = "http://" + pageUrl.getHost() + "/" + link;
                        } else {
                            path = file.substring(0, file.lastIndexOf('/') + 1);
                            link = "http://" + pageUrl.getHost() + path + link;
                        }
                    }
                }
                // Remove anchors from link.
                index = link.indexOf('#');
                if (index != -1) {
                    link = link.substring(0, index);
                }
                // Remove leading "www" from the URL's host if present.
                link = removeWwwFromUrl(link);
                // Verify link and skip if invalid.
                verifiedLink = verifyUrl(link);
                if (verifiedLink == null) {
                    continue;
                }
                /* If specified, limit links to those having
                   the same host as the start URL. */
                if (limitHost && !pageUrl.getHost().toLowerCase()
                        .equals(verifiedLink.getHost().toLowerCase())) {
                    continue;
                }
                // Strip a trailing slash.
                if (link.charAt(link.length() - 1) == '/') {
                    link = link.substring(0, link.length() - 1);
                    System.out.println("The link is made as " + link);
                }
                if (finalCrawlList.contains(link)) {
                    System.out.println("The link ignored because of finalCrawlList " + link);
                    continue;
                }
                if (LandPage.contains(link)) {
                    System.out.println("Blocked with landing page check 1");
                    continue;
                }
                if (link.startsWith(pgstart)) {
                    System.out.println("The link coming for article check is " + link);
                    if (link.matches("http://.*(STORY|ARTICLE|story|article|\\d{1,}).*")) {
                        if (!linkList.contains(link))
                            linkList.add(link);
                        System.out.println("Link added 1");
                    } else if (link.matches("http://.*(/.*){4,}.*")) {
                        if (!linkList.contains(link))
                            linkList.add(link);
                        System.out.println("Link added 2");
                    }
                }
            }
            return linkList;
        }
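    A performance note on the method above: `linkList.contains(link)` on an `ArrayList` is a linear scan, so the cost of de-duplicating links grows quadratically with the number of links collected, which could contribute to a gradual slowdown on link-heavy sites. A hashed set gives constant-time membership checks while still preserving insertion order. A minimal sketch (class and method names are hypothetical):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DedupSketch {
    // contains()/add() are O(1) on a LinkedHashSet, versus O(n) scans
    // with ArrayList.contains(); insertion order is preserved.
    static Set<String> dedup(Iterable<String> links) {
        Set<String> seen = new LinkedHashSet<String>();
        for (String link : links) {
            seen.add(link); // a no-op if the link is already present
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(dedup(java.util.Arrays.asList("a", "b", "a")));
    }
}
```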
     
        public void crawl(String startUrl, int maxUrls, boolean limitHost,
                String searchString, boolean caseSensitive, int Outlet_ID) {

            long before = new Date().getTime();
            String sourceURL = startUrl;
            int count = 0;
            int i = 0;
            int j = 0;
            String strResult = "";

            // Set up crawl lists.
            HashSet crawledList = new HashSet();
            LinkedHashSet toCrawlList = new LinkedHashSet();
            LinkedHashSet finalCrawlList = new LinkedHashSet();

            int k = 0;

            // The following were never declared in the posted code and are
            // presumably fields or part of the omitted SQL section: arrayLP
            // (landing pages read from the database), strLandingPage, lpcount,
            // downloadedPgCount, landpgcount, links, pagestart, pageContents,
            // url, verifiedUrl.

            try {
                // The loop that reads landing pages from the database was
                // omitted; for each row it did roughly the following.
                // strLandingPage is the home page.
                if (strLandingPage.charAt(strLandingPage.length() - 1) == '/') {
                    strLandingPage = strLandingPage.substring(0, strLandingPage.length() - 1);
                    System.out.println("The landing page is made as " + strLandingPage);
                }
                toCrawlList.add(strLandingPage);
                lpcount = lpcount + 1;
                k = k + 2;
                // } // end of the omitted landing-page loop

                while (crawling && toCrawlList.size() > 0) {
                    /* Check to see if the max URL count has
                       been reached, if it was specified. */
                    if (maxUrls != -1) {
                        if (crawledList.size() == maxUrls) {
                            break;
                        }
                    }

                    url = (String) toCrawlList.iterator().next();
                    toCrawlList.remove(url);
                    verifiedUrl = verifyUrl(url);

                    // Add page to the crawled list.
                    crawledList.add(url);
                    downloadedPgCount++;

                    try {
                        pageContents = downloadPage(verifiedUrl, Outlet_ID);
                    } catch (NullPointerException nexp) {
                        System.out.println("Null pointer exception caught 1 (landing page)");
                    }

                    if (pageContents != null && pageContents.length() > 0) {
                        pagestart = arrayLP.get(j + 1).toString();
                        if (pagestart.charAt(pagestart.length() - 1) == '/') {
                            pagestart = pagestart.substring(0, pagestart.length() - 1);
                            System.out.println("The page start is made as " + pagestart);
                        }
                        links = retrieveLinks(verifiedUrl, pageContents, crawledList,
                                limitHost, pagestart, toCrawlList, finalCrawlList);
                        finalCrawlList.addAll(links);
                    } // End of if page contents

                    landpgcount++;
                    j = j + 2;
                }

                links.clear();
                linkList.clear();
                crawledList.clear();
                toCrawlList.clear();

                URL downloadURL = null;
                Iterator itr = finalCrawlList.iterator();
                System.out.println("Without doing the SQL updates "
                        + (Runtime.getRuntime().freeMemory() * 100)
                        / Runtime.getRuntime().totalMemory());
                int h = 0;

                while (itr.hasNext()) {
                    try {
                        downloadURL = new URL(itr.next().toString());
                        itr.remove();
                    } catch (java.net.MalformedURLException e) {
                        e.printStackTrace();
                        System.out.println("Exceptioned URL and outlet id are: " + downloadURL);
                    }

                    try {
                        pageContents = downloadPage(downloadURL, Outlet_ID);
                    } catch (NullPointerException nex) {
                        System.out.println("Null pointer exception caught 2 (link)");
                    }

                    pageBuffer = null;
                    reader.close();
                    // NOTE: some string operations are done here to retrieve
                    // data from the downloaded content.
                }

                System.gc();
                pagestart = "";
                pageContents = "";
                arrayLP.clear();
                finalCrawlList.clear();
                System.out.println("Cleared all arraylists");

                System.out.println("The percentage of free memory at the end is "
                        + (Runtime.getRuntime().freeMemory() * 100)
                        / Runtime.getRuntime().totalMemory());
                System.out.println("At the end " + Outlet_ID);

            } // End of try
            // These catches suggest a Class.forName/JDBC call inside the
            // omitted SQL code.
            catch (ClassNotFoundException c) {
                System.out.println("ClassNotFoundException");
                c.printStackTrace();
            } catch (InstantiationException I) {
                System.out.println("InstantiationException");
            } catch (IllegalAccessException IA) {
                System.out.println("IllegalAccessException");
            }
        }
    }

  4. #4
    Administrator copeg's Avatar
    Join Date
    Oct 2009
    Location
    US
    Posts
    5,297
    Thanks
    180
    Thanked 824 Times in 767 Posts
    Blog Entries
    5

    Default Re: Crawler is getting slow

    Your code is a bit unreadable given the lack of indentation, so it's hard to pinpoint anything that stands out. I'd suggest using the System.currentTimeMillis method to time each step of the process, to try and determine where the bottleneck is occurring.
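    The timing suggestion above can be sketched as a small helper; the names are hypothetical, and in the crawler you would wrap each phase (page download, link extraction, the SQL updates) to see which one grows as the site count increases:

```java
public class PhaseTimer {
    // Time one phase of work and report how long it took.
    static long timePhase(String label, Runnable phase) {
        long start = System.currentTimeMillis();
        phase.run();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(label + " took " + elapsed + " ms");
        return elapsed;
    }

    public static void main(String[] args) {
        timePhase("download", new Runnable() {
            public void run() {
                // downloadPage(...) would go here
            }
        });
    }
}
```

    Logging these per-phase numbers per site makes it easy to see whether the slowdown is in the network I/O, the link parsing, or the database work.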
