Thread: How to Grab the HTML source code of a website URL index page?

  1. #1
    mmm.. coffee JavaPF's Avatar
    Join Date
    May 2008
    Location
    United Kingdom
    Posts
    3,336
    My Mood
    Mellow
    Thanks
    258
    Thanked 294 Times in 227 Posts
    Blog Entries
    4

    How to Grab the HTML source code of a website URL index page?

    This code will grab the HTML source from a given URL.

    Replace "http://website here.com" with a real URL starting with http://, and the program will print the index page's source code to the console.

    The nice thing about this code is that it spoofs the connection: it sends a browser-style User-Agent header, so the request looks like it comes from a web browser.
    This lets you fetch pages from sites like Google that normally block connections from non-browser applications.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class GrabHTML {

        // Fetches the page and prints its HTML source to the console
        public static void connect() throws IOException {

            // Set URL (placeholder: replace with a real URL starting with http://)
            URL url = new URL("http://website here.com");
            URLConnection spoof = url.openConnection();

            // Spoof the connection so we look like a web browser
            spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

            // try-with-resources closes the reader, and the stream under it, when done.
            // The platform default charset is used; the page's real charset may differ.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()))) {
                String strLine;

                // Loop through every line in the source
                while ((strLine = in.readLine()) != null) {

                    // Print each line to the console
                    System.out.println(strLine);
                }
            }

            System.out.println("End of page.");
        }

        public static void main(String[] args) {

            try {
                // Call the connect method
                connect();
            } catch (Exception e) {
                // Report the failure instead of silently swallowing it
                e.printStackTrace();
            }
        }
    }

  2. The Following 2 Users Say Thank You to JavaPF For This Useful Post:

    Bryan (April 22nd, 2010), dave0110 (December 3rd, 2010)


  3. #2
    Junior Member
    Join Date
    Jan 2010
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: How to Grab the HTML source of a website URL

    I have found an interesting use for this code... my problem is that I need to be logged into the site first. Is there any way to pass the cookie information when opening the connection?

    I can probably do it with some JavaScript and a webpage, but I would rather have it in one app.

    Thanks for the source,
    Jason

  4. #3
    Administrator copeg's Avatar
    Join Date
    Oct 2009
    Location
    US
    Posts
    5,320
    Thanks
    181
    Thanked 833 Times in 772 Posts
    Blog Entries
    5

    Re: How to Grab the HTML source of a website URL

    Use the URLConnection methods getHeaderField to read the cookie from the response's Set-Cookie header, and setRequestProperty to send it back in the Cookie header of your next request. It's a bit more complex than just that, so see the following link for a more detailed description: Handling Cookies Using the java.net API
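
    A minimal sketch of that idea, assuming the site hands out its session through Set-Cookie headers. The class name CookieSketch, the helpers toCookieHeader, readSetCookies and openWithCookies, and the sample cookie values are all hypothetical and not taken from the linked article. It reads the headers with getHeaderFields (plural) because a response can carry several Set-Cookie headers and getHeaderField returns only one value.

    ```java
    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    class CookieSketch {

        // Builds a "Cookie" request header value from raw "Set-Cookie" response values.
        // Only the leading name=value pair is kept; attributes such as Path are dropped.
        static String toCookieHeader(List<String> setCookieValues) {
            List<String> pairs = new ArrayList<>();
            for (String value : setCookieValues) {
                String pair = value.split(";", 2)[0].trim();
                if (!pair.isEmpty()) {
                    pairs.add(pair);
                }
            }
            return String.join("; ", pairs);
        }

        // Collects every Set-Cookie header value from a response.
        static List<String> readSetCookies(URLConnection response) {
            List<String> result = new ArrayList<>();
            for (Map.Entry<String, List<String>> header : response.getHeaderFields().entrySet()) {
                // Header names are case-insensitive; the status line has a null key
                if ("Set-Cookie".equalsIgnoreCase(header.getKey())) {
                    result.addAll(header.getValue());
                }
            }
            return result;
        }

        // Opens targetAddress, sending back the cookies received in loginResponse.
        static URLConnection openWithCookies(URLConnection loginResponse, String targetAddress)
                throws IOException {
            String cookies = toCookieHeader(readSetCookies(loginResponse));
            URLConnection target = new URL(targetAddress).openConnection();
            if (!cookies.isEmpty()) {
                target.setRequestProperty("Cookie", cookies);
            }
            return target;
        }

        public static void main(String[] args) {
            // Sample header values (made up) to show the conversion without a network call
            List<String> raw = Arrays.asList(
                    "JSESSIONID=abc123; Path=/; HttpOnly",
                    "theme=dark; Path=/");
            System.out.println(toCookieHeader(raw)); // prints JSESSIONID=abc123; theme=dark
        }
    }
    ```

    This ignores cookie attributes such as Expires, Domain and Path, so treat it as a starting point for a single site rather than a general cookie store.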

  5. #4
    Junior Member
    Join Date
    Jan 2010
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: How to Grab the HTML source of a website URL

    Thanks for the help.

  6. #5
    Junior Member
    Join Date
    Feb 2010
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: How to Grab the HTML source of a website URL

    This is really good, and I think it might help me with the application I want to build.

    How could I use this to pull only some of the information out of a site? For example, to filter out everything apart from the prices of items?
    Thanks
    Duff

  7. #6
    Super Moderator Json's Avatar
    Join Date
    Jul 2009
    Location
    Warrington, United Kingdom
    Posts
    1,274
    My Mood
    Happy
    Thanks
    70
    Thanked 156 Times in 152 Posts

    Re: How to Grab the HTML source of a website URL

    You would have to grab the whole page and then parse the source code somehow. A simple regex might work for you.
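
    As a rough illustration of the regex route, here is a sketch that pulls price-like strings out of HTML that has already been downloaded. The class name PriceSketch, the method extractPrices and the price format (a dollar, pound or euro sign followed by digits and an optional two-digit decimal part) are assumptions made for the example; a real site will likely need its own pattern.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class PriceSketch {

        // Assumed format: currency sign ($, pound or euro), optional space,
        // digits, then an optional ".dd" decimal part
        private static final Pattern PRICE =
                Pattern.compile("[$\u00A3\u20AC]\\s?\\d+(?:\\.\\d{2})?");

        // Returns every substring of the HTML that matches the price pattern, in order
        static List<String> extractPrices(String html) {
            List<String> prices = new ArrayList<>();
            Matcher matcher = PRICE.matcher(html);
            while (matcher.find()) {
                prices.add(matcher.group());
            }
            return prices;
        }

        public static void main(String[] args) {
            // Made-up markup standing in for the downloaded page source
            String html = "<ul><li>Mug <b>$4.99</b></li><li>Kettle <b>$25</b></li></ul>";
            System.out.println(extractPrices(html)); // prints [$4.99, $25]
        }
    }
    ```

    The pattern does not handle thousands separators or prices written without a currency sign, and it matches anywhere in the text, including inside scripts or attributes. If the page structure matters, an HTML parser is usually more reliable than a regex.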

    // Json

  8. #7
    Member
    Join Date
    Apr 2010
    Location
    The Hague, Netherlands
    Posts
    91
    Thanks
    3
    Thanked 10 Times in 10 Posts

    Re: How to Grab the HTML source of a website URL

    Very nice JavaPF, very nice!

    Together with copeg's link and 'login by HTTP POST', this is perfect for me.
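
    For anyone wondering what a login by HTTP POST might look like, here is a sketch under some assumptions: the site accepts an ordinary form-encoded POST, and the field names "username" and "password" are made up, so check the real login form for the names it expects. The class PostLoginSketch and its methods are hypothetical, and URLEncoder.encode(String, Charset) needs Java 10 or later.

    ```java
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class PostLoginSketch {

        // Encodes form fields as application/x-www-form-urlencoded text
        static String encodeForm(Map<String, String> fields) {
            StringBuilder body = new StringBuilder();
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
            }
            return body.toString();
        }

        // Posts the form and returns the connection so the caller can read
        // the response body and headers from it
        static HttpURLConnection postForm(String address, Map<String, String> fields)
                throws IOException {
            byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
            HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
            connection.setRequestMethod("POST");
            connection.setDoOutput(true); // must be set before writing a request body
            connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = connection.getOutputStream()) {
                out.write(body);
            }
            return connection;
        }

        public static void main(String[] args) {
            // Hypothetical field names and values; no request is sent here
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("username", "duke");
            fields.put("password", "p@ss word");
            System.out.println(encodeForm(fields)); // prints username=duke&password=p%40ss+word
        }
    }
    ```

    After postForm returns, the Set-Cookie headers on the response can be read and sent back on later requests, as copeg describes above.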
