How to Grab the HTML source code of a website URL index page?
This code grabs the HTML source from a given URL.
Change "website here.com" to a real URL starting with http:// and the program will print the index page's source code to the console.
The nice thing about this code is that it spoofs the User-Agent header so the connection looks like it comes from a web browser.
This lets you fetch pages from sites such as Google that normally reject requests from non-browser applications.
Code Java:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GrabHTML {

    public static void connect() throws Exception {
        // Set the URL (replace the placeholder with a real address)
        URL url = new URL("http://website here.com");
        URLConnection spoof = url.openConnection();
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()))) {
            String strLine;
            // Loop through every line in the source
            while ((strLine = in.readLine()) != null) {
                // Print each line to the console
                System.out.println(strLine);
            }
        }
        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            // Call the connect method
            connect();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}
Re: How to Grab the HTML source of a website URL
I have found an interesting use for this code, but my problem is that I need to be logged into the site first. Is there any way to pass the cookie information when opening the connection?
I could probably do it with some JavaScript and a web page, but I would rather have it all in one app.
Thanks for the source,
Jason
Re: How to Grab the HTML source of a website URL
Use the URLConnection methods getHeaderField to retrieve the cookie and setRequestProperty to set it. It's a bit more complex than just that, so see the following link for a more detailed description: Handling Cookies Using the java.net API
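As a rough illustration of the idea above, here is a minimal sketch. The class name CookieSketch, the method names, and the loginUrl/pageUrl parameters are all made up for this example. It reads the headers with getHeaderFields (the plural form, so that several Set-Cookie lines are picked up) and keeps only the leading name=value pair of each cookie. It assumes the first request is enough to obtain a session cookie; a real login usually also needs a POST with your credentials, which is not shown here.

```java
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CookieSketch {

    // Builds a "Cookie" request header value from the "Set-Cookie" response headers.
    // Only the leading name=value pair of each Set-Cookie line is kept; attributes
    // such as Path, Expires or HttpOnly are dropped.
    public static String buildCookieHeader(Map<String, List<String>> responseHeaders) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : responseHeaders.entrySet()) {
            String name = entry.getKey();
            // The HTTP status line is stored under a null key, so guard against it
            if (name != null && name.equalsIgnoreCase("Set-Cookie")) {
                for (String value : entry.getValue()) {
                    pairs.add(value.split(";", 2)[0].trim());
                }
            }
        }
        return String.join("; ", pairs);
    }

    // Hypothetical flow: request loginUrl, then reuse its cookies when opening pageUrl
    public static URLConnection openWithCookies(String loginUrl, String pageUrl) throws Exception {
        URLConnection login = new URL(loginUrl).openConnection();
        String cookies = buildCookieHeader(login.getHeaderFields());
        URLConnection page = new URL(pageUrl).openConnection();
        if (!cookies.isEmpty()) {
            page.setRequestProperty("Cookie", cookies);
        }
        return page;
    }
}
```

The connection returned by openWithCookies can be read with the same BufferedReader loop as in the original post.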
Re: How to Grab the HTML source of a website URL
This is really good, and I think it might help me with the application I want to build.
How could I use this to pull only some information out of a site? For example, to filter out everything apart from the prices of items?
Thanks
Duff
Re: How to Grab the HTML source of a website URL
You would have to grab the whole page and then parse through the source code somehow. A simple regex may work for you.
// Json
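To make the regex suggestion concrete, here is a small sketch. The class name PriceFilter and the price format are assumptions for this example: it only recognises dollar amounts such as $19.99 or $1,299.00, so the pattern would need adjusting for the site you are actually scraping. For anything beyond simple cases a proper HTML parser is likely to be more robust than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceFilter {

    // Assumed format: a dollar sign, digits, optional thousands groups, optional cents
    private static final Pattern PRICE =
            Pattern.compile("\\$\\d+(?:,\\d{3})*(?:\\.\\d{2})?");

    // Returns every substring of the HTML that looks like a price, in page order
    public static List<String> extractPrices(String html) {
        List<String> prices = new ArrayList<>();
        Matcher matcher = PRICE.matcher(html);
        while (matcher.find()) {
            prices.add(matcher.group());
        }
        return prices;
    }
}
```

To use it with the original program, append each line from readLine() to a StringBuilder instead of printing it, then pass the resulting string to extractPrices.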
Re: How to Grab the HTML source of a website URL
Very nice JavaPF, very nice!
Together with copeg's URL and 'login by HTTP POST' this is perfect for me :P