How to read URLs from a web page
Dear Members,
I would like to know whether there is a way to read the URLs of the links on a web page. To make this clear, here is a concrete example. The web site "http://in.yahoo.com" has many links, such as Finance, Games, Life Style, News, etc.
If you place the mouse pointer on any of those links, you can see the URL associated with it in the status bar at the bottom of the web browser. For instance, if you place the mouse pointer on the link "Games", you can see the URL "http://in.yahoo.com/r/ygms" displayed on the status bar at the bottom of the IE browser.
Similarly, if you place the mouse pointer on the link "Life Style", you can see the URL "http://in.yahoo.com/lfs" displayed on the status bar at the bottom of the IE browser. That web page contains many such links.
I want to write a Java program (something like public class GrabURLs) that takes the URL of any web page (not necessarily "http://in.yahoo.com") in its constructor. Given that URL, the program has to find out whether the page contains any links; if so, it should collect all the links on that page into a String array or a Vector.
For example, I would write something like:
GrabURLs webPage = new GrabURLs("http://in.yahoo.com");
String[] links = webPage.getLinks();
The links array is supposed to contain elements such as links[0] = "http://in.yahoo.com/r/ygms", links[1] = "http://in.yahoo.com/lfs" and so on.
What source code should I write for the method getLinks() of the class GrabURLs?
I would be delighted if someone could give a solution. It need not be a full Java program; at the very least, I would like to know which classes and methods are involved in this task.
With best regards,
Abitha.
Re: How to read URLs from a web page
Code :
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

public class GrabURLs {

    private static final String BASE = "<base href=";
    private static final String ANCHOR = "<a href=";

    private URL url;

    public static void main(String[] args) {
        GrabURLs gu = new GrabURLs("http://in.yahoo.com");
        ArrayList<String> links = gu.getLinks();
        for (String link : links) {
            System.out.println(link);
        }
    }

    public GrabURLs(String urls) {
        try {
            url = new URL(urls);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }

    public ArrayList<String> getLinks() {
        ArrayList<String> links = new ArrayList<String>();
        if (url == null) {
            return links; // the constructor was given a malformed URL
        }

        // Read the whole page into one string; the reader is closed automatically.
        StringBuilder page = new StringBuilder();
        try (BufferedReader urlIn = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = urlIn.readLine()) != null) {
                page.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
            return links;
        }
        String s = page.toString();

        // Print the <base href="..."> value, if the page declares one.
        int baseAt = s.indexOf(BASE);
        if (baseAt != -1) {
            int q = baseAt + BASE.length();
            int end = closingQuote(s, q);
            if (end != -1) {
                System.out.println(s.substring(q + 1, end));
            }
        }

        // Collect the value of every <a href="..."> or <a href='...'>.
        int at = s.indexOf(ANCHOR);
        while (at != -1) {
            int q = at + ANCHOR.length();
            int end = closingQuote(s, q);
            if (end != -1) {
                links.add(s.substring(q + 1, end));
            }
            // Continue searching after the link just handled.
            at = s.indexOf(ANCHOR, end != -1 ? end : q);
        }
        return links;
    }

    // Returns the index of the quote that closes the one at quoteAt,
    // or -1 if there is no opening quote there or it is never closed.
    private static int closingQuote(String s, int quoteAt) {
        if (quoteAt >= s.length()) {
            return -1;
        }
        char quote = s.charAt(quoteAt);
        if (quote != '"' && quote != '\'') {
            return -1;
        }
        return s.indexOf(quote, quoteAt + 1);
    }
}
It's not perfect but you get the idea :)
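One weak spot is that searching for the literal text "<a href=" misses tags written as <A HREF=...> or with other attributes before href, and it returns relative links such as "/lfs" unchanged. As a rough sketch of one alternative (not the only way to do it), the page text could be scanned with java.util.regex and each link resolved against the page address with java.net.URI. The class name LinkExtractor and the method extractLinks are made up for this example, and a regular expression is still only an approximation of real HTML parsing.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of GrabURLs.
final class LinkExtractor {

    // Matches an <a ...> tag with a quoted href attribute, in any letter case.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every href found in html, resolved against baseUrl.
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            try {
                // Absolute links come back unchanged; relative ones get the base prepended.
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI (e.g. contains spaces): keep the raw text.
                links.add(href);
            }
        }
        return links;
    }
}
```

Inside getLinks() this could be called as LinkExtractor.extractLinks(s, "http://in.yahoo.com/"), where s is the page text that was read from the stream. For anything beyond a quick script, a real HTML parser is likely to be more reliable than pattern matching.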
Regards,
Chris