Hi all,
I wanted to know: what's the best way to extract text from a website and then load it into a database?
At present I am using a web crawler to get the text from the web. Is there any other way of doing this?
you can see an example here
Thanks for the example.
But I need to crawl text not just from a given URL, but from the URLs within a given website.
Well, once you can get the text from one URL, you can parse that text, search it for further links, and then open a URL connection to fetch the contents of each link you find. You will need some recursion here, and you should keep track of the URLs you have already visited so the crawl does not loop forever.
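To make the idea above concrete, here is a minimal sketch of that fetch, parse, follow loop. Everything in it is an assumption for illustration: the class name SiteCrawler, the helpers extractLinks, crawl and fetch, the maxPages limit, and the example.com start URL are all made up. It uses a queue rather than literal recursion (same effect, but no deep call stack), and it finds links with a crude regular expression, so a proper HTML parser would be more reliable on real pages.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteCrawler {
    // Crude pattern for href="..." or href='...'; fragments (#...) are dropped.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http(s) links found in html, resolved against baseUrl. */
    public static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI link = base.resolve(m.group(1).trim());
                String scheme = link.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(link.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** Breadth-first crawl, limited to the start URL's host and to maxPages pages. */
    public static Set<String> crawl(String startUrl, int maxPages,
                                    Function<String, String> fetcher) {
        String host = URI.create(startUrl).getHost();
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already handled this URL
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed, nothing to parse
            }
            for (String link : extractLinks(html, url)) {
                boolean sameHost = host != null
                        && host.equalsIgnoreCase(URI.create(link).getHost());
                if (sameHost && !visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    /** Downloads a page with URLConnection; returns null if anything goes wrong. */
    public static String fetch(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleCrawler)");
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical start URL; replace it with the site you want to crawl.
        for (String page : crawl("http://example.com/", 20, SiteCrawler::fetch)) {
            System.out.println(page);
        }
    }
}
```

The fetcher is passed in as a function so you can try the crawl logic against a map of canned pages before pointing it at a live site. The maxPages cap and the same-host check are only there to keep the crawl from wandering off across the whole web.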
How about this from the code snippets forum:
http://www.javaprogrammingforums.com...bsite-url.html
Code Java:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GrabHTML {

    public static void Connect() throws Exception {
        // Set URL
        URL url = new URL("http://website here.com");
        URLConnection spoof = url.openConnection();

        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        // try-with-resources closes the reader once the page has been read
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(spoof.getInputStream()))) {
            String strLine;
            // Loop through every line in the source
            while ((strLine = in.readLine()) != null) {
                // Print each line to the console
                System.out.println(strLine);
            }
        }
        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            // Call the Connect method
            Connect();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}
It will grab the HTML source of a webpage. You can then process it as you wish.
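Since the original question was about extracting text rather than raw HTML, here is one rough way the "process it" step might look. This is only a sketch: the class HtmlToText and its toText method are hypothetical names, and it strips tags with regular expressions, which is good enough for simple pages but will trip over malformed or unusual markup. It also decodes only a handful of common entities.

```java
public class HtmlToText {

    /** Returns a rough plain-text version of the given HTML source. */
    public static String toText(String html) {
        String text = html
                // Drop script and style blocks together with their contents
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                // Drop HTML comments
                .replaceAll("(?s)<!--.*?-->", " ")
                // Replace every remaining tag with a space
                .replaceAll("(?s)<[^>]+>", " ")
                // Decode a few common entities (not a complete list)
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse runs of whitespace into single spaces
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>";
        System.out.println(toText(html));
    }
}
```

The string that comes back is what you would then store in your database, for example one row per URL.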