extracting a list of few lines from html file
Hello all, I'm a beginner in Java with a few basic skills.
I want to extract a few fields from an HTML file.
Let's say the file has the same format as the page at the following link:
AIRCC - IJWMN - Journal
Now I want to extract all the topics mentioned on the page as a list and store them in a text file.
The text file should contain only the following entries:
Architectures, protocols, and algorithms to cope with mobile & wireless Networks
Distributed algorithms of mobile computing
.
.
Wireless multimedia systems
Service creation and management environments for mobile/ wireless systems
How can I do this?
I tried using a regexp, but I guess I could not use it to its full extent.
Please help me get through this.
Re: extracting a list of few lines from html file
Are you trying to parse the contents of an html page?
I think there are some third party packages that will help, but I don't have a link or name.
Re: extracting a list of few lines from html file
Dear Norm,
Yes, to some extent I need to parse that file. Alternatively, I can read it directly from the URL through a connection object. The problem I'm facing is knowing exactly where to start and how far to read in order to get the complete list of topics shown on the web page.
I hope you can give me a clue. :)
Re: extracting a list of few lines from html file
Sorry, I don't have the name of any packages to parse html.
Re: extracting a list of few lines from html file
You could do it trivially with javax.swing.text.html.parser.DocumentParser - read the Java SE API doc for that class and implement your own ParserCallback. You'll have to override handleStartTag and look for HTML.Tag.LI and use handleText to capture the text.
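To make that a bit more concrete, here is a minimal sketch of the callback approach. It is only one way to do it: instead of constructing DocumentParser directly (which needs a DTD), it uses javax.swing.text.html.parser.ParserDelegator, which sets up the default DTD and forwards to the same kind of ParserCallback. The class name ListItemExtractor, the method extractListItems and the sample HTML are made up for illustration, and it assumes the topics you want are the only <li> elements on the page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
public class ListItemExtractor {

    /** Collects the trimmed text of every <li> element found in the HTML source. */
    static List<String> extractListItems(Reader html) throws IOException {
        final List<String> items = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside an <li> element.
            private StringBuilder current;

            // Stores the text gathered so far (if any) and leaves the "inside <li>" state.
            private void flush() {
                if (current != null) {
                    String text = current.toString().trim();
                    if (!text.isEmpty()) {
                        items.add(text);
                    }
                    current = null;
                }
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.LI) {
                    flush();                       // in case the previous <li> was never closed
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);          // text may arrive in several chunks
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.LI) {
                    flush();
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return items;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input; a real run would pass a Reader over the downloaded page.
        String sample = "<html><body><ul>"
                + "<li>Distributed algorithms of mobile computing</li>"
                + "<li>Wireless multimedia systems</li>"
                + "</ul></body></html>";
        for (String item : extractListItems(new StringReader(sample))) {
            System.out.println(item);
        }
    }
}
```

If the page has other lists (menus and so on), you would probably need an extra check, for example only collecting items after a particular heading has been seen. The resulting list could then be written to a text file with java.nio.file.Files.write.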
It's quite a bit of work to do it that way for just one page. Copy-pasting the list (I used Firefox and vim-gnome on Ubuntu just now) gives me plain text one line per list item. Using a Pattern on the whole document can be hard, but it would be easy to do it line-by-line if you were to read the file with BufferedReader.readLine(), for example.
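For the line-by-line route, something like the sketch below might be enough. It rests on a big assumption that I have not checked against the actual page: that every list item sits on a single line in the form <li ...>text</li>. The class name LineByLineExtractor, the method name and the sample input are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class LineByLineExtractor {

    // Matches <li>...</li> on one line. The optional attribute part must start with
    // whitespace so that tags such as <link ...> are not mistaken for <li>.
    private static final Pattern LIST_ITEM =
            Pattern.compile("<li(?:\\s[^>]*)?>(.*?)</li>", Pattern.CASE_INSENSITIVE);

    /** Reads the source one line at a time and returns the text of each <li> found. */
    static List<String> extractListItems(Reader source) throws IOException {
        List<String> items = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = LIST_ITEM.matcher(line);
                while (matcher.find()) {
                    // Drop any inline tags (e.g. <b>, <a>) left inside the item.
                    String text = matcher.group(1).replaceAll("<[^>]+>", "").trim();
                    if (!text.isEmpty()) {
                        items.add(text);
                    }
                }
            }
        }
        return items;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input standing in for the saved HTML file.
        String sample = "<ul>\n"
                + "<li>Distributed algorithms of mobile computing</li>\n"
                + "<li>Wireless multimedia systems</li>\n"
                + "</ul>\n";
        for (String item : extractListItems(new StringReader(sample))) {
            System.out.println(item);
        }
    }
}
```

Note that this does not decode entities such as &amp;amp; and will miss any item that spans more than one line, so the parser-based approach is the safer choice if the pages vary.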
Re: extracting a list of few lines from html file
OK sir,
I understand that simply copy-pasting would have been the better option. But I want to create a list from many such pages, and I would be glad if I could automate it somehow. :)
I will try your suggestion for the parser. :) Thank you.