extract selected text out of a pdf file using java
Hello everyone,
I want to generate a report from a PDF file. The PDF file contains around 800 pages and has a list of students along with their college codes and roll numbers. The roll number has the format XXXXYYZZZZZZ,
where XXXX is the college code (0821, 0827, 0831 or something like that; note that the college code is a 4-digit numeric code),
YY is the branch code, something like CS, ME, IT, EC, etc.; it is a two-letter alphabetical code,
ZZZZZZ is the roll code of the student; it is a 6-digit numeric code.
So in short, the roll number contains:
characters 1-4: college code (numeric)
characters 5-6: branch code (alphabetical)
characters 7-12: roll code (numeric)
An example roll number is 0821cs091021.
The PDF file is sorted by roll number, i.e. college-wise, then branch-wise, in ascending order. It also contains a lot of other information such as student name, father's name, university name and logo, and more; all of that is irrelevant to me.
The only relevant data is the list of roll numbers.
All I want is to generate separate text files such that each file contains the roll numbers of one college in ascending order.
If you need a sample PDF file, private message me your e-mail ID and I will mail it to you.
Re: extract selected text out of a pdf file using java
Well, if you don't expect to find any similar strings, you could use a regular expression to pick out each one. The regular expression would match four digits, followed by two letters, followed by six digits.
Then, using that expression, just use the regex classes (java.util.regex) and you can iterate through each ID you find.
Lesson: Regular Expressions (The Java™ Tutorials > Essential Classes)
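As a rough sketch of the regular-expression approach described above (not the exact expression from the reply), something like the following could iterate through each roll number. It assumes the PDF text has already been extracted into a String by some other means; the class name RollNumberFinder, the method findRollNumbers and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RollNumberFinder {
    // Four digits (college code), two letters (branch code), six digits (roll code).
    // The lookarounds stop a match from starting or ending inside a longer run of digits.
    static final Pattern ROLL_NO =
            Pattern.compile("(?<!\\d)(\\d{4})([A-Za-z]{2})(\\d{6})(?!\\d)");

    // Returns every roll number found in the text, in order of appearance.
    static List<String> findRollNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ROLL_NO.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for one extracted PDF page.
        String sample = "Name: A. Kumar  Roll No: 0821cs091021  "
                      + "Name: B. Singh  Roll No: 0827ME091005";
        for (String rollNo : findRollNumbers(sample)) {
            System.out.println(rollNo);
        }
    }
}
```

If other 12-character strings of the same shape appear in the PDF (dates or serial numbers, for instance), the pattern would pick those up as well, so it may need tightening against the real data.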
Re: extract selected text out of a pdf file using java
Use a third-party library such as Apache POI - the Java API for Microsoft Documents to extract the text. Then parse the text as appropriate.
Re: extract selected text out of a pdf file using java
First of all, thanks a lot for the reply.
Secondly, I didn't find any library for PDF files at the Apache POI link.
Re: extract selected text out of a pdf file using java
Regular expressions do not require an external library; I provided a link to a tutorial on using Java's regular expression package. An example of cycling through a small phrase (small enough to keep in memory) can be found at Test Harness (The Java™ Tutorials > Essential Classes > Regular Expressions).
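To tie this back to the original question, here is a minimal sketch of how the matches could be grouped by college code and written out as one text file per college, using only the standard library. It assumes the text has already been extracted from the PDF into a plain-text file. The class name RollNumberSplitter, the output naming scheme (college code plus ".txt") and the two command-line arguments are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RollNumberSplitter {
    // Group 1 captures the 4-digit college code; the whole match is the roll number.
    static final Pattern ROLL_NO =
            Pattern.compile("(?<!\\d)(\\d{4})[A-Za-z]{2}\\d{6}(?!\\d)");

    // Collects roll numbers per college code. The sorted map and sets keep
    // everything in ascending order and drop duplicates (ignoring letter case).
    static Map<String, Set<String>> groupByCollege(String text) {
        Map<String, Set<String>> byCollege = new TreeMap<>();
        Matcher m = ROLL_NO.matcher(text);
        while (m.find()) {
            byCollege
                .computeIfAbsent(m.group(1), k -> new TreeSet<>(String.CASE_INSENSITIVE_ORDER))
                .add(m.group());
        }
        return byCollege;
    }

    // Writes one file per college (for example 0821.txt), one roll number per line.
    static void writeFiles(Map<String, Set<String>> byCollege, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<String, Set<String>> entry : byCollege.entrySet()) {
            Files.write(outputDir.resolve(entry.getKey() + ".txt"), entry.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: text file already extracted from the PDF (assumed to exist)
        // args[1]: directory in which to create the per-college files
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        writeFiles(groupByCollege(text), Paths.get(args[1]));
    }
}
```

Reading the whole text into memory should be fine for text from roughly 800 pages; if it turns out not to be, the same matcher loop could be run line by line instead.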