
Thread: extract selected text out of a pdf file using java

  1. #1
    Junior Member
    Join Date
    Mar 2012
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts

    extract selected text out of a pdf file using java

    Hello everyone,
    I want to generate a report from a PDF file. The PDF file contains around 800 pages and lists students along with their college codes and roll numbers. The roll number has the format XXXXYYZZZZZZ

    where XXXX is the college code (0821, 0827, 0831, or something like that; note that the college code is a 4-digit numeric code),

    YY is the branch code, something like CS, ME, IT, EC, etc. (a two-letter alphabetical code), and

    ZZZZZZ is the student's roll code (a 6-digit numeric code).

    So, in short, the roll number contains:
    characters 1-4: college code (numeric)
    characters 5-6: branch code (alphabetical)
    characters 7-12: roll code (numeric)
    An example roll number is 0821cs091021.

    The PDF file is sorted by roll number, i.e. by college and then by branch, in ascending order. It also contains a lot of other information, such as the student's name, father's name, university name and logo, and more. All of that is irrelevant to me.
    The only relevant data is the list of roll numbers.
    All I want is to generate separate text files such that each file contains the roll numbers of a single college in ascending order.

    If you need a sample PDF file, send me a private message with your e-mail address and I will mail it to you.


  2. #2
    Forum VIP
    Join Date
    Oct 2010
    Posts
    275
    My Mood
    Cool
    Thanks
    32
    Thanked 54 Times in 47 Posts
    Blog Entries
    2

    Re: extract selected text out of a pdf file using java

    Well, if you don't expect to find any similar strings elsewhere in the text, you could use a regular expression and parse each match. The regular expression would look something like
    "\\d{4}\\D{2}\\d{6}"

    Then, using that expression, just use the regexp classes (java.util.regex) and you can iterate through each ID you find.
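A minimal sketch of that loop, assuming the PDF text has already been extracted into a String. The class and method names (RollNumberFinder, findRollNumbers) are made up for illustration. The pattern here is a slightly stricter variant of the one above: \D accepts any non-digit character, including spaces and punctuation, so this sketch restricts the branch code to letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RollNumberFinder {
    // 4 digits (college), 2 letters (branch), 6 digits (roll code).
    // Assumption: branch codes are ASCII letters only, in either case.
    private static final Pattern ROLL_NO = Pattern.compile("\\d{4}[A-Za-z]{2}\\d{6}");

    // Returns every roll number found in the text, in order of appearance.
    public static List<String> findRollNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ROLL_NO.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; real PDF text will look different.
        String sample = "Name: A. Student  Roll No: 0821cs091021  Roll No: 0827ME101234";
        System.out.println(findRollNumbers(sample));
    }
}
```

If other 12-character strings of the same shape can appear in the document, the matches would need an extra check before being trusted.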

    Lesson: Regular Expressions (The Java™ Tutorials > Essential Classes)
    Last edited by Tjstretch; March 7th, 2012 at 12:09 PM. Reason: Simplified regexp

  3. The Following User Says Thank You to Tjstretch For This Useful Post:

    ashish.sharma (March 7th, 2012)

  4. #3
    Administrator copeg's Avatar
    Join Date
    Oct 2009
    Location
    US
    Posts
    5,320
    Thanks
    181
    Thanked 833 Times in 772 Posts
    Blog Entries
    5

    Re: extract selected text out of a pdf file using java

    Use a third-party library such as Apache POI - the Java API for Microsoft Documents - to extract the text. Then parse the text as appropriate.

  5. The Following User Says Thank You to copeg For This Useful Post:

    ashish.sharma (March 7th, 2012)

  6. #4
    Junior Member
    Join Date
    Mar 2012
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts

    Re: extract selected text out of a pdf file using java

    First of all, thanks a lot for the replies.
    Secondly, I didn't find any library for PDF files at the Apache POI link.

  7. #5
    Forum VIP
    Join Date
    Oct 2010
    Posts
    275
    My Mood
    Cool
    Thanks
    32
    Thanked 54 Times in 47 Posts
    Blog Entries
    2

    Re: extract selected text out of a pdf file using java

    Regular expressions do not require an external library; I provided a link to a tutorial on using Java's regular expression package. An example of cycling through a small phrase (small enough to keep in memory) can be found at Test Harness (The Java™ Tutorials > Essential Classes > Regular Expressions)
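For the original goal of one text file per college, a rough sketch using only the standard library might look like the following. It assumes the PDF text is already available as a String (the extraction step itself is not shown), and the class name RollNumberGrouper, the output directory "roll-numbers", and the file naming (college code plus ".txt") are all invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RollNumberGrouper {
    // Group 1 captures the 4-digit college code; letters-only branch code is an assumption.
    private static final Pattern ROLL_NO = Pattern.compile("(\\d{4})[A-Za-z]{2}\\d{6}");

    // Maps each college code to its roll numbers; both levels are kept sorted.
    public static Map<String, SortedSet<String>> groupByCollege(String text) {
        Map<String, SortedSet<String>> byCollege = new TreeMap<>();
        Matcher m = ROLL_NO.matcher(text);
        while (m.find()) {
            byCollege.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group());
        }
        return byCollege;
    }

    // Writes one file per college, one roll number per line.
    public static void writeFiles(Map<String, SortedSet<String>> byCollege, Path outDir)
            throws IOException {
        Files.createDirectories(outDir);
        for (Map.Entry<String, SortedSet<String>> e : byCollege.entrySet()) {
            Files.write(outDir.resolve(e.getKey() + ".txt"), e.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input; in practice this would be the text extracted from the PDF.
        String text = "0827ME101234 0821cs091021 0821cs091005";
        writeFiles(groupByCollege(text), Paths.get("roll-numbers"));
    }
}
```

The TreeSet sorts strings by character value, so upper-case branch codes sort before lower-case ones. If the PDF mixes cases, normalizing with toUpperCase() before adding would give a consistent order. Using a set also drops duplicate roll numbers, which may or may not be what is wanted.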

  8. #6
    Junior Member
    Join Date
    Jul 2013
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: extract selected text out of a pdf file using java

    You can extract text from a PDF file with the Aspose.PDF for Java library. ...Click Here... to view the code.
    Last edited by copeg; December 2nd, 2013 at 09:35 AM. Reason: Removed link
