
Thread: PARSE ANY FILE USING AUTO DETECT PARSER AND APACHE TIKA LIBRARY

  1. #1
    Junior Member
    Join Date
    Dec 2013
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts

    PARSE ANY FILE USING AUTO DETECT PARSER AND APACHE TIKA LIBRARY

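    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Scanner;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Example: extract the text and metadata of a file with Tika's AutoDetectParser.
    class ParseFile
    {
        void parseDoc(final String resourceLocation) throws IOException, SAXException, TikaException
        {
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();

            // The parser does not close the stream, so try-with-resources closes it,
            // even if parse() throws.
            try (InputStream input = new FileInputStream(resourceLocation)) {
                parser.parse(input, textHandler, metadata);
            }

            System.out.println("Tika Parser starts...\n");
            System.out.println("file name: " + resourceLocation);
            // Metadata key names can differ between Tika versions and file types,
            // so either value may be null.
            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + textHandler.toString());
            System.out.println("Tika Parser stops...");
        }
    }

    // Save this file as AutoParse.java; ParseFile is package-private so both classes fit in one file.
    public class AutoParse
    {
        public static void main(String[] args) throws IOException, SAXException, TikaException
        {
            ParseFile pf = new ParseFile();
            // Read the path of the file to parse from standard input.
            Scanner sc = new Scanner(System.in);
            final String resourceLocation = sc.nextLine();
            pf.parseDoc(resourceLocation);
        }
    }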

    /* Using this example we can parse any file type that Tika supports by calling parseDoc() with the file name (including its location) as the parameter.

    In the above example, I first create a FileInputStream for the document to parse. Then I use a Tika content handler called BodyContentHandler, which internally constructs a content handler decorator that converts XHTML to text (ToTextContentHandler). The decorator forms the plain-text output from the SAX events that the parser emits. Next I instantiate an AutoDetectParser directly, call the parse method, and close the stream. The caller has to close the InputStream, because the parser does not close it for the user.

    Tika provides some ready-made ContentHandler implementations that can be useful while parsing content with Tika.

    Finally, the metadata parameter works in both directions: it passes additional data to the parser as input and returns additional metadata extracted from the document. Examples of metadata include the author name, number of pages, creation date, etc. */
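
How a handler turns SAX events into plain text, as described in the comment above, can be sketched with the JDK alone. This is only an illustration of the idea, not Tika's BodyContentHandler: the class name BodyTextSketch and its extract() helper are made up for this example, and the input is a hand-written XHTML string rather than real parser output.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for a body-text handler (not Tika code).
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting of everything inside it.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep character data only while inside <body>; <head> content is skipped.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses an XHTML string with the JDK's SAX parser and returns the body text.
    static String extract(String xhtml) {
        BodyTextSketch handler = new BodyTextSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Report</title></head>"
                + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

The title inside the head element is dropped because the handler only collects character events while it is inside the body element, which is the same filtering idea the post attributes to BodyContentHandler.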
    /* For parsing an image file:

    ImageParser parser = new ImageParser();

    For parsing a PDF:

    PDFParser parser = new PDFParser();

    */

    /* AutoDetectParser parser = new AutoDetectParser();

    More on using the Tika API:
    Apache Tika- A Content Extraction Framework | amitbariar
    */
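
The difference between naming a parser such as PDFParser yourself and letting AutoDetectParser choose comes down to detecting the file type first. A rough sketch of signature-based detection follows, assuming only three well-known file signatures; Tika's own detection is more elaborate, and the class MagicSketch and its methods are hypothetical names used for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of signature-based type detection (not Tika's detector).
class MagicSketch {
    // Returns true if data begins with the given byte values (0..255).
    static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // Maps the first bytes of a file to a media type, with a generic fallback.
    static String detect(byte[] header) {
        if (startsWith(header, '%', 'P', 'D', 'F', '-')) {
            return "application/pdf";
        }
        if (startsWith(header, 0x89, 'P', 'N', 'G')) {
            return "image/png";
        }
        if (startsWith(header, 0xFF, 0xD8, 0xFF)) {
            return "image/jpeg";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf));
    }
}
```

Once the type is known, a dispatcher can hand the stream to the matching parser, which is the convenience AutoDetectParser offers over instantiating ImageParser or PDFParser by hand.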


  2. #2
    Super Moderator Norm's Avatar
    Join Date
    May 2010
    Location
    Eastern Florida
    Posts
    25,042
    Thanks
    63
    Thanked 2,708 Times in 2,658 Posts

    Re: PARSE ANY FILE USING AUTO DETECT PARSER AND APACHE TIKA LIBRARY

    Check the Caps Lock key on your keyboard. All the letters in the title are uppercase.
    If you don't understand my answer, don't ignore it, ask a question.

  3. #3
    Junior Member
    Join Date
    Dec 2013
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: PARSE ANY FILE USING AUTO DETECT PARSER AND APACHE TIKA LIBRARY

    Quote: Originally posted by Norm
    Check the Caps Lock key on your keyboard. All the letters in the title are uppercase.
    It's not a big deal.
    I think you should focus on the importance of the content rather than on Caps Lock.


  4. #4
    Super Moderator Norm's Avatar
    Join Date
    May 2010
    Location
    Eastern Florida
    Posts
    25,042
    Thanks
    63
    Thanked 2,708 Times in 2,658 Posts

    Re: PARSE ANY FILE USING AUTO DETECT PARSER AND APACHE TIKA LIBRARY

    Many find ALL CAPS annoying. Like shouting in a small room. It's considered rude behavior.

    Do you have a problem or question about the code?

    Please edit your post and wrap your code with code tags:
    [code=java]
    YOUR CODE HERE
    [/code]
    to get highlighting and preserve formatting.
