Thread: To remove duplicate values from 2 big files

  1. #1
    Member
    Join Date
    Mar 2011
    Posts
    114
    Thanks
    6
    Thanked 0 Times in 0 Posts

    Default To remove duplicate values from 2 big files

    I have two files, File1.txt and file2.txt, containing text values. I have to read both files and write the unique values into another file.
    What is the best approach for doing this?

    One idea is to use a HashMap to filter out the duplicate values, but in interviews that's not the expected answer.
    Does anyone have another idea?


  2. #2
    Super Moderator Norm's Avatar
    Join Date
    May 2010
    Location
    Eastern Florida
    Posts
    25,042
    Thanks
    63
    Thanked 2,708 Times in 2,658 Posts

    Default Re: To remove duplicate values from 2 big files

    Also a Set could be used to detect duplicates.
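
    A minimal sketch of how a Set can detect duplicates, assuming the values are already in memory as strings. It relies on Set.add returning false when the value is already present. The class and method names (DuplicateDetector, findDuplicates) are made up for illustration.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical helper, not from the thread: reports repeated values.
    class DuplicateDetector {
        // Returns each value that occurs more than once, reported a single time,
        // in the order in which its first repeat is encountered.
        static List<String> findDuplicates(List<String> values) {
            Set<String> seen = new HashSet<>();
            Set<String> reported = new HashSet<>();
            List<String> duplicates = new ArrayList<>();
            for (String value : values) {
                // add() returns false when the set already contains the value
                if (!seen.add(value) && reported.add(value)) {
                    duplicates.add(value);
                }
            }
            return duplicates;
        }

        public static void main(String[] args) {
            List<String> words = Arrays.asList("car", "fantastic", "fantastic", "jelly");
            System.out.println(findDuplicates(words));
        }
    }
    ```

    The same return value can be used the other way round: keep only the values for which add() returns true.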

  3. #3
    Member
    Join Date
    Feb 2012
    Posts
    106
    My Mood
    Yeehaw
    Thanks
    8
    Thanked 11 Times in 11 Posts

    Default Re: To remove duplicate values from 2 big files

    Well, assuming the files are not monstrously large, start with a single sorting pass over each file.

    Then, once they are sorted, you would step through both files in tandem.

    Showing an example of the process is better than explaining it in terms of code.

    file1        file2
    car          cast
    fantastic    daring
    fantastic    jelly
    jelly        rabbit
    star         star
    steam        steam
    telling      tell
                 yetti
                 zebra



    (It is recommended that you deal with duplicates within the same file while you are sorting the lists.)
    So you would have a current word for each file.
    The current word in file1 has to be checked against the current word in file2.

    The best way, I feel, to do this is to look at each character (char) and use the operators >, < and == to learn how the letters of the two words relate.
    Once you have gone through enough characters to get your answer, you advance the current word in the file whose word comes first, i.e. the file that the other one has passed.

    more detailed example
    cur1 = car
    cur2 = cast

    when we compare the characters
    c == c, so we continue
    a == a, so we continue
    r < s, so we stop

    We know that cur1 is less than cur2. This means we are done looking at cur1, as it cannot appear later in the other list.
    Update it and repeat (probably adding it to the no-duplicates list, depending on how you structure your code).

    A successful match looks like this:
    cur1 = star
    cur2 = star

    s == s
    t == t
    a == a
    r == r
    end == end

    This is a matching word, so update both cur words.

    Note: tell != telling. This means you have to detect the case where one word has ended and the other has not:

    end != (char)

    Also remember to look at capitalization.
    Hint: char is a primitive type and is converted to int automatically, so chars compare without much work.
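
    The tandem walk described above can be sketched as follows. This is only an illustration under stated assumptions: both lists are already sorted, "unique" is taken to mean that every distinct value is written once, and String.compareTo stands in for the hand-written character loop (it compares char by char, treats a word that ends first as smaller, so tell sorts before telling, and is case-sensitive). The names SortedMerge and mergeUnique are made up.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch: merges two sorted lists, keeping each value once.
    class SortedMerge {
        // Assumes both inputs are sorted according to String.compareTo.
        static List<String> mergeUnique(List<String> sorted1, List<String> sorted2) {
            List<String> result = new ArrayList<>();
            int i = 0;
            int j = 0;
            while (i < sorted1.size() || j < sorted2.size()) {
                String next;
                if (j >= sorted2.size()) {
                    next = sorted1.get(i++);          // file2 exhausted
                } else if (i >= sorted1.size()) {
                    next = sorted2.get(j++);          // file1 exhausted
                } else {
                    int cmp = sorted1.get(i).compareTo(sorted2.get(j));
                    if (cmp < 0) {
                        next = sorted1.get(i++);      // cur1 is smaller, advance file1
                    } else if (cmp > 0) {
                        next = sorted2.get(j++);      // cur2 is smaller, advance file2
                    } else {
                        next = sorted1.get(i++);      // match, advance both
                        j++;
                    }
                }
                // Skip a value equal to the last one written (repeat within a file).
                if (result.isEmpty() || !result.get(result.size() - 1).equals(next)) {
                    result.add(next);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> file1 = Arrays.asList("car", "fantastic", "fantastic", "jelly");
            List<String> file2 = Arrays.asList("cast", "daring", "jelly", "rabbit");
            System.out.println(mergeUnique(file1, file2));
        }
    }
    ```

    If only the words that appear in exactly one file are wanted, the match branch would advance both indexes without adding anything.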

  4. #4
    Administrator copeg's Avatar
    Join Date
    Oct 2009
    Location
    US
    Posts
    5,320
    Thanks
    181
    Thanked 833 Times in 772 Posts
    Blog Entries
    5

    Default Re: To remove duplicate values from 2 big files

    Sorting would be quite inefficient compared to using some type of hash approach. A Map or a Set would be more appropriate. (A HashMap may be 'not expected in an interview' given that it's meant to keep track of key/value pairs, which is not what your requirements are, but ironically a HashSet, which would be more in line with your requirements, is backed by a HashMap.)

  5. #5
    Member
    Join Date
    Feb 2012
    Posts
    106
    My Mood
    Yeehaw
    Thanks
    8
    Thanked 11 Times in 11 Posts

    Default Re: To remove duplicate values from 2 big files

    Well, I am curious now: what is a HashMap and how does it work?

  6. #6
    Administrator copeg's Avatar
    Join Date
    Oct 2009
    Location
    US
    Posts
    5,320
    Thanks
    181
    Thanked 833 Times in 772 Posts
    Blog Entries
    5

    Default Re: To remove duplicate values from 2 big files

    Quote: Originally posted by JonLane
    "Well, I am curious now: what is a HashMap and how does it work?"

    A quick Google search will give you a lot more information, but here are 2 resources:
    For an overview of the Map interface: The Map Interface (The Java™ Tutorials > Collections > Interfaces)
    For an overview of the data structure and how it works: Hash table - Wikipedia, the free encyclopedia

  7. #7
    Member clydefrog's Avatar
    Join Date
    Feb 2012
    Posts
    67
    Thanks
    15
    Thanked 2 Times in 2 Posts

    Default Re: To remove duplicate values from 2 big files

    I agree with copeg: using a Set would be much more efficient, such as a HashSet (if you don't care about the order) or a LinkedHashSet (if you want to maintain insertion order).
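
    For the original question, a rough sketch with a LinkedHashSet might look like the following. It assumes one value per line, UTF-8 text, and that the distinct values fit in memory. The class name UniqueLines, the method collectUnique and the command-line arguments are made up for illustration.

    ```java
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Hypothetical sketch: collects the distinct lines of several files.
    class UniqueLines {
        // Reads every input line by line; the set silently ignores repeats
        // and keeps the order in which values were first seen.
        static Set<String> collectUnique(Path... inputs) throws IOException {
            Set<String> unique = new LinkedHashSet<>();
            for (Path input : inputs) {
                try (BufferedReader reader =
                         Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        unique.add(line);
                    }
                }
            }
            return unique;
        }

        public static void main(String[] args) throws IOException {
            if (args.length != 3) {
                System.err.println("usage: UniqueLines <input1> <input2> <output>");
                return;
            }
            Set<String> unique = collectUnique(Paths.get(args[0]), Paths.get(args[1]));
            Files.write(Paths.get(args[2]), unique, StandardCharsets.UTF_8);
        }
    }
    ```

    Swapping LinkedHashSet for HashSet gives the same set of values, but the order of the output lines is then unspecified.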
