To remove duplicate values from 2 big files
I have two files, File1.txt and file2.txt, containing text values. I have to read both files and write the unique values into another file.
What is the best approach for doing this?
One idea is to use a HashMap to catch the duplicate values, but in an interview that is not the expected answer.
Does anyone have any other ideas?
Re: To remove duplicate values from 2 big files
Also a Set could be used to detect duplicates.
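As a rough sketch of the Set idea (the class and method names here are made up for illustration): Set.add returns false when the value is already in the set, so that return value alone is enough to spot a duplicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not from the original thread.
class DuplicateDetector {

    // Returns every repeat occurrence found in the input, in encounter order.
    static List<String> findDuplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String value : values) {
            // add() returns false if the set already contained the value.
            if (!seen.add(value)) {
                duplicates.add(value);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("car", "fantastic", "fantastic", "jelly");
        System.out.println(findDuplicates(words)); // prints [fantastic]
    }
}
```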
Re: To remove duplicate values from 2 big files
Well, assuming the files are not monstrously large, start by sorting each one.
Then, once they are sorted, you step through the two files in tandem.
Showing an example of the process is better than explaining it in terms of code.
file1        file2
car          cast
fantastic    daring
fantastic    jelly
jelly        rabbit
star         star
steam        steam
telling      tell
-            yetti
-            zebra
(It is best to deal with duplicates within a single file while you are sorting it.)
So you keep a current word for each file.
The current word in file1 is checked against the current word in file2.
The best way I feel to do this is to look at each character (char) and use the operators <, > and == to work out how the two words are ordered.
Once you have gone through enough characters to get your answer, you advance the current word in the file whose word sorts first.
A more detailed example:
cur1 = car
cur2 = cast
when we compare the characters
c == c, so we continue
a == a, so we continue
r < s, so we stop
We know that cur1 is less than cur2. This means cur1 is finished with: it can never turn up later in file2, because file2 is sorted.
Update cur1 to the next word and repeat (probably adding the old word to the no-duplicates list, depending on how you structure your code).
A matching word looks like this:
cur1 = star
cur2 = star
s == s
t == t
a == a
r == r
end == end
This is a matching word, so update both current words.
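A minimal sketch of that tandem walk, under a few assumptions: both inputs are already sorted and held in memory as lists, "unique" means every value is written once, and String.compareTo stands in for the manual character comparison (it compares char by char in the same way). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not from the original thread.
class SortedMerge {

    // Walks two sorted lists in tandem and returns each value exactly once.
    static List<String> mergeUnique(List<String> file1, List<String> file2) {
        List<String> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < file1.size() || j < file2.size()) {
            String next;
            if (j >= file2.size()) {
                next = file1.get(i++);          // file2 is exhausted
            } else if (i >= file1.size()) {
                next = file2.get(j++);          // file1 is exhausted
            } else {
                int cmp = file1.get(i).compareTo(file2.get(j));
                if (cmp < 0) {
                    next = file1.get(i++);      // cur1 sorts first, advance file1
                } else if (cmp > 0) {
                    next = file2.get(j++);      // cur2 sorts first, advance file2
                } else {
                    next = file1.get(i);        // matching word, advance both
                    i++;
                    j++;
                }
            }
            // Skip a word that repeats inside a single file (e.g. "fantastic").
            if (result.isEmpty() || !result.get(result.size() - 1).equals(next)) {
                result.add(next);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("car", "fantastic", "fantastic", "jelly", "star", "steam", "telling");
        List<String> file2 = Arrays.asList("cast", "daring", "jelly", "rabbit", "star", "steam", "tell", "yetti", "zebra");
        System.out.println(mergeUnique(file1, file2));
    }
}
```

With the two example lists above this yields car, cast, daring, fantastic, jelly, rabbit, star, steam, tell, telling, yetti, zebra. If matching words should be dropped entirely instead of written once, the matching branch would advance both indexes without adding anything.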
Note: tell != telling. This means you have to detect when one word ends while the other still has characters left:
end != (char)
Also remember to take capitalization into account.
Hint: char is a primitive type and is converted to int automatically, so chars can be compared directly without much work.
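The character-by-character comparison could look roughly like this. It is only a sketch; the class and method names are made up, and in practice String.compareTo already does the same job.

```java
// Hypothetical helper class, not from the original thread.
class WordCompare {

    // Negative if a sorts before b, zero if they match, positive if a sorts after b.
    static int compareWords(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        for (int k = 0; k < shared; k++) {
            char ca = a.charAt(k);
            char cb = b.charAt(k);
            if (ca != cb) {
                // chars widen to int, so plain subtraction gives the ordering.
                return ca - cb;
            }
        }
        // All shared characters matched; the word that ended first sorts first
        // (this is the "end != (char)" case, e.g. tell vs telling).
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        System.out.println(compareWords("car", "cast") < 0);     // true: r < s
        System.out.println(compareWords("star", "star") == 0);   // true: matching word
        System.out.println(compareWords("tell", "telling") < 0); // true: tell ends first
    }
}
```

On capitalization: upper-case letters have smaller char values than lower-case ones, so "Star" sorts before "star" here. If case should not matter, one option is to lower-case both words before comparing them.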
Re: To remove duplicate values from 2 big files
Sorting would be quite inefficient compared to using some type of hash approach: a sort needs O(n log n) comparisons, while adding n values to a hash-based collection is O(n) on average. A Map or a Set would be more appropriate. (A HashMap may be 'not expected in an interview' because it is meant to keep track of key/value pairs, which is not what your requirements call for. Ironically, a HashSet, which would be more in line with your requirements, is backed by a HashMap.)
Re: To remove duplicate values from 2 big files
Well, I am curious now. What is a HashMap and how does it work?
Re: To remove duplicate values from 2 big files
Quote: Originally Posted by JonLane
Well, I am curious now. What is a HashMap and how does it work?
A quick Google search will give you a lot more information, but here are two resources:
For an overview of the Map interface: The Map Interface (The Java™ Tutorials > Collections > Interfaces)
For an overview of the data structure and how it works: Hash table - Wikipedia, the free encyclopedia
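In short, a HashMap stores key/value pairs and uses each key's hashCode() to pick a bucket, so a lookup does not have to scan every entry. A small sketch of typical usage (the class and method names are made up for illustration), counting how often each word appears:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not from the original thread.
class HashMapBasics {

    // Maps each word to the number of times it occurs in the input.
    static Map<String, Integer> countOccurrences(String... words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // merge() stores 1 for a new key, otherwise adds 1 to the old value.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countOccurrences("car", "fantastic", "fantastic", "jelly");
        System.out.println(counts.get("fantastic"));    // prints 2
        System.out.println(counts.containsKey("zebra")); // prints false
    }
}
```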
Re: To remove duplicate values from 2 big files
I agree with copeg: a set would be much more efficient, such as a HashSet (if you don't care about the order) or a LinkedHashSet (if you want to maintain insertion order).
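For what it's worth, here is a minimal sketch of the set approach applied to the original question. It assumes one value per line, UTF-8 text, and that all unique values fit in memory; the class name, the method names and the output path are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, not from the original thread.
class UniqueLines {

    // Reads a file line by line and adds each line to the set.
    static void addLines(Path file, Set<String> target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                target.add(line); // duplicates are silently ignored by the set
            }
        }
    }

    // Writes every distinct line from both inputs once, in first-seen order.
    static void writeUniqueLines(Path first, Path second, Path output) throws IOException {
        Set<String> unique = new LinkedHashSet<>();
        addLines(first, unique);
        addLines(second, unique);
        Files.write(output, unique, StandardCharsets.UTF_8);
    }
}
```

A call might look like UniqueLines.writeUniqueLines(Paths.get("File1.txt"), Paths.get("file2.txt"), Paths.get("unique.txt")), where unique.txt is just a placeholder name. Swapping LinkedHashSet for HashSet drops the ordering guarantee.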