Eliminating duplicate values
Hi,
I have a csv file containing links to a site and number of time it was hit. It contains lakhs of such entries. Some link entries are duplicate like the following entry is present in the file multiple times:
http://www.xyz.com/link1, 15
http://www.xyz.com/link1, 13
http://www.xyz.com/link1, 17
http://www.xyz.com/link1, 18
http://www.xyz.com/link1, 11
Now I want to combine these link into a single entry in file like:
http://www.xyz.com/link1, 74
I have a program which reads lines from one file and writes them into second file. Whenever some duplicate comes it does not enter it into the second file but just add its value to number of hits field.
In the end second file is created with unique links and number of times they were hit.
But As I have to read single link from first file and compare it with each link in second file in loop. This is taking lot of time. (in days)
Can anyone suggest me a better solution to this. Someone who has worked with regex might help.
If I have posted this in wrong place in the site please send me the link where i should post this topic.
Thanks in advance.
Re: Eliminating duplicate values
I've done a similar sort of parsing quite a few times. A very simplistic way to do it is outlined below. It essentially makes a list and adds objects of type Website to the list. Website overrides the equals and hashCode methods (allows you to check for identical entries in the list), and add unique or merge previous. An alternative way to do this with simple data that you have, is to have a HashMap rather than a list, using the url as key, and an Integer object as value.
This works for smaller types of files, but a major problem in doing it this way is the memory issues for really large files because this method requires that you read it all into memory (YUCK). Of course you can increase the memory of the JVM, but sometimes for really large files this code may not be optimal.
An alternative to overcome the file size problem is write out to several files - either dynamically or in chunks - based upon some type of sorting (say alphabetize by the url), then merge these at the end. It makes checking the url's much faster because you can search smaller files rather than one large output file, and you may not have the memory problems because you don't have to read everything into an array as you can dynamically check them.
Code :
import java.util.Arraylist;
imprt java.io.*;
public class Parser{
public static void main(String[] args){
///do all the file opening here
ArrayList<Website> list = new ArrayList<Website>();
String line = null;
while ( ( line = reader.readLine() ) != null ){
String splittee[] = line.split(",");
Website site = new Website(splittee[0].trim(), Integer.parseInt(splittee[1].trim()) );
if ( !list.contains(site) ){
list.add(site);
}else{
int index = list.indexOf(site);
list.get(index).merge(site);
}
}
//do all the file closing and cleanup
//write out to the output file
}
public static class Website{
private String url;
private int count;
public Website( String u, int c ){
super();
this.url = u;
this.count = c;
}
@Override public boolean equals(Object o){
return url.equals((String)o);
}
@Override public int hashCode(){
return url.hashCode();
}
public void merge(Website s){
count += s.count;
}
}
}
Re: Eliminating duplicate values
Thanks a ton for the reply Copeg. I will try this code tonight and will let you know how things worked out.
Cheers :)
Re: Eliminating duplicate values
If I was you I'd just have it in a Map<String, Integer>, that would sort out your issues automatically.
The key in the map would be the URL and the integer is just the number of hits.
After you've read a line, check the map to see if the key already exists, if it does just increment its value otherwise create that key and insert its value.
// Json
Re: Eliminating duplicate values
Thanks Jason for the reply.
Yeah I tried using HasMAp and treeMap but the problem with them is that the resulting Map stores the values in a way which does not match the original order of input. (My links file is sorted according to number of hits).
I want that order to be maintained and Map is not solving the purpose. Or can you tell me a way to get the values out from Map sorted according to its value.
Re: Eliminating duplicate values
Unfortunately, sets don't keep stuff in an ordered format :P
The easiest solution to this dilemma is to use a list (probably an array) to store everything you read in, but on read-in, run through all the elements already added to see if the element you're adding is a duplicate of any element already in there, and handle that appropriately. Then, when you want to read out, all your elements are in a nice neat order.
Another solution is to put everything into the hash map as Json said, with the addition of an integer that keeps track of what order they were added. Then, once everything has been added, take everything out of the hash map and sort it via your favorite sorting algorithm based on the integer you added that denotes what order they were added in.
The first solution takes O(n^2) time to add everything. The second method takes O(n log(n)) time, so it is technically quicker. However, it's much more work to implement, and takes up more memory (twice as much at least, depending on how the hash table is implemented)
Re: Eliminating duplicate values
Thank you very much for replying....
I am using the approach suggested by Copeg in the first reply. I am using Array list. Advantage using this is that I get list in the order in which it was already present. This list is also almost sorted as the initial list with few values out of order.
I am sorting this list by taking first list item value and comparing it with second list item value. If first list item value is more then second then second is compared with third. The algo gos on until we had an item which is out of order. We insert that item at proper place and carries on. I dont know what this algorithm is called (If Someone can tell the name, it would be wonderful... :) ) . But good thing is that I am getting the whole thing done in less then 20 minutes for a file containing more then 4 lakh values... (Merging duplicate values and sorting resulting list).... :)
Thanks again to you all for posting wonderful replies...
Regards,
Harry__
Re: Eliminating duplicate values
Just use a LinkedHashMap if you wish to have a predictable iteration order.
See LinkedHashMap (Java Platform SE 6)
You can of course also use a TreeMap and create your own Comparator which you pass in at creation time, then the map will automagically sort itself according to your comparator whenever you add stuff to it. Cakes :)
// Json