Welcome to the Java Programming Forums




Thread: Eliminating duplicate values

  1. #1
    Junior Member
    Join Date
    Nov 2009
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

Eliminating duplicate values

    Hi,

I have a CSV file containing links to a site and the number of times each was hit. It contains lakhs (hundreds of thousands) of such entries. Some links are duplicated, i.e. an entry like the following is present in the file multiple times:

    http://www.xyz.com/link1, 15
    http://www.xyz.com/link1, 13
    http://www.xyz.com/link1, 17
    http://www.xyz.com/link1, 18
    http://www.xyz.com/link1, 11

    Now I want to combine these link into a single entry in file like:

    http://www.xyz.com/link1, 74

I have a program which reads lines from the first file and writes them into a second file. Whenever a duplicate comes up, it does not enter it into the second file but just adds its hit count to the existing entry's number-of-hits field.

In the end, the second file contains the unique links and the number of times each was hit.

But since I have to read each link from the first file and compare it with every link in the second file in a loop, this is taking a lot of time (days).


Can anyone suggest a better solution to this? Someone who has worked with regex might be able to help.

If I have posted this in the wrong place on the site, please send me the link to where I should post this topic.

    Thanks in advance.


  2. #2
    Administrator copeg's Avatar
    Join Date
    Oct 2009
    Location
    US
    Posts
    5,320
    Thanks
    181
    Thanked 833 Times in 772 Posts
    Blog Entries
    5

Re: Eliminating duplicate values

I've done a similar sort of parsing quite a few times. A very simplistic way to do it is outlined below. It essentially builds a list and adds objects of type Website to it. Website overrides the equals and hashCode methods (allowing you to check for identical entries in the list); each line is either added as a unique entry or merged into a previous one. An alternative, given the simple data you have, is to use a HashMap rather than a list, with the URL as the key and an Integer object as the value.

This works for smaller files, but a major problem with doing it this way is memory: for really large files this method requires reading everything into memory (yuck). You can of course increase the memory of the JVM, but for really large files this code may still not be optimal.

An alternative that overcomes the file-size problem is to write out to several files (either dynamically or in chunks) based upon some type of sorting (say, alphabetized by URL), then merge them at the end. Checking the URLs becomes much faster because you search several smaller files rather than one large output file, and you may avoid the memory problems because you don't have to read everything into an array; you can check entries dynamically.

    import java.util.ArrayList;
    import java.io.*;
     
    public class Parser{
     
    	public static void main(String[] args) throws IOException{
    		//do all the file opening here, e.g.:
    		BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    		ArrayList<Website> list = new ArrayList<Website>();
    		String line = null;
    		while ( ( line = reader.readLine() ) != null ){
    			String splittee[] = line.split(",");
     
    			Website site = new Website(splittee[0].trim(), Integer.parseInt(splittee[1].trim()) );
    			if ( !list.contains(site) ){
    				list.add(site);
    			}else{
    				int index = list.indexOf(site);
    				list.get(index).merge(site);
    			}
    		}
    		//do all the file closing and cleanup
    		reader.close();
    		//write out to the output file
    	}
    	public static class Website{
    		private String url;
    		private int count;
    		public Website( String u, int c ){
    			super();
    			this.url = u;
    			this.count = c;
    		}
    		@Override public boolean equals(Object o){
    			//compare by URL so duplicates of the same link match
    			if ( !(o instanceof Website) ){
    				return false;
    			}
    			return url.equals(((Website)o).url);
    		}
    		@Override public int hashCode(){
    			return url.hashCode();
    		}
    		public void merge(Website s){
    			count += s.count;
    		}
    	}
    }
    Last edited by copeg; November 6th, 2009 at 03:14 PM.


  4. #3
    Junior Member
    Join Date
    Nov 2009
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

Re: Eliminating duplicate values

    Thanks a ton for the reply Copeg. I will try this code tonight and will let you know how things worked out.

    Cheers

  5. #4
    Super Moderator Json's Avatar
    Join Date
    Jul 2009
    Location
    Warrington, United Kingdom
    Posts
    1,274
    My Mood
    Happy
    Thanks
    70
    Thanked 156 Times in 152 Posts

Re: Eliminating duplicate values

If I were you I'd just keep it in a Map<String, Integer>; that would sort out your issues automatically.

The key in the map would be the URL and the value would be the number of hits.

After you've read a line, check the map to see if the key already exists; if it does, just increment its value, otherwise create the key and insert its value.
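In code, the loop described above might look like this (a minimal sketch; the split-on-comma parsing is assumed from the CSV format shown in the first post):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HitCounter {

    // Merge "url, hits" lines into a single count per URL.
    public static Map<String, Integer> countHits(List<String> lines) {
        Map<String, Integer> hits = new HashMap<String, Integer>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String url = parts[0].trim();
            int count = Integer.parseInt(parts[1].trim());
            if (hits.containsKey(url)) {
                hits.put(url, hits.get(url) + count); // key exists: add to its value
            } else {
                hits.put(url, count);                 // new key: insert it
            }
        }
        return hits;
    }
}
```

Each lookup is O(1) on average, so the whole file is processed in a single linear pass instead of rescanning the output file for every line.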

    // Json

  6. #5
    Junior Member
    Join Date
    Nov 2009
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

Re: Eliminating duplicate values

Thanks Json for the reply.

Yeah, I tried using HashMap and TreeMap, but the problem with them is that the resulting map stores the values in an order which does not match the original order of input. (My links file is sorted according to number of hits.)

I want that order to be maintained, and a Map is not solving the purpose. Or can you tell me a way to get the entries out of the Map sorted according to their values?

  7. #6
    Super Moderator helloworld922's Avatar
    Join Date
    Jun 2009
    Posts
    2,896
    Thanks
    23
    Thanked 619 Times in 561 Posts
    Blog Entries
    18

Re: Eliminating duplicate values

Unfortunately, hash-based sets and maps don't keep entries in any particular order.

The easiest solution to this dilemma is to use a list (probably an ArrayList) to store everything you read in, but on read-in, run through all the elements already added to see if the element you're adding is a duplicate of any of them, and handle that appropriately. Then, when you want to write out, all your elements are in a nice, neat order.

    Another solution is to put everything into the hash map as Json said, with the addition of an integer that keeps track of what order they were added. Then, once everything has been added, take everything out of the hash map and sort it via your favorite sorting algorithm based on the integer you added that denotes what order they were added in.
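A sketch of that second approach (the class and method names here are made up for illustration): each entry remembers the position at which its URL first appeared, and a final sort on that index restores the original order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrderedMerge {

    // Holds the hit count plus the index at which the URL first appeared.
    static class Entry {
        final String url;
        final int firstSeen;
        int count;
        Entry(String url, int firstSeen, int count) {
            this.url = url;
            this.firstSeen = firstSeen;
            this.count = count;
        }
    }

    // Merge duplicates in O(n) with a HashMap, then sort entries by
    // first-seen index (O(n log n)) to restore the input order.
    public static List<String> merge(List<String> lines) {
        Map<String, Entry> byUrl = new HashMap<String, Entry>();
        int position = 0;
        for (String line : lines) {
            String[] parts = line.split(",");
            String url = parts[0].trim();
            int hits = Integer.parseInt(parts[1].trim());
            Entry e = byUrl.get(url);
            if (e == null) {
                byUrl.put(url, new Entry(url, position++, hits));
            } else {
                e.count += hits;
            }
        }
        List<Entry> entries = new ArrayList<Entry>(byUrl.values());
        Collections.sort(entries, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) {
                return a.firstSeen - b.firstSeen;
            }
        });
        List<String> out = new ArrayList<String>();
        for (Entry e : entries) {
            out.add(e.url + ", " + e.count);
        }
        return out;
    }
}
```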

The first solution takes O(n^2) time to add everything. The second method takes O(n log(n)) time, so it is technically quicker. However, it's much more work to implement, and takes up more memory (at least twice as much, depending on how the hash table is implemented).
    Last edited by helloworld922; November 8th, 2009 at 03:46 PM.

  8. #7
    Junior Member
    Join Date
    Nov 2009
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

Re: Eliminating duplicate values

    Thank you very much for replying....

I am using the approach suggested by Copeg in the first reply, with an ArrayList. The advantage of this is that I get the list in the order in which the entries were already present. The resulting list is also almost sorted like the initial list, with a few values out of order.

I am sorting this list by taking the first list item's value and comparing it with the second's. If the first item's value is more than the second's, the second is compared with the third. The algorithm goes on until we hit an item which is out of order; we insert that item at its proper place and carry on. I don't know what this algorithm is called (if someone can tell me the name, that would be wonderful). But the good thing is that I am getting the whole thing done in less than 20 minutes for a file containing more than 4 lakh values (merging duplicate values and sorting the resulting list).
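For what it's worth, a scan-until-out-of-order-then-insert pass like the one described is essentially insertion sort, which is fast on almost-sorted input. A minimal sketch (sorting hit counts in descending order, to match a most-hits-first file; the class name is made up):

```java
import java.util.List;

public class NearlySorted {

    // In-place insertion sort, descending. Efficient when the list is
    // already almost in order, as described above.
    public static void insertionSortDescending(List<Integer> counts) {
        for (int i = 1; i < counts.size(); i++) {
            int value = counts.get(i);
            int j = i - 1;
            // shift smaller values right until value's proper place is found
            while (j >= 0 && counts.get(j) < value) {
                counts.set(j + 1, counts.get(j));
                j--;
            }
            counts.set(j + 1, value);
        }
    }
}
```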

    Thanks again to you all for posting wonderful replies...

    Regards,
    Harry__
    Last edited by Harry_; November 8th, 2009 at 08:31 PM.

  9. #8
    Super Moderator Json's Avatar
    Join Date
    Jul 2009
    Location
    Warrington, United Kingdom
    Posts
    1,274
    My Mood
    Happy
    Thanks
    70
    Thanked 156 Times in 152 Posts

Re: Eliminating duplicate values

    Just use a LinkedHashMap if you wish to have a predictable iteration order.

    See LinkedHashMap (Java Platform SE 6)

    You can of course also use a TreeMap and create your own Comparator which you pass in at creation time, then the map will automagically sort itself according to your comparator whenever you add stuff to it. Cakes
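A quick illustration of the LinkedHashMap variant (the data and method name are made up): iteration follows insertion order, so merged entries come back out in the order their URLs first appeared.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashMapDemo {

    public static Map<String, Integer> mergeInOrder(String[][] rows) {
        // LinkedHashMap iterates in insertion order, unlike HashMap.
        Map<String, Integer> hits = new LinkedHashMap<String, Integer>();
        for (String[] row : rows) {
            String url = row[0];
            int count = Integer.parseInt(row[1]);
            Integer existing = hits.get(url);
            hits.put(url, existing == null ? count : existing + count);
        }
        return hits;
    }
}
```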

    // Json
