Reverse Duplicate

Brian Element

Hi Sri,

How about trying something like this.

First, make sure you remove all the exact duplicates. Start by adding a virtual field called REC_NO (the record number) using the @precno function; this way, when we are doing the analysis, we can trace each record back to the original file.

Then perform Duplicate Key Detection using the ID and ID1 fields as the key.

In your example it will show that records 15, 17, 18 and 20 are duplicates and should be removed from the file.

You can remove them by doing a join with the main file as the primary and the duplicates file as the secondary. The match is on the record number, and you use the "Records with no secondary match" option.

So now we have a file with no duplicates.
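For anyone following the logic outside IDEA, this first stage amounts to something like the rough Python sketch below (just an illustration of the IDEA steps, not the actual tool; the CSV file name and field names are placeholders, not from your attachment):

```python
import csv

# Load the data; "transactions.csv" is a placeholder file name.
with open("transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Equivalent of the @precno virtual field: tag each record with its
# record number so results can be traced back to the original file.
for recno, row in enumerate(rows, start=1):
    row["REC_NO"] = recno

# Duplicate Key Detection on (ID, ID1): later occurrences of a key
# already seen are the duplicate record numbers.
seen, dup_recnos = set(), set()
for row in rows:
    key = (row["ID"], row["ID1"])
    if key in seen:
        dup_recnos.add(row["REC_NO"])
    else:
        seen.add(key)

# Join on REC_NO with "Records with no secondary match": keep only
# the records whose number is not in the duplicates file.
deduped = [row for row in rows if row["REC_NO"] not in dup_recnos]
```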

Now comes the fun part. In the end I decided to create a virtual field, ID_ADD, adding the ID and ID1 fields together.

This gives a key that will be the same whenever one record's ID + ID1 equals another record's ID + ID1, regardless of the order of the two IDs.

I then did a summarization by ID_ADD; in the Fields section I selected REC_NO to be included and used the "Use fields from last occurrence" option. This gives me the record number of the second transaction whenever there is a match.

So the results show the REC_NO values we must remove from the main file.

In this case they are all duplicates; if there were any non-duplicates, we would have to do one additional step, extracting NO_OF_RECS = 2. We then join this file back to the main file using REC_NO as the match key and "Records with no secondary match" as the match option, and hopefully you will now have a file with no duplicates.
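Continuing the same rough sketch, this summarization stage would look something like the following (again only an illustration of the IDEA steps, with `deduped` being the output of the first stage above):

```python
# Virtual field ID_ADD: the two IDs added together. A reverse pair
# (A, B) / (B, A) gives the same sum -- though, as discussed further
# down the thread, two unrelated pairs can collide on the same total.
groups = {}
for row in deduped:
    id_add = int(row["ID"]) + int(row["ID1"])
    groups.setdefault(id_add, []).append(row["REC_NO"])

# Summarization by ID_ADD with "Use fields from last occurrence",
# extracting NO_OF_RECS = 2: the last REC_NO in each two-record
# group is the second transaction of the pair.
to_remove = {recnos[-1] for recnos in groups.values() if len(recnos) == 2}

# Join back to the main file on REC_NO, "Records with no secondary match".
final = [row for row in deduped if row["REC_NO"] not in to_remove]
```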

Let me know how this works out.

Brian

srivibish

Hi Brian,
I thought about this, but I am afraid that two different pairs of numbers might, by chance, add up to the same total. Considering the size of the database, we can't trace those back either.
Sri

Brian Element

I am thinking this is the way to go, but I understand the possible problem: the sums could be duplicated. So why not use an algorithm to change the numbers into something more likely to be unique? My suggestion would be to take the first ID, multiply each digit of that ID by another number, and concatenate the results together. In this example I used the ASCII code of the digit as the multiplier. So if you have the following:

ID 21771 ID1 21318

would be

(2 * 50) & (1 * 49) & (7 * 55) & (7 * 55) & (1 * 49) =

100 & 49 & 385 & 385 & 49 =

So your new ID would be 1004938538549; you do the same for ID1 and then add the two converted values together, so the combined ID would really be unique. Do you think that would work? If not, I think it comes down to creating a script that takes the first row, looks through the rest of the file for a match, saves the matching record number to another file, then does the second row, and so on. You then use that file in a join to remove all the duplicates.
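To make the arithmetic concrete, here is a small Python sketch of the idea (just an illustration; the function name is made up):

```python
def convert_id(value):
    """Multiply each digit by the ASCII code of that digit and
    concatenate the products, e.g. 21771 -> 1004938538549
    (2*50, 1*49, 7*55, 7*55, 1*49 joined together)."""
    return int("".join(str(int(d) * ord(d)) for d in str(value)))

# The worked example from above: convert both IDs, then add them.
combined = convert_id(21771) + convert_id(21318)
```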

srivibish

I think this would be a good way to overcome the same-total issue. But could you please tell me how to convert and multiply each digit of the ID by its ASCII code? Is there any function I could use for this?

Brian Element

Hi Sri,

I started to do this in the Equation Editor, but it would have been pretty messy, so I created a custom function for you. I have attached it, so try it out. It is called SRI_Convert.

Brian

srivibish

Hi Brian,
Thanks for the custom function. Unfortunately, after adding the converted IDs, I could see a lot of pairs ending up with the same total. I doubt we have any other options open to solve this problem.
Regards,
Sri

Brian Element

I think the final option is to create a script.  When I have a bit of time I will put something together.  Any chance you can give me a larger file to work with?

Thanks

Brian

srivibish

Sure. One thing: apart from the fields given in the attachment, there will be some other fields in my database. Will that affect the script by any chance?
Regards,
Sri

Brian Element

Thanks for the additional data.  No, I will copy over all the information from the records that are kept.

Brian Element

Hi Sri,

OK, you'll want to try this script out. What it does is display a dialog in which you select your file and the two ID columns. The script then adds a recno field to the main file and goes through the file, extracting the record numbers of duplicates to a new file. Once this is completed, it joins the main file with the duplicate record number file and removes all those record numbers.

Some limitations: I did this script fairly quickly, so there is not much in the way of error checking. It also assumes right now that the ID columns are numeric; if they are not, that can easily be fixed, but as it stands it will give an error. Also, depending on how large a file you have and the power of your computer, it might take some time to process.
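The attached script itself is IDEAScript; as a rough Python sketch of the described pass (one way to implement it, not the attached code, and using an order-insensitive key rather than a row-by-row search so the same-total problem cannot recur):

```python
import csv

def reverse_duplicate_recnos(rows, id_field="ID", id1_field="ID1"):
    """Return the record numbers of rows whose ID pair has already
    been seen in either order; the first occurrence is kept."""
    seen, to_remove = set(), []
    for recno, row in enumerate(rows, start=1):  # the added recno field
        # Sorting the pair makes (A, B) and (B, A) the same key,
        # with none of the collisions a simple ID + ID1 sum allows.
        key = tuple(sorted((row[id_field], row[id1_field])))
        if key in seen:
            to_remove.append(recno)
        else:
            seen.add(key)
    return to_remove

with open("transactions.csv", newline="") as f:  # placeholder file name
    rows = list(csv.DictReader(f))

dups = set(reverse_duplicate_recnos(rows))
# All other fields are carried over on the records that are kept.
kept = [row for i, row in enumerate(rows, start=1) if i not in dups]
print(f"Removed {len(dups)} reverse duplicates; kept {len(kept)} records")
```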

So let me know how it works for you.

Brian
