Identifying Duplicates Effectively

Brian Element
Joined: 07/11/2012 - 19:57

I found this interesting article by Dave Coderre on identifying duplicates. I thought it might be useful to some people.

https://www.linkedin.com/pulse/identifying-duplicates-effectively-david-...

omalt
Joined: 08/15/2018 - 08:37

Interesting read. I am trying to match addresses in IDEA and have a question. I'm doing an address match and would like to cleanse certain words. My thought was to use @remove, specify the string, and then specify the words to remove, but I don't think I can do that with multiple words. I was planning to use | to separate the words, but I don't think IDEA recognises this. Do you know of any way to remove multiple words with one expression?
Example:
In the supplier address the concatenated string reads 44AwesomeRoad, and in the employee address the concatenated string is 44AwesomeRoadGreatTown. I would want to remove the "GreatTown" portion of the string so I can match for duplicates.
Hope this makes sense.

Brian Element
Joined: 07/11/2012 - 19:57

Hi Omalt,

Have you thought of trying the fuzzy match option?  Here is a video by CaseWare on one way to use it: https://www.youtube.com/watch?v=Lu3mwVqE-G4&index=21&list=PLEE1l8LoXUCLS...

It sounds like you might be able to do this with regular expressions. Steve is much better at this than I am, so hopefully he will see this thread and respond.
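
As a rough illustration of the regular-expression idea, here is a minimal sketch in Python rather than in IDEA's own equation syntax, which is not shown in this thread. It removes several words in one pass by joining them with | inside a single pattern. The word list and the function name remove_words are hypothetical; only "GreatTown" comes from the example above.

```python
import re

# Hypothetical list of words to strip; only "GreatTown" comes from the thread.
WORDS_TO_REMOVE = ["GreatTown", "NiceCity"]

# Join the words with "|" so one pattern matches any of them.
# re.escape protects special characters; sorting longest-first stops a short
# word from matching inside a longer word that starts the same way.
_PATTERN = re.compile(
    "|".join(re.escape(w) for w in sorted(WORDS_TO_REMOVE, key=len, reverse=True)),
    flags=re.IGNORECASE,
)


def remove_words(text: str) -> str:
    """Return text with every listed word removed (case-insensitive)."""
    return _PATTERN.sub("", text)


print(remove_words("44AwesomeRoadGreatTown"))  # 44AwesomeRoad
```

Because the strings are concatenated without spaces, a listed word is removed wherever it appears, including inside a longer word, so the list needs to be chosen with care.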

Brian

omalt
Joined: 08/15/2018 - 08:37

Thanks. I had used the fuzzy match, but I was finding it wasn't working for me. I'll have a look at the video to make sure I was using it right, but it was pulling out things that were clearly not close to being a match, even though I had it on 95% accuracy.
I have now used a workaround using @Left, then @simplesplit, after @JustNumbers, to pull out only the first 10 characters after the first numbers come up in the address field (they aren't always at the start of the address field). This works fairly well (it isn't perfect, but it is acceptable). Thanks, I'll keep my eye out in case Steve replies.
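
The workaround above can be sketched roughly as follows. This is a Python approximation and not the actual IDEA equation, which is not given in the thread. It assumes the key should be the first run of digits plus the 10 characters that follow it, with whitespace removed and case normalised; the function name address_key and both of those normalisation choices are assumptions made for illustration.

```python
import re


def address_key(address: str, length: int = 10) -> str:
    """Build a match key: the first run of digits plus the next `length` characters."""
    # Assumption: whitespace and letter case should not affect the match.
    cleaned = re.sub(r"\s+", "", address).upper()
    match = re.search(r"\d+", cleaned)
    if match is None:
        return ""  # no number anywhere in the address, so no key can be built
    return match.group() + cleaned[match.end():match.end() + length]


# The number does not have to be at the start of the field.
print(address_key("Flat B, 44 Awesome Road, Great Town"))  # 44AWESOMEROA
print(address_key("44AwesomeRoadGreatTown"))               # 44AWESOMEROA
```

Two addresses that produce the same key become candidates for a duplicate match. Truncating to a fixed length is what lets 44AwesomeRoad and 44AwesomeRoadGreatTown line up, and it is also why distinct addresses that share a house number and a long street-name prefix can collide.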

Steven Luciani
Joined: 07/31/2012 - 13:20

Hi Omalt,

Working with addresses is very difficult. I have had some success with the fuzzy duplicates feature of IDEA, and it's possible that the @regexpr function could help you. It will be difficult to achieve 100% accuracy. If you are able to provide a file with the addresses, or a subset of them, I will take a look and should be able to provide a better response.

Regards,

Steven

scotchy33
Joined: 09/05/2012 - 15:51

Hi Omalt,
You could try the attached script. It could also be tweaked to use the first 10 characters, or to use Levenshtein distance as the degree of accuracy.
Maybe Brian could tweak it, as I am not as capable as he is.
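
For readers unfamiliar with Levenshtein distance, here is a minimal Python sketch of the idea. It is not the attached script, and it is not a description of how IDEA's fuzzy matching works internally. The similarity function and its scaling by the longer string's length are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to use less memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if the characters are equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1.0 for identical strings, falling towards 0.0 as edits increase."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("44AwesomeRoad", "44AwesomeRoadGreatTown"))           # 9
print(round(similarity("44AwesomeRoad", "44AwesomeRoadGreatTown"), 2))  # 0.59
```

Under this scoring, the pair from the earlier example reaches only about 0.59, so a trailing town name alone is enough to push a genuine match well below a 95% threshold. Trimming or cleansing the strings first, then comparing, tends to give a distance measure more to work with.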
 
Scott
 

omalt
Joined: 08/15/2018 - 08:37

Thanks for the replies. I'm producing a working example of procurement risk modelling at the moment, so I don't want to invest too much time in the problem right now (in case it's not taken forward). I think it is a topic I'll need to revisit in the future, either in this project or another one, so I will come back to these replies and look at them in more detail then. Thanks, all.