Identifying Duplicates Effectively

Sat, 01/05/2019 - 07:30

Brian Element

Offline

Joined: 07/11/2012 - 19:57

Identifying Duplicates Effectively

I found this interesting article by Dave Coderre on identifying duplicates and thought it might be of interest to some people.

https://www.linkedin.com/pulse/identifying-duplicates-effectively-david-...

Tue, 02/12/2019 - 09:48

omalt

Offline

Joined: 08/15/2018 - 08:37

Interesting read. I am trying to match addresses in IDEA and have a question. I'm doing an address match and would like to cleanse certain words. My thought was to use @remove, specify the string, and then specify the words to remove, but I don't think I can do that with multiple words. I was planning to use | to separate the words, but I don't think IDEA recognises this. Do you know of any way to remove multiple words with one expression?
Example:
In the supplier address the concatenated string reads 44AwesomeRoad, and in the employee address the concatenated string is 44AwesomeRoadGreatTown. I would want to remove the "GreatTown" portion of the string so I can match for duplicates.
Hope this makes sense.
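
The idea of removing several words with one expression, using | as the separator, can be sketched outside IDEA with a regular expression. The example below is Python, not IDEA equation syntax, and the word list and function name are hypothetical; it only illustrates the alternation technique on the two strings from the post.

```python
import re

# Hypothetical noise words to strip before matching; replace with your own list.
NOISE_WORDS = ["GreatTown", "SmallVille"]

# Join the escaped words with "|" so a single pattern matches any of them.
# Longer words go first so a short word never pre-empts a longer one it prefixes.
NOISE_PATTERN = re.compile(
    "|".join(re.escape(w) for w in sorted(NOISE_WORDS, key=len, reverse=True)),
    flags=re.IGNORECASE,
)


def strip_noise(address: str) -> str:
    """Return the address with every listed noise word removed."""
    return NOISE_PATTERN.sub("", address)


supplier = strip_noise("44AwesomeRoad")
employee = strip_noise("44AwesomeRoadGreatTown")
print(supplier, employee, supplier == employee)
```

One caveat with this approach: a listed word is removed wherever it occurs, so a town name that also appears inside a street name would be stripped from the street as well.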

Tue, 02/12/2019 - 10:08

(Reply to #2) #3

Brian Element

Offline

Joined: 07/11/2012 - 19:57

Hi Omalt,

Have you thought of trying the fuzzy match option? Here is a video by CaseWare on one way to use it: https://www.youtube.com/watch?v=Lu3mwVqE-G4&index=21&list=PLEE1l8LoXUCLS...

It sounds like you might also be able to do this with regular expressions. Steve is much better at this than I am, so hopefully he will see this thread and respond.

Brian

Tue, 02/12/2019 - 10:25

omalt

Offline

Joined: 08/15/2018 - 08:37

Thanks. I had used the fuzzy match, but I was finding it wasn't working for me. I'll have a look at the video to make sure I was using it right, but it was pulling out things that were clearly not close to being a match, even though I had it set to 95% accuracy.
I have now used a workaround: @Left, then @simplesplit after @JustNumbers, to pull out only the first 10 characters after the first numbers appear in the address field (they aren't always at the start of the address field). This works fairly well (it isn't perfect, but it is acceptable). Thanks, I'll keep an eye out in case Steve replies.
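
The workaround described above can be sketched in Python as a rough illustration. It does not reproduce the IDEA equation, and it rests on one reading of the post, labelled here as an assumption: the key is the first run of digits plus the 10 characters that follow it. The function name and constant are hypothetical.

```python
import re

KEY_LENGTH = 10  # characters kept after the first number, as in the post


def match_key(address: str, length: int = KEY_LENGTH) -> str:
    """Build a match key: first run of digits plus the next `length` characters.

    Assumption: this is one interpretation of the workaround in the post.
    Returns an empty string when the address contains no digits.
    """
    found = re.search(r"\d+", address)
    if found is None:
        return ""
    return address[found.start(): found.end() + length]


# The digits need not be at the start of the field.
print(match_key("44AwesomeRoad"))
print(match_key("44AwesomeRoadGreatTown"))
print(match_key("FlatB44AwesomeRoad"))
```

Two addresses are then treated as candidate duplicates when their keys are equal. As the post notes, this is an approximation: different streets that share a house number and the same first 10 characters would also collide.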

Wed, 02/13/2019 - 08:03

Steven Luciani

Offline

Joined: 07/31/2012 - 13:20

Hi Omalt,

Working with addresses is very difficult. I have had some success with the fuzzy duplicates feature of IDEA, and it's possible that the @regexpr function could help you. It will be difficult to achieve 100% accuracy. If you are able to provide a file with the addresses, or a subset of the addresses, I will take a look at them and should be able to provide a better response.

Regards,

Steven

Wed, 02/13/2019 - 14:05

scotchy33

Offline

Joined: 09/05/2012 - 15:51

Hi Omalt,
You could try the attached script. It could also be tweaked to match on the first 10 characters, or to use Levenshtein distance as the measure of how close two addresses are.
Maybe Brian could tweak it, as I am not as capable as he is.

Scott

Files:

ISINI Join v1.iss
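
For readers unfamiliar with the Levenshtein distance mentioned above, here is a minimal Python sketch of the measure. The contents of the attached script are not shown in this thread, so this is not that script's implementation; the function names and the similarity formula (1 minus distance divided by the longer length) are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("44AwesomeRoad", "44AwesomeRoadGreatTown"))
print(round(similarity("44AwesomeRoad", "44AwesomeRoadGreatTown"), 2))
```

On the example from earlier in the thread, the trailing town name alone accounts for nine edits, so a high similarity threshold would miss that pair unless the addresses are cleansed or truncated first.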

Thu, 02/14/2019 - 06:00

omalt

Offline

Joined: 08/15/2018 - 08:37

Thanks for the replies. I'm producing a working example of procurement risk modelling at the moment, so I don't want to invest too much time in the problem right now (in case it's not taken forward). I think it is a topic I'll need to revisit in the future, either in this project or another one, so I will come back to these replies and look at them in more detail then. Thanks all.
