Identify Blank / Repeat Fields

Background: 

This script will identify any blank fields or fields in which the information repeats.  The scrpt will rename the fields based on a prefix that the user gives.  It also write a text file to the working directory of the fields which should be kept.  The script will then create a new IDEA file which will list the fields that contain empty or repeating data and allows the user to decide if they wish to keep or delete the field.  The user would then run the script a second time in order to delete the choosen fields.

The script was originally created by Lee Cunningham of the Canada Revenue Agency and was modified by Steve Luciani and Paul Dawson, also of CRA.

I updated the script today as someone found that it will give an error if the database has a criteria in place, as I am not sure if there is away to remove a criteria I add a warning on one of the menus.

Documentation: 

Script: IdentifyBlank_RepeatFields.iss

Developed: Originally by Lee Cunningham, ECAS, Ottawa TSO, 613-688-734

Modified by Paul Dawson, Windsor TSO, May 25, 2007 and  Steve Luciani, London TSO, 519-675-3057 March 5, 2009

Script rewritten by Brian Element (brian.element@ideascripting.com – part of the GEL group) May 14, 2012

 

This IDEA script scans all the fields in an IDEA file. If a field contains two or more different values, the script adds the field name to a text file called KeepTheseFields.txt.  Fields that did not make it to the text file are unwanted fields that are either a blank field or a field with a repeat value. The script also earmarks unwanted fields by prefixing their field names with a user specified string. An additional file is created that will show the field that are either empty or repeats along with the content and a boolean field in which the auditor can select which fields to delete from the main file. 

When starting the script you will have the following options:

 

The first option allows you to select a file in which to look for empty or repeat fields.  The second option allows you to delete the fields from the file assuming that you have already run the first part.

By selecting the first option you will have the following dialog:

 

This dialog allows you to select the file you wish to perform the analysis on plus select the prefix for the fields that have been identified as empty or repeats.  The prefix must begin with a letter, any other selections will give you an error.  The script will identify all numeric, character, date and time fields that are empty or repeats.  The script uses the field statistics for the numeric, date and time fields so it will force a recompute of the field statistics prior to performing the analysis.  A separate file will be created which will show the name of the field that is empty or a repeat, the field contents if the field is a repeat, if not it will be blacnk and a third field that allows you to select the fields you wish to delete.

 

A history is also written to the selected file showing the fields that are empty or repeats.

Once the following information has been generated you can rerun the script and select the Delete Fields option, this will use the file that is generated above to delete the fields through the script.

You must select two files, the first file is the file that has the fields to be deleted, the second file is the file that was created in step 1 and show the fields to delete.  The script will go through the file and delete all selected fields from the ***-del fields file.

 

If the script completes properly you will get a message saying that the script ran properly.  The script also uses progress bars so you will be able to tell approximately where you are in the script, the progress bar is an estimate as some type of fields take longer to analysis then others.

Tags:

Comments

I have a file with 40 million records and 210 fields, how long will the program take to run?Thanks
 
Adam

Brian Element's picture

Hi Adam,

I have no idea how long it will take but it will mostly likely take a fairly long time depending on the power of your computer.  Scripts always run slower then if it was a native IDEA function because of the limitations of the macro language.  The script does use a percentage complete bar that should help you tell how long it will take, but you might want to run this overnight or on a weekend for this large of a file.

Brian