SQL | Using the SQL ‘Random’ function to scramble sensitive data
A. What is Data Scrambling?
Data scrambling is one of those requirements that only comes up occasionally, but when it does come up, it is considered a ‘big deal’. The system owners have requested that the more confidential data in the system should not be visible in non-Production environments. Therefore, a request is made to ‘scramble’ the data of any key entities, such as employees, students or customers. The scramble routine needs to update each row with a randomly selected piece of data from another row. As a result, a developer or a systems tester can no longer identify real people from the scrambled data.
The solution below will perform a data scramble of employee last names. The routine has been written in such a way that other key fields in the table will not matter. In this case, the PS_NAMES table is keyed on ‘Employee ID’, ‘Name Type’ and ‘Effective Date’. However for the script, only the one primary key matters. The script ensures that a single EMPLID will end up with the one name across all name types and effective dates.
Although we are focusing on last name only, the same scrambling routine could be used for any confidential information in the system, such as tax numbers, national identity numbers and birthdates.
B. The SQL ‘Random’ Function
By taking advantage of the Oracle SQL ‘random’ function, data scrambling can be a relatively straightforward task. Depending on the size of your database, the entire process need not take any longer than ten minutes per field.
The main driver of the scrambling routine is the Oracle random value function: dbms_random.value(). This returns a random number between 0 and 1 with an incredible 38 decimal places. This can be easily tested with the following statement:
select dbms_random.value() from dual;
This SQL produces a result as follows:
To use this number to select a random row of data from a table, you can multiple the random number with the count of records in the table:
select round(dbms_random.value() * (select count(*) from ps_names)) from dual;
Note the use of the ‘round’ function to provide an integer value:
Finally, we will take advantage of the Oracle ‘rownum’ construct. When selecting data from a table, ‘rownum’ gives you the row number of the current row as part of the results data-set. This provides a quick way to generate a unique ID for each row of data.
C. The Data Scrambling Script
Tying all the above together, we can now create a SQL script which will set a field to a random value across the entire table. This examples scrambles the ‘LAST_NAME’ field in the ‘PS_NAMES’ table:
-- create temp table to hold all existing Last Name values create table temp_scramble as select rownum as row_seq, last_name from ps_names; -- add index on row sequence number create index temp_scramble1 on temp_scramble(row_seq); -- create temp table to hold all existing EMPLID values create table temp_emplid (emplid varchar(11), random_num integer, last_name varchar(30)); -- add index on EMPLID create index temp_emplid1 on temp_emplid(emplid); -- insert all current EMPLIDs into the temp table insert into temp_emplid select distinct emplid, 0, ' ' from ps_names; -- update the temp table with a random number update temp_emplid set random_num = round(dbms_random.value() * (select count(*) from temp_scramble)); -- update the Last Name with the matching record update temp_emplid a set last_name = (select last_name from temp_scramble where row_seq = a.random_num); -- update the real table with the randomly-selected Last Name update ps_names a set last_name = (select last_name from temp_emplid t where a.emplid = t.emplid); -- finally drop the temp tables drop table temp_scramble; drop table temp_emplid;
I went ahead and ran this script on a database with approximately 165,000 records in the PS_NAMES table. Execution time for the entire script was about 15 seconds.
See Also: