Data scrambling is the process of obfuscating or removing sensitive data such as email addresses or passwords. The process is irreversible, so the original data cannot be derived from the scrambled data. Data scrambling can be used only during the cloning process. I think there are two key points in data scrambling.

  1. The original data cannot be read
  2. The characteristics of the original data are not lost

It is difficult to achieve point 2 without compromising point 1. Point 2 is especially important for OLAP tools. Since OLAP tools often extract statistics of the original data, such as sums and averages, we don’t want to change this type of information too much. For example, assume we have the table below.

| name | email | age |
| --- | --- | --- |
| Kai Sasaki | [email protected] | 27 |
| Takeshi Goda | [email protected] | 12 |
| Suneo Honekawa | [email protected] | 12 |
| Nobita Nobi | [email protected] | 12 |
| Doraemon | [email protected] | 88 |
| Doraemon | [email protected] | 88 |
| Shizuka Minamoto | [email protected] | 12 |

How can we scramble this data? A good example, I think, is the following.

| col1 | col2 | col3 |
| --- | --- | --- |
| aa | [email protected] | 1 |
| ab | [email protected] | 2 |
| ac | [email protected] | 2 |
| ad | [email protected] | 2 |
| ae | [email protected] | 3 |
| ae | [email protected] | 3 |
| af | [email protected] | 2 |

For example, it keeps the same cardinality as the original data (but it does not keep the average or other statistics). How can we achieve this type of good data scrambling? I’m now investigating the algorithms and tools that could be used.
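
As a starting point, here is a minimal sketch of one way to get the cardinality-preserving result shown above. It is not a finished algorithm, just an illustration: each distinct value in a column is replaced by a fresh token in order of first appearance, and the mapping is thrown away afterwards so it cannot be used to reverse the scrambling. The function names `scramble_column`, `letter_token`, and `int_token` are my own hypothetical helpers, not part of any existing tool.

```python
def scramble_column(values, make_token):
    """Replace each distinct value with a token, in order of first appearance.

    Equal inputs get equal tokens, so the number of distinct values
    (the cardinality) is preserved. The mapping lives only inside this
    function and is discarded on return.
    """
    mapping = {}
    scrambled = []
    for value in values:
        if value not in mapping:
            mapping[value] = make_token(len(mapping))
        scrambled.append(mapping[value])
    return scrambled


def letter_token(index):
    """Hypothetical token format: 0 -> 'aa', 1 -> 'ab', ..., 26 -> 'ba'."""
    letters = ""
    while True:
        letters = chr(ord("a") + index % 26) + letters
        index //= 26
        if index == 0:
            break
    return letters.rjust(2, "a")


def int_token(index):
    """Hypothetical token format for numeric columns: 0 -> 1, 1 -> 2, ..."""
    return index + 1


names = ["Kai Sasaki", "Takeshi Goda", "Suneo Honekawa", "Nobita Nobi",
         "Doraemon", "Doraemon", "Shizuka Minamoto"]
ages = [27, 12, 12, 12, 88, 88, 12]

scrambled_names = scramble_column(names, letter_token)
scrambled_ages = scramble_column(ages, int_token)

print(scrambled_names)  # ['aa', 'ab', 'ac', 'ad', 'ae', 'ae', 'af']
print(scrambled_ages)   # [1, 2, 2, 2, 3, 3, 2]
```

This reproduces the name and age columns of the scrambled table, and it has the same limitation: sums and averages of the age column are lost. It also leaks the order in which values first appear, and for a column with few distinct values the frequencies alone may be enough to guess the originals, so I would treat it only as a baseline to compare other approaches against.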