Preventing Data Leakage

Data leakage is a potential risk to an agency or entity from the exposure and unauthorized disclosure of FTI data. Data leakage is the transmission or exposure of data and information to an unauthorized or unintended recipient. The recipient may be internal to the agency, a known entity external to the agency, or an unknown entity external to the agency. The data leakage may not have been made intentionally or with any malicious intent. Regardless of who has FTI access (employees, contractors, or vendor partners), or the form of the FTI data (digital media or printed format), protecting FTI data is important to the agency.

Data leakage is becoming more common throughout industry and government, leading to the development of software and procedural techniques to detect and prevent such occurrences. Research and guidance on data leakage benefit agencies in the following ways:

  1. Agencies can proactively detect and prevent the leakage of federal tax information (FTI) data, especially from sources that deviate from the authorized data flow model, and
  2. The IRS Office of Safeguards can incorporate data leakage techniques in the Safeguards Review methodology, resulting in more effective detection and prevention of unauthorized disclosure of FTI.

IRS Publication 1075, Tax Information Security Guidelines for Federal, State, and Local Agencies (Pub 1075) states:

“To foster a tax system based on voluntary compliance, the public must maintain a high degree of confidence that the personal and financial information furnished to the Internal Revenue Service (IRS) is protected against unauthorized use, inspection or disclosure.

The IRS must administer the disclosure provisions of the Internal Revenue Code (IRC) according to the spirit and intent of these laws, ever mindful of the public trust. The IRC defines and protects the confidential relationship between the taxpayer and the IRS and makes it a crime to violate this confidence. The concerns of citizens and Congress regarding individual rights to privacy require the IRS to continuously assess disclosure practices and the safeguards used to protect the confidential information entrusted. While the sanctions of the IRC are designed to protect the privacy of taxpayers, the IRS recognizes the importance of cooperating to the fullest extent permitted by law with other federal, state and local authorities in their administration and enforcement of laws.”

Data leakage of FTI data from an agency puts confidence in the IRS at risk. Security industry analysis claims that more than 90% of data leakages are unintentionally caused by employees. As a part of the Safeguards program, the recipients of FTI data should review their data handling procedures and implement the necessary processes to secure restricted information and eliminate potential data leakages.

Agencies that receive FTI data must have adequate protection programs in place. The transmission or exposure of restricted data and information to an unintended or unauthorized recipient, even if unintentional, may result in the unauthorized disclosure of FTI data.

Data leakage is often confused with data loss. Data loss implies that the data no longer exists or is corrupted beyond use. When data leakage occurs, the data still exists and is unaffected.

Data leakage can come from many sources, both in printed format and digital media.


| Format | Description | Example |
|---|---|---|
| Printed | Includes information used in documents such as case files that are left unattended, misplaced or provided to unauthorized recipients. Printed format may also include data that is used improperly in a document. | A common example of printed format data leakage is when documents are left on shared printers. |
| Digital Media | Can be categorized into three forms: data in motion, data at rest and data at the endpoints. All three forms of digital media are vulnerable to data leakage. | |
| Data in Motion | Refers to data that is moving through a network, including wireless transmission. Because data in motion includes information in e-mail traffic, application traffic and peer-to-peer sessions, data often goes unmonitored as it leaves the agency’s network. FTI data may be copied into an e-mail or sent in an attachment outside the agency to an individual or an entire mail-list. | Wireless transmission: e-mail traffic, application traffic and peer-to-peer sessions. |
| Data at Rest | Refers to data that resides in databases, file systems, and other structured storage methods (e.g., Oracle and SQL databases, and application data files). Because restricted data can be copied into unrestricted storage areas, the data may subsequently be used unknowingly by employees. | Oracle and SQL databases, and application data files. |
| Data at the Endpoints | Refers to endpoints of a network where the data is being used. Since this is where most FTI data is accessible, this type of digital media form warrants the greatest concern for potential data leakage. Employees or contractors could copy restricted data onto a mobile media device, physically remove it from the agency’s facility or print and distribute it without knowledge of the violation. Similarly, FTI data that is on a PC could be copied to back-up storage located at a contractor’s off-site facility as a component of an agency’s automated back-up procedures. | Desktop hard drives, flash drives and other mobile media. |


Whether data leakage is intentional or unintentional, whether it is caused by
employees, contractors, or vendor partners, and whether it involves digital media or
printed format, data exposure can have a significant effect on government agencies.

Note: Agencies must implement the necessary controls to prevent FTI data leakage from occurring throughout their environment.

Most security controls are in place to protect an agency’s data from intruders,
both in the physical and digital environments. The National Institute of
Standards and Technology (NIST) Recommended Security Controls for Federal
Information Systems, NIST SP 800-53, Revision 4; the Confidentiality and Disclosure of
Returns and Return Information, IRC 6103(p); and Pub 1075, Tax Information
Security Guidelines for Federal, State and Local Agencies, state what protections need
to be in place.

These documents address how to secure planned data transfers, such as back-up
tapes to contingency sites, and provide guidance for handling and storing data in printed
format to prevent its exposure. However, additional controls are required to prevent
data leakage in digital media.

Each digital media form has a viable method for detecting data leakage. Most of these
methods are based on the inspection of the data content within a document as opposed
to the context of the data. This is analogous to opening an envelope and inspecting the
content of a letter for restricted data. This process is called file cracking.

File cracking retrieves content that may be many layers deep, such as an embedded
data table within a Word document that has been compressed or zipped. Data that is
encrypted can be retrieved with enterprise recovery keys. For content that cannot be
decrypted and inspected, agencies should establish and enforce rules that can block
or quarantine the data for further review.
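
The layered inspection described above can be illustrated with a minimal Python sketch. It assumes the only container format is ZIP (which also covers Word .docx files) and uses only the standard library. The function name `crack`, the `max_depth` limit, and the choice to raise `ValueError` for content that cannot be inspected are illustrative assumptions, not part of any Safeguards requirement.

```python
import io
import zipfile


def crack(name, data, depth=0, max_depth=5):
    """Yield (path, bytes) for every leaf file, descending into nested ZIP archives.

    Raises ValueError when content cannot be opened for inspection, so the
    caller can block or quarantine it instead of letting it pass unexamined.
    """
    if not zipfile.is_zipfile(io.BytesIO(data)):
        yield name, data  # leaf content, ready for a detection method
        return
    if depth >= max_depth:
        raise ValueError(f"{name}: nested too deeply to inspect")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.flag_bits & 0x1:  # bit 0 marks an encrypted member
                raise ValueError(f"{name}/{info.filename}: encrypted, cannot inspect")
            inner = archive.read(info)
            yield from crack(f"{name}/{info.filename}", inner, depth + 1, max_depth)
```

A caller would wrap `list(crack(...))` in a `try` block and route the item to a quarantine queue when `ValueError` or `zipfile.BadZipFile` is raised. A production tool would also need handlers for other container and document formats.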

Four Common Methods of Detecting Unauthorized Data

  • Rules-based expression analysis examines the content of a digital document
    for specific rules or patterns (e.g., a nine-digit ID pattern, nnn-nn-nnnn, or a
    monetary format, $nn,...) within the document or within a specified
    proximity of each other. This method is quick and effective at identifying
    well-defined data structures within the document.

    In addition, rules-based expression analysis is an effective method to identify
    potential data leakages for data in motion. As data moves to a gateway, it can be
    checked for restricted data and, if identified, the data can be quarantined or
    blocked.

    Data in use can also be analyzed at the endpoints by detecting restricted
    information and preventing unauthorized copying onto hard drives, flash drives,
    and other portable media.

  • Keyword filtering is similar to rules-based expression analysis, except that a set of characters or words is used for comparison instead of searching for a data pattern. This is useful when identifying content associated with specific words or unique markers, such as ‘Top Secret,’ or inappropriate language. Keyword filtering quickly reviews all three digital media forms as well as restricted data sets that may be constantly changing.
  • Exact data matching or database fingerprinting also identifies a selected
    set of data. Fingerprinting compares an existing data set against data from the
    restricted content to determine if there is an exact match. If restricted data is
    identified in the content, then that content will be further examined while content with acceptable data will be transmitted or processed. If only a subset of the data is restricted, only that data subset needs to be used for comparison. While this method is accurate and thorough, it can be slow. Fingerprinting works well with imagery such as photos, videos, and PDFs. Because of the latency period, this method is best for data at rest and data in use; however, fingerprinting may also be used with data in motion if a transmission delay is acceptable.
  • Partial data matching is another method that compares restricted data to content. This method can also be used to look for complete data matches in the content. The restricted data is split into small sections, and each section is hashed and stored. A similar-size portion of the content is then hashed. The content is then offset by a few characters and again split into small sections and hashed. The hashed results are compared to determine if they match.

This method works best when the number of restricted documents is limited or
similar content or phrases are found across several restricted documents. This
method can be used to assess all three digital media forms, though latency
issues may occur with data in motion.
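
The partial data matching steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: SHA-256 as the hash, fixed-size character sections, and a sliding window over the content. The function names and the `size` and `step` parameters are hypothetical and chosen only for this example.

```python
import hashlib


def _digest(chunk):
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def index_restricted(text, size=16):
    """Split the restricted text into consecutive `size`-character sections and hash each one."""
    return {_digest(text[i:i + size]) for i in range(0, len(text) - size + 1, size)}


def partial_matches(content, index, size=16, step=1):
    """Return the offsets in `content` whose `size`-character window hashes into `index`.

    The window is shifted by `step` characters each time, so a copied passage
    is found even when it does not start on a section boundary.
    """
    return [
        i
        for i in range(0, len(content) - size + 1, step)
        if _digest(content[i:i + size]) in index
    ]
```

With `step=1` every alignment is tested, which is the most thorough and the slowest setting. A larger step reduces the work but can miss passages unless the restricted sections are indexed with overlap, which is one source of the latency trade-off noted for data in motion.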

The following table summarizes the effectiveness of each data detection method for
each digital media form.

| Method | Data in Motion | Data at Rest | Data in Use | Comments |
|---|---|---|---|---|
| Rules-Based Expressions | Very Good | Very Good | Good | Effective for well-defined data structures |
| Keyword Filtering | Very Good | Very Good | Very Good | Effective for data sets that are constantly changing |
| Fingerprinting | Good, can cause data latency | Very Good | Very Good | Effective for imagery such as videos and photos |
| Partial Data Matching | Good, can cause data latency | Good | Good | Effective with a limited number of restricted data sets |
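
As an illustration of the rules-based expression method summarized in the table, the following Python sketch flags a nine-digit ID pattern (nnn-nn-nnnn) that appears near a monetary amount. The monetary pattern, the 80-character proximity window, and the function name `find_rule_hits` are assumptions made for this example. An agency would define its own rules.

```python
import re

# Nine-digit ID written as nnn-nn-nnnn.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Assumed monetary format: a dollar sign, digits with optional thousands
# separators, and optional cents.
MONEY_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def find_rule_hits(text, proximity=80):
    """Return (id, amount) pairs that occur within `proximity` characters of each other."""
    hits = []
    for id_match in ID_PATTERN.finditer(text):
        for money_match in MONEY_PATTERN.finditer(text):
            # Distance between the two matches, whichever comes first.
            gap = max(
                money_match.start() - id_match.end(),
                id_match.start() - money_match.end(),
            )
            if gap <= proximity:
                hits.append((id_match.group(), money_match.group()))
    return hits
```

A gateway or endpoint agent could call `find_rule_hits` on each cracked document and quarantine anything that returns a non-empty list. Requiring two patterns in proximity reduces false positives compared with matching the ID pattern alone.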


Regardless of the method used to identify restricted data such as FTI, data may not always need to be blocked or quarantined. An Access Control list, containing a white list and a black list, could be created and used for recipients with access privileges, and checked prior to allowing the restricted data to be sent or stored.
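
A minimal Python sketch of such a check is shown below. It treats the white list as an allow list and the black list as a deny list. The function name `disposition`, the e-mail addresses, and the decision to quarantine unknown recipients are assumptions for illustration only.

```python
def disposition(recipient, allow_list, deny_list):
    """Decide what to do with restricted data addressed to `recipient`.

    Returns "block" for deny-listed recipients, "send" for allow-listed
    recipients, and "quarantine" for anyone on neither list.
    """
    recipient = recipient.strip().lower()
    if recipient in deny_list:
        return "block"
    if recipient in allow_list:
        return "send"
    return "quarantine"


# Hypothetical lists; entries are stored in lower case.
ALLOW = {"analyst@agency.example"}
DENY = {"personal@mail.example"}
```

Checking the deny list first means that a recipient who appears on both lists by mistake is still blocked.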

Encryption of data files and emails can be used to prevent unauthorized recipients from receiving restricted data. If a recipient does not have the authority to view or access restricted data, then they should not have the encryption key.

The Access Control list must be maintained and updated by regularly distributing new encryption keys. This prevents the transmission of restricted data to an incorrect or unknowing recipient.

Similarly, protective markings can be placed on restricted data so that recipients would have to have the same level of markings as the data before it can be accessed. This also prevents the data from being stored in unauthorized file locations or databases.

As a part of the Safeguards program, agencies must consider procedures for preventing data leakage. Regardless of the approach used to prevent data leakage, all mitigation strategies should be based on identifying a distinctive data element or sequence that is unique from other data, and comparing that element within the data environment.

However, when trying to protect FTI data, it is not always the data that is unique but rather the source of the data. For example, the agency could develop procedures to look for Social Security numbers (SSN) in e-mails and other correspondence, or the storing of SSNs and total income in data storage files, but these FTI data elements could be the exact same data elements that the agency has collected. The source of the data is not readily identifiable from the data elements.

Unique Markers

The detection of data leakage relies heavily on an agency’s ability to clearly distinguish FTI data from other data.

To identify the source of the data as FTI, agencies could create tags or unique markers that are entered into the data or augment the data. As soon as the data is received, it should be processed to add clearly identifiable unique markers that distinguish the source as FTI.

Data labels should be used to implement mandatory access controls, restricting access to authorized users. For databases and spreadsheets, unique markers can alter data file names, the column headers in the data files and the data elements in the data files. All files could be renamed to start with the clear distinction of “FTI_”. For example, a file named “RETURNS_2008” would be renamed “FTI_RETURNS_2008”.

Similarly, within the data files, the column headers could be renamed to show the source as FTI, such as “SSN” and “TOTAL_INCOME” being renamed to “FTI_SSN” and “FTI_TOTAL_INCOME”. This could even be implemented on the data elements that are non-computational. For example, SSNs and ADDRESSES would be augmented to “FTI_123456789” and “FTI_123 Main ST”.
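
The renaming described above can be sketched in Python for a comma-separated data file. This is only an illustration: the function names `mark` and `mark_table` and the `text_columns` parameter are hypothetical, and the sketch assumes the file has a header row.

```python
import csv
import io

MARKER = "FTI_"


def mark(value):
    """Prefix a value with the marker unless it already carries it."""
    return value if value.startswith(MARKER) else MARKER + value


def mark_table(filename, csv_text, text_columns):
    """Return (marked file name, marked CSV text).

    `text_columns` names the non-computational columns whose values are also
    marked. Numeric columns such as TOTAL_INCOME keep their values so that
    calculations still work.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([mark(name) for name in fieldnames])
    for row in reader:
        writer.writerow(
            [mark(row[name]) if name in text_columns else row[name] for name in fieldnames]
        )
    return mark(filename), out.getvalue()
```

Because `mark` leaves already-marked values alone, running the process twice on the same file name does not produce a doubled “FTI_FTI_” prefix.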

At all times, unique markers should be used to identify any type of FTI data that is received from the IRS. Unique markers could be added to text document filenames identifying the source of the data as FTI. Likewise, the file could be renamed to start with the “FTI_” distinction. Headers and footers of documents could be augmented to include “FTI” notations. This is similar to classification markers such as SECRET, or SENSITIVE BUT UNCLASSIFIED on documentation. This process could also be used for presentation documents such as PowerPoint slides.

Data leakage protections must be implemented across all three forms of digital media: data in motion, data at rest, and data in use. The use of a common notation such as “FTI” as the unique marker for all data will make the process similar for all three categorizations.

For example, the Keyword Filtering method analyzes the content for a specific set of words. Since the letter pattern of “FTI” is unique, it is unlikely to produce false positives in the content. Furthermore, the small size of the pattern will allow for faster review of the content.
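
A keyword filter for the marker could look like the following Python sketch. The regular expression and the function names are assumptions for illustration. The pattern is case-sensitive and requires “FTI” to start a word and be followed by an underscore or the end of the word, so that longer words which happen to contain the letters are not flagged.

```python
import re

# "FTI" at the start of a word, followed by "_" (as in FTI_SSN) or by the
# end of the word (as in a header or footer notation).
MARKER_PATTERN = re.compile(r"\bFTI(?=_|\b)")


def contains_marker(text):
    """Return True when the text carries the FTI marker."""
    return MARKER_PATTERN.search(text) is not None


def scan(documents):
    """Return the names of the documents whose content carries the marker.

    `documents` maps a document name to its text content.
    """
    return [name for name, text in documents.items() if contains_marker(text)]
```

The same filter can be applied to file names, column headers, and document headers and footers, since all of them use the same marker.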

Adding unique markers to FTI data and implementing employee awareness training will assist the agency in preventing data leakages. This is accomplished by:

  • Informing the user of the need to restrict the data, even if it is removed before usage,
  • Avoiding the commingling of the data with naming conventions that inform database administrators (DBA) of the data restrictions,
  • Displaying warnings that application developers program into their software, informing users of the data restrictions before the user can access the data and
  • Modifying command processes to identify and protect FTI data. For example, when a print request command is made for FTI data, the command process would ensure a cover page is also printed that covers the printout resting on the printer. This cover page would clearly state the sensitivity of the data, the need for protection, and the intended recipients’ telephone numbers and locations.

Although data detection methods have no effect on FTI data in print format, the use of a unique marker for all FTI data will distinguish the use of FTI data on documents. As stated previously, the Safeguards program already has controls in place to protect printed FTI data from data leakage. To ensure clear marking of FTI data on all printed formats, the agency must train its personnel to adhere to data protection controls.

With the inclusion of unique markers for FTI data, the use of automated tools to implement prevention controls and the implementation of training and awareness programs for all employees, the agency can safeguard against data leakages. Additionally, similar naming conventions should be put into place to protect the agency’s own information from data leakages.

When protecting FTI data from leakage, it is not the data that is unique but the source of the data. Therefore, unique markers can be added to the data to identify it as FTI (i.e., in the data file names, the column headers in the data files, and the data elements in the data files). This would allow automated tools such as keyword filtering to identify the unique markers for FTI data in all digital media forms. Once identified, the restricted data would be blocked, quarantined, or allowed to pass through to authorized recipients.

Data should also be checked for FTI markers before data is copied to unrestricted formats or storage locations. The FTI markers inform users, such as DBAs and developers, of any data restrictions.
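
A pre-copy check of this kind could be sketched in Python as follows. The approved location `/secure/fti`, the function names, and the rule that unmarked data may be copied anywhere are hypothetical assumptions. Paths are assumed to be already normalized (no `..` segments).

```python
from pathlib import PurePosixPath

MARKER = "FTI_"
# Hypothetical storage locations approved for FTI.
RESTRICTED_ROOTS = (PurePosixPath("/secure/fti"),)


def carries_marker(filename, header_line):
    """Return True when the file name or any column header starts with the marker."""
    return filename.startswith(MARKER) or any(
        column.strip().startswith(MARKER) for column in header_line.split(",")
    )


def copy_allowed(filename, header_line, destination):
    """Allow the copy unless marked data would land outside an approved location."""
    if not carries_marker(filename, header_line):
        return True
    dest = PurePosixPath(destination)
    return any(root == dest or root in dest.parents for root in RESTRICTED_ROOTS)
```

A back-up or export job could call `copy_allowed` before each transfer and log or quarantine any request that returns `False`.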

In addition to preventing data leakages for the Safeguards program, agencies should add their own unique markers to their data and protect their agency’s own information from data leakages by using similar methods and controls.


Additional information can be found in the following documents: