To improve detection rates for sensitive data
in your organization, you can define the machine learning data pattern
as match criteria to identify sensitive assets in your cloud apps
and protect them from exposure. By default, the machine learning
category is enabled and applied to all your cloud apps, but you
can disable a machine learning data pattern.
SaaS Security API uses supervised machine learning algorithms to
classify sensitive documents into three top-level categories:
Financial, Legal, and Healthcare. Documents in a top-level category
can also be classified into sub-categories; for example, a financial
accounting document falls into a sub-category of the Financial
top-level category.
The Palo Alto Networks Data Science team collects large numbers of
documents for each category that serve as the foundation for
classification. The labeled data is then split into training, testing,
and verification data sets. The training data set is used to learn the
classification model, the testing data set is used to tune the model,
and the verification data set is used to evaluate the model.
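The three-way split described above can be sketched as follows. This is only an illustration: the function name, the 70/15/15 proportions, the fixed seed, and the sample corpus are all assumptions, since the actual split used by the Data Science team is not documented here.

```python
import random

def split_labeled_documents(documents, train_frac=0.7, test_frac=0.15, seed=0):
    """Shuffle labeled documents and split them into training, testing,
    and verification sets. The fractions are hypothetical defaults."""
    if train_frac <= 0 or test_frac < 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must leave room for a verification set")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]                    # used to learn the model
    test = shuffled[n_train:n_train + n_test]     # used to tune the model
    verify = shuffled[n_train + n_test:]          # used to evaluate the model
    return train, test, verify

# Hypothetical labeled corpus: (document text, category label) pairs.
corpus = [(f"document {i}", label)
          for i, label in enumerate(["Financial", "Legal", "Healthcare"] * 10)]
train_set, test_set, verify_set = split_labeled_documents(corpus)
```

Keeping the verification set separate from the set used for tuning means the final evaluation is made on documents the model was never adjusted against.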
Preprocessing the labeled training data generates features. The
feature text is tokenized into n-gram words and processed to remove
stop words, special characters, punctuation, and so on. The classifier
represents the features using a vector space model, which produces a
high-dimensional document-feature matrix, and then identifies the
significant features to reduce the dimension of that matrix. For each
significant feature, SaaS Security API computes a term
frequency-inverse document frequency (TF-IDF) weight, and the weight
is normalized to remove the effect of differing document lengths. At
the end of data preprocessing, the labeled documents have been
transformed into labeled feature vectors that are fed into the
supervised machine learning algorithms.
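To make the preprocessing steps more concrete, here is a minimal sketch in plain Python of tokenizing into n-grams, removing stop words and punctuation, and computing length-normalized TF-IDF weights. It is not the SaaS Security API implementation. The stop-word list, the document-frequency rule used to pick "significant" features, the smoothed IDF formula, and the L2 normalization are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the product's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

def tokenize(text, max_n=2):
    """Lowercase the text, drop special characters, punctuation, and stop
    words, then emit word n-grams for n = 1..max_n."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    ngrams = []
    for n in range(1, max_n + 1):
        ngrams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return ngrams

def tfidf_vectors(documents, max_features=50):
    """Return (vocabulary, vectors); each vector is an L2-normalized
    TF-IDF row over the selected features."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: number of documents that contain each feature.
    doc_freq = Counter(f for tokens in tokenized for f in set(tokens))
    # Crude feature selection (assumption): keep the features that occur
    # in the most documents, breaking ties alphabetically.
    vocabulary = sorted(doc_freq, key=lambda f: (-doc_freq[f], f))[:max_features]
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF keeps the weight positive for very common features.
        row = [counts[f] * (math.log((1 + n_docs) / (1 + doc_freq[f])) + 1)
               for f in vocabulary]
        # L2 normalization removes the effect of document length.
        norm = math.sqrt(sum(w * w for w in row))
        vectors.append([w / norm if norm else 0.0 for w in row])
    return vocabulary, vectors

# Hypothetical sample documents.
docs = ["The quarterly balance sheet and the income statement.",
        "The patient diagnosis and treatment plan.",
        "Balance sheet audit for the fiscal year."]
vocab, vecs = tfidf_vectors(docs)
```

Each row in `vecs` is one feature vector. Paired with the document's category label, it becomes the kind of labeled feature vector that a supervised classifier takes as input.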
Snippets are not supported for
machine learning data patterns; instead, the SaaS Security web interface
displays the keywords that were flagged because of the machine learning
data pattern.