Configure Regular Expressions

To make content matching easy to configure, filter, and assess, you can now use the Aperture regex builder to build basic and weighted regular expressions.
The regular expression builder in Aperture provides an easy mechanism for configuring basic and weighted regular expressions. A regular expression (regex for short) describes a specific text pattern to search for, and Aperture displays the match occurrences when the pattern is found. With weighted regular expressions, each expression is assigned a score, and when the score threshold is exceeded (for example, when enough expressions from a pattern match an asset), the asset is indicated as a match for the pattern. You can use the expression builder to construct a basic or weighted data pattern expression, view matches, filter occurrences and weight thresholds, and assess match results to determine if the content poses a risk to your organization.
Custom data patterns cannot be disabled; they can only be deleted.
  • Define the data pattern settings.
    1. Select Settings > Data Pattern > New Data Pattern.
    2. Enter a Name for the data pattern.
    3. Enter a Short Label description for the data pattern that is 10 characters or fewer.
    4. Specify if the data pattern is Basic or Weighted.
  • Configure the regular expression.
    1. Enter one regular expression per line, up to 100 lines of expressions.
    2. (Weighted expressions only): Assign each line entry a score between -9999 (lowest importance) and 9999 (highest importance) by entering the regular expression, the delimiter, and the weight score. You must enter a weight threshold score of one (1) or more.
  • (Optional) Customize your delimiter.
    By default, the delimiter for all weighted regular expressions is the semicolon ( ; ). You can customize your delimiter so that you can copy and paste existing expressions instead of entering them manually.
  • Select a Category to scan.
    If you select the uncategorized category, Aperture scans all assets in your sanctioned cloud apps to locate a match for the expressions, and match results may take longer to display.
  • Save your data pattern settings.
    WildFire and machine learning data patterns do not have occurrences to specify in Match Criteria.
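As an illustration of the weighted entry format described in the steps above (regular expression, delimiter, weight score), the following Python sketch parses such lines and checks the stated limits of 100 lines and scores from -9999 to 9999. The function name `parse_weighted_lines` and the decision to split at the last delimiter on each line are assumptions made for this example; the source does not describe how Aperture parses entries internally.

```python
MIN_SCORE, MAX_SCORE = -9999, 9999  # score range stated in the steps above
MAX_LINES = 100                     # one expression per line, up to 100 lines


def parse_weighted_lines(text, delimiter=";"):
    """Split each non-empty line into (expression, score) at the last delimiter."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) > MAX_LINES:
        raise ValueError(f"at most {MAX_LINES} expressions are allowed")
    entries = []
    for line in lines:
        expression, sep, score_text = line.rpartition(delimiter)
        if not sep:
            raise ValueError(f"missing delimiter {delimiter!r} in: {line!r}")
        score = int(score_text.strip())
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score {score} is outside {MIN_SCORE}..{MAX_SCORE}")
        entries.append((expression.strip(), score))
    return entries


entries = parse_weighted_lines("source water; 5\ntap water; -10")
print(entries)
```

Passing `delimiter="|"` (or any other single character) mimics a customized delimiter when pasting expressions that were written with a different separator.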
  • Best practices for using regular expression matches.
    This brief tutorial lists the basic concepts and commonly used constructs that you can use to craft time-saving basic and weighted regular expressions in Aperture. The following terms are used in this tutorial:
    Term                 Description
    Literal              A literal is any character you use in a search or matching expression. For example, to find dlp in Aperture, "dlp" is a literal string: each character plays a part in the search, and it is literally the string you want to find.
    Metacharacter        A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, the character ^ (caret) is a metacharacter.
    Regular Expression   The search expression (data pattern) that you use to search in Aperture.
    Escape Sequence      An escape sequence indicates that you want to use one of the metacharacters as a literal. In a regular expression, an escape sequence places the metacharacter \ (backslash) in front of the metacharacter that you want to use as a literal. For example, to find (dlp) in Aperture, use the search expression \(dlp\). To find \\file in the target string c:\\file, use the search expression \\\\file (each of the two literal backslashes is preceded by an escaping backslash).
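The escape sequence examples above can be checked with a short script. This sketch uses Python's `re` module purely for illustration; backslash escaping of these metacharacters works the same way there, but the variable names are hypothetical and nothing here is an Aperture API.

```python
import re

# \( and \) match literal parentheses instead of opening a group.
literal_parens = re.search(r"\(dlp\)", "see (dlp) policy")

# Unescaped parentheses form a group, so this expression matches plain "dlp".
grouped = re.search(r"(dlp)", "see dlp policy")

# The target text c:\\file contains two backslash characters.
target = "c:\\\\file"

# Each literal backslash needs its own escaping backslash: four in total.
double_backslash = re.search(r"\\\\file", target)

# re.escape adds the escaping backslashes to a literal string for you.
escaped = re.escape("(dlp)")

print(literal_parens.group(), grouped.group(), double_backslash.group(), escaped)
```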
  • Use data patterns.
    Use the predefined Aperture data patterns instead of regular expressions where possible. Data patterns are more efficient than regular expressions because the data patterns are tuned for accuracy and the data is validated. For example, if you want to search for social security numbers, use the US Social Security Number (SSN) data pattern instead of a regular expression.
  • Use regular expressions sparingly.
    Regular expressions can be computationally expensive. If you add a regular expression condition, observe the system for one hour to confirm that it performs efficiently. Make sure that the system does not slow down and that there are no false positives.
  • Test regular expressions.
    If you implement regular expression matching, consider using a third-party tool to test the regular expressions before you enable the policy rules. The recommended tool is RegexBuddy. Another good tool for testing your regular expressions is RegExr.
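Alongside a dedicated tool, a small script can run a candidate expression against text that should and should not match. The sketch below is one hedged way to do that in Python; the `check_pattern` helper and the `PRJ-` project-code samples are invented for illustration, and results in Python's `re` may differ from Aperture's Java-based engine for more advanced constructs.

```python
import re


def check_pattern(pattern, should_match, should_not_match):
    """Return (missed, false_positives) for a candidate expression."""
    compiled = re.compile(pattern)
    missed = [s for s in should_match if not compiled.search(s)]
    false_positives = [s for s in should_not_match if compiled.search(s)]
    return missed, false_positives


# Hypothetical samples: a project code is PRJ- followed by exactly four digits.
should_match = ["Ticket PRJ-0042 opened", "PRJ-1234"]
should_not_match = ["PRJ-12", "project 1234", "PRJ-123456"]

# The loose expression also fires on the first four digits of a longer number.
loose = check_pattern(r"PRJ-\d{4}", should_match, should_not_match)

# Adding a word boundary after the digits removes that false positive.
tight = check_pattern(r"PRJ-\d{4}\b", should_match, should_not_match)

print(loose)
print(tight)
```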
  • Regular expression constructs.
    Aperture implements the Java regular expression syntax for policy condition matching. The following table provides some common reference constructs for writing regular expressions to match or exclude characters in content.
    Construct   Description
    .           A dot matches any single character except newline (line ending, end of line, or line break) characters.
    \           Escapes the next character (the character becomes a normal, literal character).
    \d          Any digit (0-9).
    \s          Any white space.
    \w          Any word character (a-z, A-Z, 0-9, and underscore).
    \D          Anything other than a digit.
    \S          Anything other than a white space.
    [ ]         Elements inside brackets are a character class (for example, [abc] matches 1 character: a, b, or c).
    ^           At the beginning of a character class, negates it (for example, [^abc] matches anything except a, b, or c).
    $           At the end of a regular expression, matches the end of the input, or just before a newline at the end of the input.
    +           Following a regular expression means 1 or more (for example, \d+ means 1 or more digits).
    ?           Following a regular expression means 0 or 1 (for example, \d? means 1 or no digit).
    *           Following a regular expression means any number (for example, \d* means 0, 1, or more digits).
    (?i)        At the beginning of a regular expression makes it case-insensitive (regular expressions are case-sensitive by default).
    ( )         Groups regular expressions together.
    (?s)        Makes a period ( . ) match even newline characters.
    |           Means OR (for example, A|B means regular expression A or regular expression B).
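A few of these constructs can be tried out in a short script. The sketch below uses Python's `re` module for illustration only; for the simple ASCII inputs shown, these constructs behave the same way as in Java syntax, where `\w` is the word-character class and `(?s)` is the flag that lets the dot match newline characters. The sample text and variable names are invented for the example.

```python
import re

text = "Order 66 shipped\nto the DLP team"

digits = re.findall(r"\d+", text)                 # one or more digits
words = re.findall(r"\w+", "a_b c-d", re.ASCII)   # letters, digits, underscore
not_abc = re.findall(r"[^abc]", "abcd1")          # negated character class
either = re.findall(r"cat|dog", "cat, dog, cow")  # alternation (OR)

# Expressions are case-sensitive unless (?i) is placed at the beginning.
case_sensitive = re.search(r"dlp", text)
case_insensitive = re.search(r"(?i)dlp", text)

# The dot skips newline characters unless (?s) is placed at the beginning.
dot_default = re.search(r"shipped.to", text)
dot_all = re.search(r"(?s)shipped.to", text)

print(digits, words, not_abc, either)
print(case_sensitive, case_insensitive.group())
print(dot_default, repr(dot_all.group()))
```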
  • Regular expression quantifiers.
    Quantifiers specify how many times a part of a pattern must match or repeat. A quantifier binds to the expression or group to its immediate left.
    Quantifier   Description
    *            Match 0 or more times.
    +            Match 1 or more times.
    ?            Match 1 or 0 times.
    {n}          Match exactly n times.
    {n,}         Match at least n times.
    {n,m}        Match at least n but not more than m times.
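To see how each quantifier changes what is accepted, the following Python sketch tests a fixed set of digit strings against one expression per quantifier, and then shows how a quantifier binds to the item on its immediate left. The sample strings and the `full_matches` helper are made up for illustration, and Python's `re` is used only as a stand-in for these basic quantifiers.

```python
import re

samples = ["", "7", "42", "123", "12345"]


def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]


results = {
    r"\d*": full_matches(r"\d*"),          # 0 or more digits
    r"\d+": full_matches(r"\d+"),          # 1 or more digits
    r"\d?": full_matches(r"\d?"),          # 0 or 1 digit
    r"\d{3}": full_matches(r"\d{3}"),      # exactly 3 digits
    r"\d{2,}": full_matches(r"\d{2,}"),    # at least 2 digits
    r"\d{2,3}": full_matches(r"\d{2,3}"),  # 2 to 3 digits
}

# A quantifier binds to the item on its left: b alone, or the whole group (ab).
binds_to_b = bool(re.fullmatch(r"ab+", "abab"))
binds_to_group = bool(re.fullmatch(r"(ab)+", "abab"))

for pattern, matched in results.items():
    print(pattern, matched)
print(binds_to_b, binds_to_group)
```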
  • Regular expression delimiters.
    A delimiter separates strings of data when you configure regular expressions. For example, you can configure a weighted regular expression using a delimiter to separate the string of text you are matching from the weight score. If you have a large number of existing expressions to match, you can customize your delimiter to copy and paste the expressions instead of entering them manually. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character. Common delimiters in Aperture include:
    Delimiter   Note
    ;           Semicolon. If the delimiter is not customized, the semicolon is the default delimiter in Aperture.
    :           Colon.
    |           Pipe.
    /           Forward slash. If the delimiter needs to be matched inside the pattern, it must be escaped using a backslash ( \ ). If the delimiter appears often inside the pattern, it is a good idea to choose another delimiter to increase readability.
    +           Plus. Include phrase for matching.
    -           Minus. Ignore phrase for matching.
    #           Hash. Can be used to denote a number.
    ~           Tilde.
    { }         Curly brackets.
    [ ]         Square brackets.
    ( )         Parentheses.
    < >         Angle brackets.
    Bracket-style delimiters: Brackets are used to find a range of characters. Bracket-style delimiters do not need to be escaped when they are used as metacharacters within the pattern, but they must be escaped when they are used as literal characters.
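The rule that a delimiter must be backslash-escaped when it appears inside the pattern can be sketched as follows. This Python example splits an entry only at a delimiter that is not preceded by a backslash; the `split_entry` helper is hypothetical, it does not handle an escaped backslash directly before the delimiter, and it is not a description of how Aperture processes entries.

```python
import re


def split_entry(line, delimiter="/"):
    """Split 'expression<delimiter>score' at the one unescaped delimiter."""
    # (?<!\\) is a lookbehind: the delimiter must not follow a backslash.
    unescaped = re.compile(r"(?<!\\)" + re.escape(delimiter))
    parts = unescaped.split(line)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one unescaped {delimiter!r} in {line!r}")
    expression, score = parts
    return expression.strip(), int(score)


# The slashes inside the pattern are escaped, so only the last one separates.
url_entry = split_entry(r"https:\/\/example\.com/5")

# With a pipe delimiter, an OR inside the pattern has to be escaped as well.
pipe_entry = split_entry(r"red\|blue|3", delimiter="|")

print(url_entry)
print(pipe_entry)
```

In the second entry the escaped pipe becomes a literal pipe rather than an OR, which is one reason to choose a delimiter that does not appear in your expressions.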
  • Use case: calculating a weighted regular expression.
    To reduce false positives and maximize the search performance of your regular expressions, you can use the weighted regular expression builder in Aperture to assign and calculate scores for the information that is important to you. Scoring applies a match threshold, and when the threshold is exceeded (for example, when enough words from a pattern are found in a document), the document is indicated as a match for the pattern.
    For example, Joe is an employee at a water treatment plant and needs to compile usage data on a proprietary pH additive that is used when source water arrives at the plant. If Joe runs a regular expression search with just the term "tap water", thousands of match results display, because the matched tap water documents list the additive. However, Joe is searching for the first use of the additive, not every document that lists it, so the usage data he needs is difficult to find.
    To get more accurate results, Joe can use a weighted regular expression to assign weight and occurrence scores to the expression, or assign a negative weight value to indicate the information to exclude.
    Joe enters a negative weight value to exclude tap water and assigns higher values to source water and the proprietary water additive. The results are filtered and counted into a more manageable list, meaning that a document containing 10 occurrences of water counts as one when all files and folders are scanned. This enables Joe to view the match results, adjust the totals for weight and occurrences, and calculate an adjusted score to determine if the content poses a risk to his organization.
    Example weighted regular expression scoring:
    Weighted Regex Item   Occurrences   Adjusted Occurrence Score
    Water; 1              50            50 (50 occurrences x 1)
    IP pH; 2              30            60 (30 occurrences x 2)
    Tap Water; -10        10            -100 (10 occurrences x -10)
    Adjusted Total: 110 minus 100 for tap water = 10 regex weight
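The arithmetic in this example can be reproduced with a few lines of Python. The sketch multiplies each weight by the occurrence count implied by the adjusted scores (50, 30, and 10 occurrences) and sums the results. The `THRESHOLD` value is hypothetical and chosen only to show the comparison; the source requires a threshold of 1 or more but does not give one for this example.

```python
# (item, weight, occurrences) from the example scoring
entries = [
    ("Water", 1, 50),
    ("IP pH", 2, 30),
    ("Tap Water", -10, 10),
]

# Adjusted occurrence score = weight x occurrences
adjusted_scores = {item: weight * count for item, weight, count in entries}

# Adjusted total = sum of all adjusted occurrence scores
adjusted_total = sum(adjusted_scores.values())

THRESHOLD = 5  # hypothetical threshold, not taken from the source
is_match = adjusted_total > THRESHOLD

print(adjusted_scores)  # {'Water': 50, 'IP pH': 60, 'Tap Water': -100}
print(adjusted_total)   # 10
print(is_match)         # True
```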
