Incident and Alert Event Categories
This section describes the incident and alert event categories in Prisma SD-WAN.
Different types of events trigger the alerts and incidents that the system generates.
These events fall into broad categories: hardware and software issues, device
related issues, peering issues, site level issues, tunnel issues, application
performance issues, and secure fabric link issues. Depending on the type of event, an
issue may originate from the ION device or the controller.
Hardware and Software Issues
Hardware Issues—Hardware issues are raised by the device and concern the device
hardware. For example:
- Power supply unit failure (Incident—warning).
- Memory DIMM failure (Incident—critical).
- Disk SMART threshold exceeded (Alert).
- Disk failure (IO error) (Incident—critical).
Software Issues—Software issues are raised by the device and concern the device
software. For example:
- Software module fails to start or restart (Incident).
- High CPU usage (Alert).
- High disk usage (Alert).
- Corrupt database (Incident).
Device Related Issues
Device level issues include offline devices and devices with missing or incorrect
interface, routing, DHCP, port, or NAT configurations.
For example, when a device is in an offline state for an extended period of time, it
impacts the VPNs connected to other sites. For device-to-controller connectivity issues,
troubleshoot the device through the device toolkit, using utilities such as Ping and
Curl. You may also check device reachability issues using Ping or SSH.
Device Interface Issues—Interface issues are raised by the device and concern the
device interfaces. For example:
- Interface down (Incident—warning).
- Excessive errors on the interfaces (Alert).
Device Registration Issues—Registration issues are raised by the device or the
controller, depending on the incident, and concern device registration. For
example:
- Failure to establish CIC channel with the controller (Incident—warning).
- Failure to retrieve inventory after repeated attempts (Incident—warning).
BGP Peering Issues
Routing level issues include routing protocol configuration issues, BGP neighbor
establishment issues, static routing issues such as misconfigured static routes, and
issues with prefix learning or advertisement.
- For BGP session establishment issues, check for an incorrect AS number or incorrect global parameters, check for an incorrect BGP peer IP, and check whether BGP multi-hop is required.
- For BGP peer type issues (data center only), check that the right peer type (core, edge, or classic) is selected. The edge peer only learns prefixes.
- For static routing issues, check the configured administrative distance, the next hop (interface, IP, or self), and the local or global scope that blocks or allows advertisement into the Prisma SD-WAN fabric.
- For prefix learning and advertisement issues, check the route map configurations on Prisma SD-WAN and BGP peer devices, check for interactions with other routing protocols in the enterprise network, and check for split or no-split prefix scenarios.
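The session establishment checks in the first item can be sketched as a small comparison between what is configured and what the peer actually uses. This is an illustrative sketch only: `ConfiguredPeer`, its fields, and `session_findings` are hypothetical names, not part of any Prisma SD-WAN API, and the "directly connected" test is a simplifying assumption.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List


@dataclass
class ConfiguredPeer:
    """BGP peer settings as configured locally (hypothetical field names)."""
    remote_as: int          # AS number configured for the peer
    peer_ip: str            # peer IP configured for the session
    local_subnet: str       # subnet of the local interface facing the peer
    multihop: bool = False  # whether BGP multi-hop is enabled


def session_findings(cfg: ConfiguredPeer, actual_peer_as: int,
                     actual_peer_ip: str) -> List[str]:
    """Return a list of likely causes for a BGP session that does not come up."""
    findings = []
    if cfg.remote_as != actual_peer_as:
        findings.append("AS number mismatch")
    if ip_address(cfg.peer_ip) != ip_address(actual_peer_ip):
        findings.append("peer IP mismatch")
    # Assumption: a peer outside the local subnet is not directly connected.
    on_link = ip_address(actual_peer_ip) in ip_network(cfg.local_subnet, strict=False)
    if not on_link and not cfg.multihop:
        findings.append("peer is not directly connected; multi-hop may be required")
    return findings
```

An empty result means none of the three checks found a problem; it does not rule out the other causes listed above, such as peer type or route map issues.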
This category of events deals with Border Gateway Protocol (BGP) peering issues in the
data center. For example:
- Peering with a WAN edge router is down (Incident—warning).
- Peering with a core router is down (Incident—warning).
- Routes learned from WAN edge indicate private WAN is down for a branch site (Incident—warning).
Site Level Issues
Site level issues may impact a single site or all sites. Select a site with an issue and
quickly isolate site, link, or circuit issues. You can also isolate the issues through a
quick view of incidents for a site.
Site related issues include overall quality of a link, VPNs down or VPNs configured for
another site, connectivity issues, circuit issues, incorrect circuit definitions or
inaccurate characteristics for a circuit, bandwidth issues, device assignment for a
site, or issues with prefixes at a site.
A few other site level issues include scenarios where none of the applications are
prioritized. If no applications are prioritized, check QoS, and then check if LQM is
enabled or disabled. If the flow browser indicates that traffic is bridged, despite a
configured port and an existing policy, then the issue could be the result of an
interface without a circuit label.
Site level incidents are raised by the device or the controller, depending on the
incident, and concern site level connectivity. For example:
- Direct Internet path unavailable at a branch site (Incident—warning).
- Private WAN path unavailable at a branch site (Incident—warning).
- All paths unavailable at a branch site (Incident—critical), raised by the controller.
- Loss of connectivity with the device/site (Incident—critical), raised by the controller.
- Path labels specified in policies not available/assigned for the site.
From Release 5.5.1, two new incidents are generated for site connectivity issues.
SITE_CONNECTIVITY_DOWN
This incident is raised when the site has lost connectivity with the controller and with
all of the remote branches or data centers. The behavior differs between a branch and a
data center.
- At a data center site, this incident is raised when all the network secure fabric links are down.
- At a branch site, this incident is raised when all network secure fabric links are down and all devices at the site are disconnected from the controller for more than 10 minutes.
The incident is cleared when connectivity is re-established or if at least one of the
VPNs is up.
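The raise condition for SITE_CONNECTIVITY_DOWN can be summarized as a short predicate. The sketch below is a reading of the rules above, not controller code: `SiteState` and its fields are hypothetical names, and treating a site with zero links up as "all links down" is an assumption.

```python
from dataclasses import dataclass

BRANCH_DISCONNECT_THRESHOLD_MIN = 10  # from the branch rule above


@dataclass
class SiteState:
    """Hypothetical snapshot of a site; field names are illustrative only."""
    site_type: str                        # "branch" or "data_center"
    fabric_links_up: int                  # secure fabric links currently up
    minutes_all_devices_disconnected: float  # 0 if any device reaches the controller


def site_connectivity_down(site: SiteState) -> bool:
    """Return True when the SITE_CONNECTIVITY_DOWN condition holds."""
    all_links_down = site.fabric_links_up == 0
    if site.site_type == "data_center":
        # Data center: link state alone decides.
        return all_links_down
    # Branch: links down AND every device disconnected for more than 10 minutes.
    return (all_links_down
            and site.minutes_all_devices_disconnected > BRANCH_DISCONNECT_THRESHOLD_MIN)
```

For example, a branch with no links up whose devices have been disconnected for 5 minutes does not yet meet the condition, while a data center with no links up does.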
SITE_CONNECTIVITY_DEGRADED
The controller raises this incident when any one of the following conditions exists:
- The Direct Internet or Direct Private WAN is down.
- At least one of the secure fabric links at the site is down.
- At least one of the service links is down.
This incident is cleared when all WAN paths are up.
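The SITE_CONNECTIVITY_DEGRADED conditions amount to "any WAN path is down". A minimal sketch follows, assuming each path is represented as a boolean that is True when the path is up. `WanPaths` and its fields are hypothetical names, and the sketch does not model how this incident interacts with SITE_CONNECTIVITY_DOWN.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WanPaths:
    """Up/down state per path (True = up); names are illustrative only."""
    direct_paths_up: List[bool] = field(default_factory=list)   # Direct Internet / Direct Private WAN
    fabric_links_up: List[bool] = field(default_factory=list)   # secure fabric links
    service_links_up: List[bool] = field(default_factory=list)  # service links


def site_connectivity_degraded(paths: WanPaths) -> bool:
    """Return True when at least one of the three listed conditions holds."""
    direct_down = any(not up for up in paths.direct_paths_up)
    fabric_down = any(not up for up in paths.fabric_links_up)
    service_down = any(not up for up in paths.service_links_up)
    return direct_down or fabric_down or service_down
```

The function returns False only when every path is up, which matches the clear condition.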
Secure Fabric Link Issues
A secure fabric link connects a branch site and a data center, or two branch
sites. Palo Alto Networks enables virtual private network (VPN) overlays on all public
and private circuits between a branch site and a data center. VPN overlays between Palo
Alto Networks branch sites are disabled by default. You can selectively enable or
disable these VPNs on the Prisma SD-WAN web interface.
This category of events deals with virtual private network (VPN) link connectivity
issues. For example:
- A warning incident is raised when all the VPN links from an active branch site for a given secure fabric link are down (Incident—warning).
- An informational incident is raised when there is more than one VPN link from the active branch site for a given secure fabric link, and at least one link is up and at least one is down (Incident—informational).
- The following VPN link related incidents are aggregated as secure fabric link incidents (Incident—warning when the secure fabric link is down and informational when the secure fabric link is degraded):
  - NETWORK_VPNBFD_DOWN
  - NETWORK_VPNLINK_DOWN
  - NETWORK_VPNPEER_UNAVAILABLE
  - NETWORK_VPNPEER_UNREACHABLE
  - NETWORK_VPNSS_UNAVAILABLE
  - NETWORK_VPNSS_MISMATCH
- The following incidents indicate that the underlay network connectivity for a given circuit is down:
  - DEVICEHW_INTERFACE_DOWN
  - NETWORK_DIRECTINTERNET_DOWN
  - NETWORK_DIRECTPRIVATE_DOWN
Incidents raised for the corresponding secure fabric links that use the underlay
connectivity are suppressed.
When an administrator sets an interface to Admin Down, all the corresponding secure
fabric link incidents are suppressed, and the Reason field of the incident shows this
condition.
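The severity and suppression rules for a secure fabric link can be sketched as a single function. This is an interpretation of the rules in this section under stated assumptions: the function name, its parameters, and the use of `None` for "no incident shown" are all hypothetical.

```python
from typing import List, Optional


def fabric_link_incident(vpn_links_up: List[bool],
                         underlay_down: bool = False,
                         admin_down: bool = False) -> Optional[str]:
    """Return the incident severity for one secure fabric link, or None.

    vpn_links_up holds one boolean per VPN link (True = up).
    """
    # Underlay failure or Admin Down suppresses secure fabric link incidents.
    if underlay_down or admin_down:
        return None
    if not vpn_links_up:
        return None  # no VPN links to evaluate
    if not any(vpn_links_up):
        return "warning"        # every VPN link is down
    if not all(vpn_links_up):
        return "informational"  # mixed: at least one up and at least one down
    return None                 # everything is up
```

For example, `fabric_link_incident([True, False])` yields `"informational"`, while the same input with `underlay_down=True` yields `None` because the underlay incident takes precedence.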
Service Endpoint Level Issues
Service endpoint level issues occur when service links to an endpoint go down. When a
service link to a single endpoint goes down, the ION device raises a
DEVICEHW_INTERFACE_DOWN incident. From Release 5.6.1, when an ION device raises
DEVICEHW_INTERFACE_DOWN incidents for at least two standard VPN interfaces to the same
service endpoint, the controller raises the NETWORK_STANDARD_VPN_ENDPOINT_DOWN summary
incident.
This incident is raised only when it is associated with Endpoint destinations and not
Peer destinations.
The NETWORK_STANDARD_VPN_ENDPOINT_DOWN incident is not raised when:
- The parent interface is in Admin Down state.
- There exists a DEVICEHW_INTERFACE_DOWN incident on the parent interface.
- The parent interface does not have a WAN label attached.
- The parent interface is in operationally down state.
- The device is disconnected from the controller.
- There exists a Layer 3 reachability issue on the parent interface.
The NETWORK_STANDARD_VPN_ENDPOINT_DOWN incident is cleared when only one standard VPN
interface down incident remains and all the others are cleared.
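The raise and suppression conditions for NETWORK_STANDARD_VPN_ENDPOINT_DOWN can be sketched as follows. This is an illustrative reading of the rules above: `ParentInterface`, its fields, and `endpoint_down_raised` are hypothetical names rather than controller internals.

```python
from dataclasses import dataclass


@dataclass
class ParentInterface:
    """State of the parent interface (hypothetical field names)."""
    admin_down: bool = False
    interface_down_incident: bool = False  # DEVICEHW_INTERFACE_DOWN on the parent
    has_wan_label: bool = True
    oper_down: bool = False
    l3_reachable: bool = True


def endpoint_down_raised(down_vpn_interfaces: int,
                         destination_type: str,
                         parent: ParentInterface,
                         device_connected: bool = True) -> bool:
    """Return True when the summary incident would be raised."""
    if destination_type != "endpoint":  # Peer destinations are excluded
        return False
    if not device_connected:
        return False
    if (parent.admin_down or parent.interface_down_incident
            or not parent.has_wan_label or parent.oper_down
            or not parent.l3_reachable):
        return False
    # At least two standard VPN interfaces to the same endpoint must be down.
    return down_vpn_interfaces >= 2
```

The threshold of two also reflects the clear condition: once only one standard VPN interface down incident remains, the function returns False.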
Logical Interface Level Issues
When a parent interface is down, the controller raises an incident for it and
suppresses the interface down incidents raised by its logical interfaces.
The controller clears the incident only when the parent interface is up. Once the
parent interface is up, the interface down incident of any child interface that is
still down is no longer suppressed.
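The parent and child suppression behavior can be sketched as a function that returns which interface down incidents remain visible. The function name, the `"parent"` label, and the dictionary of child states are hypothetical and serve only to illustrate the rule.

```python
from typing import Dict, List


def visible_interface_incidents(parent_down: bool,
                                children_down: Dict[str, bool]) -> List[str]:
    """Return the interfaces whose down incidents are shown.

    children_down maps a logical interface name to True when it is down.
    """
    if parent_down:
        # Parent incident is raised; child incidents are suppressed.
        return ["parent"]
    # Parent is up: any child that is still down keeps its own incident.
    return [name for name, down in children_down.items() if down]
```

For example, with the parent down only `"parent"` is returned, regardless of child state. With the parent up, only the child interfaces that are still down are returned.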