I wrote a basic NQE script to obtain the uptime for devices and display it in years, days and hours. When I reviewed the results I saw some very robust devices out there that had not been rebooted for several years.
This led to some discussion about whether uptime could be used in part to determine if device software upgrades were taking place regularly. As such, we developed a really basic uptime check with a threshold of around 6 months. Any device that has been up longer than 6 months has definitely not been upgraded in that time. (Our devices all reboot for software patches and upgrades.)
The script is below.
/*------------------------------------------------------------------------------------------------------------
* thresholdDays the number of days before uptime is not acceptable.
------------------------------------------------------------------------------------------------------------*/
thresholdDays = 183;
/*------------------------------------------------------------------------------------------------------------
* function: getEmptyReport
*
* Parameters: device: Device
*
* Response: dict
*
* {
* deviceName: string
* years: string
* days: string
* hours: string
* violation: bool
* violationReason: string
* }
*
* the default response is just the device name with the violation set as true, as this would represent
* any situation where the information is not available
------------------------------------------------------------------------------------------------------------*/
getEmptyReport(device) =
  min(foreach x in [0]
      select {
        deviceName: device.name,
        years: "",
        days: "",
        hours: "",
        violation: true,
        violationReason: "Information is not available for device."
      });
/*------------------------------------------------------------------------------------------------------------
* function: getUptimeReport
*
* Parameters: device: Device
*
* Response: dict
*
* {
* deviceName: string
* years: string
* days: string
* hours: string
* violation: bool
* violationReason: string
* }
*
* provides a report showing the uptime for a given device, along with information in respect to whether the
* configured threshold is breached
------------------------------------------------------------------------------------------------------------*/
getUptimeReport(device) =
  min(foreach x in [0]
      let seconds = device.system.uptimeSeconds
      let years = seconds / 31536000
      let days = (seconds - years * 31536000) / 86400
      let totalDays = seconds / 86400
      let hours = (seconds - years * 31536000 - days * 86400) / 3600
      let violation = totalDays > thresholdDays
      let violationReason = if violation
                            then "Uptime indicates that software upgrades may not have taken place on device for approximately " +
                                 toString(totalDays) +
                                 " days, more than the configured threshold of " +
                                 toString(thresholdDays) +
                                 " days."
                            else ""
      select {
        deviceName: device.name,
        years: toString(years),
        days: toString(days),
        hours: toString(hours),
        violation: violation,
        violationReason: violationReason
      });
/*------------------------------------------------------------------------------------------------------------
* main script
*
* each item in the list will contain data that matches the fields below
*
* {
* deviceName: string
* years: string
* days: string
* hours: string
* violation: bool
* violationReason: string
* }
*
* combines the two functions for getUptimeReport for devices that contain uptime information along with the
* output of getEmptyReport where uptime information is not available
------------------------------------------------------------------------------------------------------------*/
foreach device in network.devices
where device.platform.deviceType not in
      [DeviceType.INTRANET,
       DeviceType.INTERNET,
       DeviceType.L2_VPN,
       DeviceType.L3_VPN,
       DeviceType.WAN_CIRCUIT,
       DeviceType.MISSING_PEER
      ]
let report = if isPresent(device.system.uptimeSeconds)
             then getUptimeReport(device)
             else getEmptyReport(device)
select {
  deviceName: report.deviceName,
  years: report.years,
  days: report.days,
  hours: report.hours,
  violation: report.violation,
  violationReason: report.violationReason
}
Caveats
This works for many device types, but I noticed it doesn’t work correctly on F5 LTMs and Check Points. I’ve raised support cases for these, as the data is either incorrect in the data model or unavailable.
An uptime below 6 months doesn’t mean upgrades have occurred; indeed, a device that reboots daily will always have a low uptime. Other checks are needed to measure how long it has been since an upgrade occurred.
Techniques
My usual approach is to combine a default report, as generated by the function getEmptyReport, with a report that contains output where available. This way we obtain a report row for every device, whether or not the data is available. The default report is treated as a violation.
I filtered out some device types, such as intranet and internet, since these would generate false positives in this output.
Maths - I have approximated the number of seconds in a year as 31,536,000 (365 days), which ignores leap years.
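To sanity-check the arithmetic and the default-report fallback outside the platform, here is a minimal Python sketch of the same logic. It is not NQE. The function name uptime_report and the use of None for a missing uptime are my own assumptions for illustration, and it assumes the script's divisions are integer divisions, which the year/day/hour breakdown relies on.

```python
from typing import Optional

SECONDS_PER_YEAR = 31_536_000  # 365 days; leap years ignored, same approximation as the script
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600
THRESHOLD_DAYS = 183  # roughly 6 months


def uptime_report(device_name: str, uptime_seconds: Optional[int]) -> dict:
    """Hypothetical helper: build one report row for a device.

    A missing uptime (None) yields the default row, which counts as a violation.
    """
    if uptime_seconds is None:
        return {
            "deviceName": device_name,
            "years": "",
            "days": "",
            "hours": "",
            "violation": True,
            "violationReason": "Information is not available for device.",
        }

    # Split the uptime into whole years, then whole days, then whole hours.
    years, remainder = divmod(uptime_seconds, SECONDS_PER_YEAR)
    days, remainder = divmod(remainder, SECONDS_PER_DAY)
    hours = remainder // SECONDS_PER_HOUR

    # The threshold is compared against total days, not the days-within-year figure.
    total_days = uptime_seconds // SECONDS_PER_DAY
    violation = total_days > THRESHOLD_DAYS
    reason = (
        f"Uptime of approximately {total_days} days is more than the "
        f"configured threshold of {THRESHOLD_DAYS} days."
        if violation
        else ""
    )
    return {
        "deviceName": device_name,
        "years": str(years),
        "days": str(days),
        "hours": str(hours),
        "violation": violation,
        "violationReason": reason,
    }
```

For example, uptime_report("sw1", 100_000_000) gives 3 years, 62 days and 9 hours (1157 days in total), which breaches the threshold. Exactly 183 days of uptime does not, because the comparison is strictly greater than.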
Future
Collecting this with a timestamp and complementing the output with the current code level/patch level for a device would allow this information to be used to review code upgrades against the length of time spent on each code level.
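As a rough illustration of that idea, here is a Python sketch that works out how long a device has been observed on its current code level from a series of timestamped collections. The function name days_on_current_version, the input shape and the version strings are all hypothetical; how the timestamps and versions would actually be exported is not covered here.

```python
from datetime import datetime
from typing import List, Tuple


def days_on_current_version(
    observations: List[Tuple[datetime, str]]
) -> Tuple[str, int]:
    """Hypothetical helper: given (timestamp, version) pairs for one device,
    return the latest version and the number of days it has been observed.

    The result is a lower bound: the upgrade may have happened before the
    first collection that recorded the current version.
    """
    if not observations:
        raise ValueError("at least one observation is required")

    ordered = sorted(observations)  # oldest first
    latest_time, current_version = ordered[-1]

    # Walk backwards until the version changes.
    first_seen = latest_time
    for timestamp, version in reversed(ordered):
        if version != current_version:
            break
        first_seen = timestamp

    return current_version, (latest_time - first_seen).days


# Illustrative data only.
history = [
    (datetime(2023, 1, 1), "15.1"),
    (datetime(2023, 6, 1), "15.2"),
    (datetime(2023, 9, 1), "15.2"),
    (datetime(2024, 3, 1), "15.2"),
]
version, days = days_on_current_version(history)
```

Unlike raw uptime, a figure like this is not reset by an ordinary reboot, which addresses the caveat about devices that restart frequently.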
Hope people find this useful. Interested to hear other people’s ideas in this space.