For most of the straightforward monitoring we need on our Sitecore instances, we can use the Application Insights service. It is installed with the default Sitecore ARM templates, and you should have it even if you did a custom ARM template installation.
Application Insights is an extensible Application Performance Management (APM) service. It automatically detects performance anomalies and includes powerful analytics tools.
Creating a Monitoring Report
So that your team is aware of any issues that occur, you can create a monitoring report. It gives you and any engineers involved a shared view of the issues you need to track in each environment. The steps to create a monitoring report are as follows:
- Go to the desired Resource Group
  - Which one depends on the environment you want to target (most enterprise setups have multiple Sitecore environments installed, for example DEV, QA, PROD etc.)
- In the Azure Resource Group you are targeting, search for “-ai” or, if you didn’t follow a naming convention, filter resources by type “Application Insights”
  - This filters the resources so that only the Application Insights service is shown
- Click on that resource to open the Overview blade
- On the blade toolbar, click “Search”
A new blade opens where you can enter your search parameters. The parameters we need for the monitoring report are as follows:
- Local time: choose the desired time range (NOTE: for monitoring you will in most cases want Last 24 hours)
- Event Types: choose Exceptions
In the list of results, click Grouped Results
Copy this data into your monitoring report. The results will show:
- Total count of errors, as a number and as a percentage
- Common properties of errors (there can be multiple errors that differ only in their IDs but are otherwise the same error)
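The grouping described above can be sketched in a few lines of Python if you export the exceptions and want to build the same summary yourself. This is only an illustration: the record layout and the `problem_id` and `item_id` field names are assumptions for the example, not the actual export format.

```python
from collections import Counter


def summarize_exceptions(exceptions):
    """Group exception records by problem ID.

    Returns (problem_id, count, percent_of_total) rows, most frequent first.
    """
    counts = Counter(e["problem_id"] for e in exceptions)
    total = sum(counts.values())
    return [
        (problem_id, count, round(100.0 * count / total, 1))
        for problem_id, count in counts.most_common()
    ]


# Hypothetical sample data: three errors that differ only in item ID,
# plus one unrelated error.
sample = [
    {"problem_id": "NullReferenceException at Render", "item_id": "a1"},
    {"problem_id": "NullReferenceException at Render", "item_id": "b2"},
    {"problem_id": "NullReferenceException at Render", "item_id": "c3"},
    {"problem_id": "TimeoutException at Search", "item_id": "d4"},
]

for row in summarize_exceptions(sample):
    print(row)
```

Grouping on a shared key rather than on the full record is what collapses errors that differ only by ID into a single line of the report.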
Copy the chart generated in this blade into your monitoring report. The chart will show:
- Error count over time (an engineer can see at which time of day errors occur and draw conclusions from that)
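If you want to reproduce the time-of-day view from exported data, a minimal sketch could bucket the error timestamps by hour. The ISO 8601 timestamp strings used here are an assumption about the export format.

```python
from collections import Counter
from datetime import datetime


def errors_per_hour(timestamps):
    """Return a list of 24 counts, one per hour of the day (index 0 = 00:00-00:59)."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical timestamps: a cluster of errors shortly after 02:00.
sample_timestamps = [
    "2024-01-15T02:05:11",
    "2024-01-15T02:17:43",
    "2024-01-15T02:48:02",
    "2024-01-15T14:30:00",
]

hourly = errors_per_hour(sample_timestamps)
busiest_hour = max(range(24), key=lambda h: hourly[h])
print(busiest_hour, hourly[busiest_hour])
```

A cluster in one hour often lines up with a scheduled job or a publishing window, which is the kind of conclusion the chart is meant to support.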
A monitoring report should consist of:
- errors/exceptions (grouped, with counts and details)
- failed requests (those occurring in high volume)
- any long-term or consistent spikes in any of the services (App Services, databases etc.)
If an engineer wants to monitor a certain environment live, they can do so using Application Insights. The areas to focus on are as follows:
Live Metrics provides the engineer with a near-real-time stream of data (with a delay of a couple of seconds) from the application that Application Insights (AI) is connected to. What we are most interested in is:
- Incoming Request Count
  - A high request count can indicate high load on the Content Delivery services in the resource
  - This can be caused by a high number of users on the site, load testing, DDoS etc.
- Request Failure Rate
  - A high request failure rate is not acceptable under any conditions and indicates application instability
- Live Telemetry
  - For monitoring, you can safely ignore Trace documents. When troubleshooting, you can watch the live exception documents.
For monitoring, you can track and isolate the services that show a high number of requests or failed requests, together with their average CPU value and memory usage.
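As a rough sketch of how such isolation could be automated on sampled figures, the function below flags servers whose failure rate or average CPU exceeds a limit. The field names and the threshold values are hypothetical and should be tuned to your own environment.

```python
def flag_servers(samples, max_failure_rate=0.01, max_cpu=80.0):
    """Return the names of servers exceeding the failure-rate or CPU limit.

    Both limits are illustrative assumptions, not recommended values.
    """
    flagged = []
    for s in samples:
        # Avoid division by zero for servers that received no requests.
        rate = s["failed"] / s["requests"] if s["requests"] else 0.0
        if rate > max_failure_rate or s["cpu_avg"] > max_cpu:
            flagged.append(s["server"])
    return flagged


# Hypothetical per-server samples read off the Live Metrics view.
live_samples = [
    {"server": "cd-1", "requests": 1200, "failed": 3, "cpu_avg": 45.0},
    {"server": "cd-2", "requests": 1100, "failed": 90, "cpu_avg": 50.0},
    {"server": "cm-1", "requests": 40, "failed": 0, "cpu_avg": 92.0},
]

print(flag_servers(live_samples))
```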
The Failures blade shows all the failed operations that AI has captured for the application.
NOTE: Don’t confuse this information with exceptions. The Failures blade shows failed requests/operations on the application.
These failed requests can be caused by:
- an exception (error in the application, bug in code etc.)
- missing content (image, content file etc.)
- a service that is stopped or in the process of being restarted
A high number of failures can indicate a significant problem in the application and should be dealt with.
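When triaging a long list of failed requests, a first-pass classification along the causes listed above can help. The sketch below is a heuristic only: mapping HTTP status codes to causes this way is an assumption for illustration, and the record fields are hypothetical.

```python
from collections import Counter


def likely_cause(result_code, has_exception):
    """Guess the cause of a failed request (heuristic, not definitive)."""
    if has_exception:
        return "exception"
    if result_code == 404:
        return "missing content"
    if result_code in (502, 503):
        return "service stopped or restarting"
    return "other"


# Hypothetical failed-request records.
failed_requests = [
    {"url": "/products", "result_code": 500, "has_exception": True},
    {"url": "/media/logo.png", "result_code": 404, "has_exception": False},
    {"url": "/media/old.pdf", "result_code": 404, "has_exception": False},
    {"url": "/", "result_code": 503, "has_exception": False},
]

cause_counts = Counter(
    likely_cause(r["result_code"], r["has_exception"]) for r in failed_requests
)
print(cause_counts.most_common())
```

The counts only show where to look first; each group still needs to be checked against the details in the blade.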
Database monitoring can be done via a separate Dashboard. In Azure, choose “Dashboard” in the left-hand menu. This opens a window with a couple of pre-configured dashboards to choose from. Here you can create a separate dashboard and add a Database Monitoring block, which gives you information on DTU usage, load, deadlocks etc.
Via this Dashboard an engineer can track and monitor each database of the application (whether or not the database is in a pool). The areas to monitor for databases are:
- CPU Usage
  - High CPU usage over a significant time period should be dealt with, as it indicates “under-the-hood” problems.
  - Temporary spikes in CPU usage are normal under the load of certain operations (Publishing etc.)
- Deadlock Count
  - A high number of deadlocks should be dealt with, as it indicates “under-the-hood” problems.
- Failed Connections Count
  - A high number of failed connections could indicate that the database has dropped out of operation
- DTU Usage
  - High DTU usage over a significant time period should be dealt with, as it indicates “under-the-hood” problems.
  - Temporary spikes in DTU usage are normal under the load of certain operations (Sync, Publishing etc.)
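The distinction between a temporary spike and sustained high usage can be made concrete with a small sketch that looks for the longest run of consecutive samples above a threshold. The 80% threshold and the run length of six samples are assumptions chosen for the example, not recommended values.

```python
def classify_usage(samples, threshold=80.0, sustained_points=6):
    """Classify a series of usage percentages as 'sustained', 'spike' or 'normal'.

    'sustained' means at least `sustained_points` consecutive samples were at
    or above `threshold`; 'spike' means the threshold was reached only briefly.
    """
    longest = run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        longest = max(longest, run)
    if longest >= sustained_points:
        return "sustained"
    if longest > 0:
        return "spike"
    return "normal"


# Hypothetical DTU percentages sampled at a fixed interval.
publishing_spike = [20, 25, 95, 90, 30, 22, 18, 21]
constant_pressure = [85, 88, 91, 87, 90, 93, 89, 86]

print(classify_usage(publishing_spike))
print(classify_usage(constant_pressure))
```

The same check applies to CPU usage; only the sustained case calls for investigation, while a short spike during publishing or sync is expected.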
Especially for the last few points, our first task shouldn’t be to scale any part of the application, either horizontally or vertically, but to investigate the issue or high usage and then either fix the underlying problem or optimize the relevant area of the application. Scaling out or scaling up an App Service should be our last step.