Hi,

Getting back to you with more details about the problem on the 21st of June.

Between approximately 2023-06-21T02:10:00.406+03:00 and 2023-06-21T10:38:26.406+03:00, most calls to the Sales Order Service API returned a 500 error. We restarted the service's cluster, after which calls went back to normal.

While investigating the cause, we noticed that the calls our service makes to another internal service, to retrieve information about a tenant, were failing with an Unauthorized exception. According to the logs, the Authorization header was not included in those calls. Restarting the cluster fixed the problem, but we have created a task to investigate further why the Authorization header was missing, and another to make the code more resilient when calling this internal service.

Another action point for us is to improve the monitoring. The current monitoring detected the errors, but we had no alarms set for a high error rate. We will implement a warning when the error rate reaches 20% and a critical alarm for Central Operations when it reaches 50%. This will allow us to respond faster if something similar happens again.
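
To illustrate the two alert thresholds mentioned above, here is a minimal sketch in Python. It is not the actual monitoring configuration: the names `AlertLevel` and `classify_error_rate` are hypothetical, and it assumes the error rate is computed as failed calls divided by total calls over some monitoring window.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


# Thresholds taken from the post: warning at 20%, critical at 50%.
WARNING_THRESHOLD = 0.20
CRITICAL_THRESHOLD = 0.50


def classify_error_rate(failed_calls: int, total_calls: int) -> AlertLevel:
    """Map the error rate of one monitoring window to an alert level.

    Hypothetical helper: the window size and the way calls are counted
    are assumptions, not details from the real monitoring setup.
    """
    if total_calls <= 0:
        # No traffic in the window, so there is no error rate to alert on.
        return AlertLevel.OK
    rate = failed_calls / total_calls
    # Check the higher threshold first so a 50%+ rate is never
    # reported as a mere warning.
    if rate >= CRITICAL_THRESHOLD:
        return AlertLevel.CRITICAL
    if rate >= WARNING_THRESHOLD:
        return AlertLevel.WARNING
    return AlertLevel.OK
```

With these assumptions, a window with 20 failed calls out of 100 would raise the warning, and one with 50 failed calls out of 100 would raise the critical alarm.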