Member - January 3, 2024 at 2:51 am
I would like to discuss the answer to the following question:
A media company processes and converts its video collection using the AWS Cloud. The videos are processed by an Auto Scaling group of Amazon EC2 instances which scales based on the number of videos on the Amazon Simple Queue Service (SQS) queue. Each video takes about 20-40 minutes to be processed.
To ensure videos are processed, management has configured a redrive policy on the SQS queue with a dead-letter queue. The visibility timeout has been set to 1 hour and the maxReceiveCount has been set to 1. When there are messages on the dead-letter queue, an Amazon CloudWatch alarm notifies the development team.
Within a few days of operation, the dead-letter queue received several videos that failed to process. The development team received notifications of messages on the dead-letter queue but did not find any operational errors in the application logs.
Which of the following options should the solutions architect implement to help solve the above problem?
- Reconfigure the SQS redrive policy and set maxReceiveCount to 10. This will allow the consumers to retry the messages before sending them to the dead-letter queue.
- Configure a higher delivery delay setting on the Amazon SQS queue. This will give the consumers more time to pick up the messages on the SQS queue.
- Some of the videos took longer than 1 hour to process. Update the visibility timeout for the Amazon SQS queue to 2 hours to solve this problem.
- The videos were not processed because the Amazon EC2 scale-up process takes too long. Set a minimum number of EC2 instances on the Auto Scaling group to solve this.
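For reference, the configuration described in the scenario could be sketched with boto3's set_queue_attributes. This is a minimal illustration only; the DLQ ARN and queue URL are hypothetical, and the boto3 call itself is left commented out:

```python
import json

# Hypothetical ARN for illustration only.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:video-dlq"

# Attributes matching the scenario: a 1-hour visibility timeout and a
# redrive policy that moves a message to the DLQ after a single receive.
attributes = {
    "VisibilityTimeout": "3600",  # seconds; SQS attribute values are strings
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": DLQ_ARN,
        "maxReceiveCount": "1",   # one receive, then straight to the DLQ
    }),
}

# With boto3 this would be applied as (not executed here):
# import boto3
# sqs = boto3.client("sqs")
# sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```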
The correct answer is set to number 1. But it doesn’t really make sense to me, especially given this piece of information:
The development team received notifications of messages on the dead-letter queue but did not find any operational errors in the application logs.
If it was indeed because one instance failed to process the video, we would be able to see an error. In that case I believe answer 3 is more appropriate and seems more likely. If there is no error, then the video was indeed eventually processed, meaning it took longer than 1 hour. Allowing another instance to re-process it in this case would be wrong, as the video would be processed twice, or even more often if processing again takes more than an hour.
I’m interested to hear your comments on this.
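Whichever answer is correct, one common way to handle jobs whose duration is hard to predict is a visibility-timeout "heartbeat": the consumer periodically calls ChangeMessageVisibility while it is still working, so the message never becomes visible to another consumer. A minimal sketch, with the SQS client injected as a parameter so the idea can be demonstrated with a stub instead of a live queue (the function and stub names are hypothetical):

```python
def process_with_heartbeat(sqs, queue_url, receipt_handle, work_steps,
                           extend_secs=600):
    """Run each step of the job; before each step, extend the message's
    visibility timeout so it stays invisible to other consumers while
    processing continues, then delete the message on success."""
    for step in work_steps:
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=extend_secs,
        )
        step()
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)

# Stub client that just records the calls, for demonstration:
class StubSQS:
    def __init__(self):
        self.calls = []
    def change_message_visibility(self, **kw):
        self.calls.append(("extend", kw["VisibilityTimeout"]))
    def delete_message(self, **kw):
        self.calls.append(("delete", None))

stub = StubSQS()
process_with_heartbeat(stub, "queue-url", "receipt-1", [lambda: None] * 3)
# The visibility was extended once per step, then the message was deleted.
```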
Administrator - January 15, 2024 at 5:36 pm
Thank you for your thoughtful analysis, and we completely understand your concern regarding the absence of errors in the application logs.
Aside from some video processes running over 1 hour, other situations can cause errors not to show up in the application logs. For instance, suppose the application uses the AWS SDK to call the ReceiveMessage SQS API. Behind the scenes, the SDK may retry failed API calls several times until one succeeds. These retries won’t be recorded in the logs, but each one counts toward the maxReceiveCount. This, in turn, causes SQS to move the message to the DLQ without corresponding log entries for operational errors.
With the way the scenario is currently worded, options 1 and 3 are both plausible solutions. That said, we understand that the conditions in the scenario need further clarification. We’ll review this question and make the necessary adjustments.
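The counting behavior described above can be illustrated with a small simulation. This is a toy model of how SQS compares a message's receive count against maxReceiveCount, not real SDK code; "deliveries" stands for every time the message was handed out, including deliveries the application never logged because processing outlived the visibility timeout:

```python
def simulate_receives(deliveries, max_receive_count):
    """Toy model of SQS redrive: each delivery increments the message's
    receive count; once the count exceeds max_receive_count, the redrive
    policy moves the message to the dead-letter queue."""
    receive_count = 0
    for _ in range(deliveries):
        receive_count += 1
        if receive_count > max_receive_count:
            return "dlq"
    return "in-flight"

# With maxReceiveCount = 1, the second delivery already lands in the DLQ,
# even though the application may never have logged an error:
print(simulate_receives(2, 1))   # dlq
# With maxReceiveCount = 10, the same message survives transient redeliveries:
print(simulate_receives(2, 10))  # in-flight
```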
Let me know if this helps.
Neil @ Tutorials Dojo