Question on Redshift Troubleshooting

Tagged: Architect, aws, Certified, Professional, Solutions


Jon-Bonso updated 3 years, 11 months ago 2 Members · 4 Posts
AWS Certified Solutions Architect Professional
varun-mathur

Member
May 8, 2020 at 5:53 pm

The answer to the following question lists reducing the MTU size as a valid troubleshooting step. However, the question states that the problem appears after a few days of QA testing. That points to memory, tablespace, or locking issues. An MTU problem, on the other hand, should have been evident from the beginning, so I think this option needs to be reviewed again.

The question:

You are working as an IT Consultant for a FinTech startup based in Bonifacio Global City where you are tasked to properly set up an online analytical processing (OLAP) application. You have also launched and configured all of the required AWS resources such as EC2, Security Groups, Redshift WLM Queues, S3, and IAM Roles. The development team has completed their coding and deployed the new application in AWS. However, after a few days, the QA team noticed that queries stop responding at all in Redshift.

In this scenario, which of the following can you do to solve this performance issue? (Choose 3)
Jon-Bonso

Administrator
May 8, 2020 at 6:09 pm

Hi Varun,

Not quite. An MTU problem is not always evident from the beginning. It depends on various factors.

The scenario basically asks how you can troubleshoot the issue in your Redshift cluster. All three of the answers here are based on the official AWS documentation:

https://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#queries-troubleshooting-query-hangs

As for the maximum transmission unit (MTU) size, this issue is caused by packet drop, which happens when the MTU size differs along the network path between two Internet Protocol (IP) hosts.

In my opinion, it depends on the packet size. If the requests sent during those first few days only have a small packet size, then there will not be a problem and the Redshift cluster will work as expected. However, the problem will surface if a host sends a packet that is bigger than the MTU of the instance, as per the official AWS documentation:

If a host sends a packet that’s larger than the MTU of the receiving host or that’s larger than the MTU of a device along the path, the receiving host or device returns the following ICMP message: Destination Unreachable: Fragmentation Needed and Don’t Fragment was Set (Type 3, Code 4). This instructs the original host to adjust the MTU until the packet can be transmitted.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html#path_mtu_discovery
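
To make the quoted behaviour concrete, here is a minimal Python sketch of Path MTU Discovery. It is only a simulation under stated assumptions, not how any real network stack is implemented: the function names and the list of per-hop MTUs are hypothetical, and it assumes the ICMP "Fragmentation Needed" message always gets back to the sender.

```python
from typing import List, Optional, Tuple


def send_with_df(packet_size: int, hop_mtus: List[int]) -> Tuple[bool, Optional[int]]:
    """Simulate one packet with Don't Fragment set crossing a path.

    Returns (delivered, mtu_of_blocking_hop). The blocking hop's MTU stands in
    for the value reported by the ICMP Type 3, Code 4 message.
    """
    for mtu in hop_mtus:
        if packet_size > mtu:
            return False, mtu  # dropped here; sender is told this hop's MTU
    return True, None


def discover_path_mtu(initial_size: int, hop_mtus: List[int]) -> int:
    """Shrink the packet size until it fits through every hop."""
    size = initial_size
    while True:
        delivered, blocking_mtu = send_with_df(size, hop_mtus)
        if delivered:
            return size
        size = blocking_mtu  # always smaller than size, so the loop ends


if __name__ == "__main__":
    # Hypothetical path: jumbo-frame sender, a 1500-byte hop in the middle.
    path = [9001, 1500, 9001]
    print(send_with_df(500, path))        # small packets pass untouched
    print(send_with_df(9001, path))       # large packets are rejected
    print(discover_path_mtu(9001, path))  # size the sender settles on
```

Small packets never hit the limit, which is why light traffic can run for days without any symptom. If the ICMP message is filtered somewhere along the path, the sender never learns to shrink its packets and the large ones are silently dropped.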

Hence, the option that says: “Reduce the size of maximum transmission unit (MTU).” is still valid for this scenario.

The MTU size determines the maximum size, in bytes, of a packet that can be transferred in one Ethernet frame over your network connection. If the packets are small during those first few days (perhaps because the application is only used for initial testing), then it is possible that everything will still work. But once the application is used fully, the packets sent by the host could double in size, which triggers this issue.

This issue is also mentioned here: https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-drop-issues.html

Take note of the word “Sometimes”:

Queries Appear to Hang and Sometimes Fail to Reach the Cluster

You experience an issue with queries completing, where the queries appear to be running but hang in the SQL client tool. Sometimes the queries fail to appear in the cluster, such as in system tables or the Amazon Redshift console.

Possible solution:

This issue can happen due to packet drop, when there is a difference in the maximum transmission unit (MTU) size in the network path between two Internet Protocol (IP) hosts. The MTU size determines the maximum size, in bytes, of a packet that can be transferred in one Ethernet frame over a network connection. In AWS, some Amazon EC2 instance types support an MTU of 1500 (Ethernet v2 frames) and other instance types support an MTU of 9001 (TCP/IP jumbo frames).
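
As a rough illustration of what the two MTU values mean for a query, the Python sketch below works out how much TCP payload fits in one packet and how many packets a given amount of data needs. It assumes plain IPv4 and TCP headers of 20 bytes each with no options; the helper names and the sample payload sizes are hypothetical.

```python
import math

IP_HEADER_BYTES = 20   # IPv4 header, assuming no options
TCP_HEADER_BYTES = 20  # TCP header, assuming no options


def max_tcp_payload(mtu: int) -> int:
    """Largest TCP payload that fits in a single packet for this MTU."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES


def segments_needed(payload_bytes: int, mtu: int) -> int:
    """Number of full-size packets needed to carry the payload."""
    return math.ceil(payload_bytes / max_tcp_payload(mtu))


if __name__ == "__main__":
    for mtu in (1500, 9001):
        print(
            f"MTU {mtu}: payload per packet = {max_tcp_payload(mtu)} bytes, "
            f"packets for a 200-byte query = {segments_needed(200, mtu)}, "
            f"packets for a 1,000,000-byte load = {segments_needed(1_000_000, mtu)}"
        )
```

A short SELECT fits in a single small packet under either MTU, so it is unaffected by a mismatch. A bulk load fills every packet up to the sender's MTU, and a jumbo-frame sender will then emit packets that a 1500-byte hop cannot carry.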

The official AWS documentation supports the provided answer, but if you are not fully convinced, feel free to try this on your own AWS account:

– Launch a single-node Redshift (dc2.large) cluster, then have your SQL client tool send and receive small packets by running a simple SELECT statement or similar. Then, conversely, run a complex INSERT or a COPY command that loads data into a table from a data file. That will drive up the packet size, and your request may fail with the following ICMP message:

Destination Unreachable: Fragmentation Needed and Don’t Fragment was Set (Type 3, Code 4).

This message instructs the originating host to use the lowest MTU size along the network path to resend the request. Without this negotiation, packet drop can occur because the request is too large for the receiving host to accept.
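
If you want to check this from a Linux client, the Python sketch below builds two commands without running them: a ping probe with the Don't Fragment flag set, and the `ip link` command that lowers an interface's MTU. It assumes the iputils `ping` (`-M do`, `-s`, `-c`) and the iproute2 `ip` tool; the host name and the interface name `eth0` are placeholders, and the target must be a host that actually answers ICMP echo.

```python
from typing import List

ICMP_OVERHEAD_BYTES = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_probe_command(host: str, mtu: int, count: int = 3) -> List[str]:
    """Build a ping that only succeeds if packets of `mtu` bytes fit the path."""
    payload = mtu - ICMP_OVERHEAD_BYTES
    # -M do sets Don't Fragment, -s sets the ICMP payload size in bytes.
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]


def set_mtu_command(interface: str, mtu: int) -> List[str]:
    """Build the command that sets the MTU on a network interface."""
    return ["sudo", "ip", "link", "set", "dev", interface, "mtu", str(mtu)]


if __name__ == "__main__":
    # Placeholder host and interface; replace with your own before running.
    print(" ".join(ping_probe_command("target.example.com", 9001)))
    print(" ".join(ping_probe_command("target.example.com", 1500)))
    print(" ".join(set_mtu_command("eth0", 1500)))
```

If the 9001-byte probe fails while the 1500-byte probe succeeds, something on the path cannot carry jumbo frames, and lowering the client's MTU to 1500 is the fix that the answer option describes.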

IMPORTANT REMINDER: A Redshift cluster is quite expensive so please terminate it once you are through testing.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html

I’ve developed and released a lot of enterprise applications for various clients. The incoming traffic on these apps varies widely, and most of the time they reach peak load only after a few days. Multinational investment banks and other large enterprises usually deploy their applications on a weekend so that it won’t affect their BAU operations. Say you deploy on a Saturday and do some smoke testing on a Sunday. The issue might only be discovered on Monday, when all of the users are actively using the application.

Regards,

Jon Bonso
varun-mathur

Member
May 9, 2020 at 12:57 pm

Hi Jon,

Thanks a bunch for attending to this. While I am with you on the ‘Sometimes’ bit, I am still debating this part of the question: ‘The development team has completed their coding and deployed the new application in AWS. However, after a few days, the QA team noticed that queries stop responding at all in Redshift.’

I could understand an MTU problem surfacing after some time if you were only running small-packet network checks earlier, such as ping or traceroute.

Are TCP packet sizes likely to change at this stage?

With Regards.
Jon-Bonso

Administrator
May 9, 2020 at 5:06 pm

Hi Varun,

Yes, the TCP packet size could change depending on various situations. The main idea behind this scenario is to properly troubleshoot the Redshift issue. In the scenario, it says that the development team completed their coding and deployed their new application in AWS. It never said that they have done load testing for their application.

One key phrase here is that “queries stop responding at all in Redshift”, which points to an issue in the database/data warehouse tier and not in the web tier. Therefore, the issue here is not about simple pings or traces to the web server, but about the statements sent to Redshift. During the first few days that the application was running, the Redshift cluster only received simple SELECT statements, each with a small packet size. As the load spikes, Redshift receives more and more complex, packet-heavy SELECT statements and INSERT commands. If the MTU is not properly configured, those queries could hang or be dropped.

Just as mentioned above, you can test this out by launching a single-node Redshift (dc2.large) cluster, then having your SQL client tool send and receive small packets by running a simple SELECT statement or similar. Then, conversely, run a complex INSERT or a COPY command that loads a large amount of data into a table from a data file. That will drive up the packet size, and your request may fail. This is the same thing that is happening in the scenario.

Regards,

Jon Bonso
