AWS Cloud Experienced Questions & Answers

Mayank C Koli
6 min read · Jan 20, 2022

Sharing some questions and answers that are commonly asked in interviews and that may help candidates appearing for experienced AWS cloud positions.

Q: How can you make your application scalable for a big traffic day?
A: A basic answer is to put the VMs in an Auto Scaling group (ASG) behind a load balancer (LB). A better answer follows current guidance and best practice: pre-warm the LBs and switch the ASG to scheduled scaling, so that capacity is already in place when traffic grows and the application responds quickly. Use lightweight AMIs so that EC2 instances launch fast, and use RDS Proxy, which pools database connections and shares them across client requests. We can also engage AWS Infrastructure Event Management (IEM) before the big traffic day to verify that the architecture can handle the expected load and to plan the extra EC2 and database capacity.

If we are going with Kubernetes (K8s) or microservices, we should also understand the scaling mechanism, connection concurrency, AWS API Gateway, and how container images are handled.
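The scheduled scaling idea from the answer above can be sketched as follows. This is a minimal sketch, not a definitive setup: the group name, dates, and sizes are hypothetical, and the block only builds the parameters, so it runs without AWS credentials. The parameter names follow the boto3 `put_scheduled_update_group_action` call.

```python
from datetime import datetime, timedelta, timezone


def scheduled_scale_out(asg_name, event_start, lead_minutes, min_size, max_size, desired):
    """Build keyword arguments for a scheduled scale-out ahead of a known traffic event."""
    if not (min_size <= desired <= max_size):
        raise ValueError("desired capacity must lie between min_size and max_size")
    return {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": f"{asg_name}-scale-out",
        # Start before the event so instances are in service when traffic arrives.
        "StartTime": event_start - timedelta(minutes=lead_minutes),
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }


# Hypothetical event start, ASG name, and capacity numbers.
event = datetime(2030, 11, 29, 9, 0, tzinfo=timezone.utc)
params = scheduled_scale_out("web-asg", event, lead_minutes=30,
                             min_size=10, max_size=40, desired=20)

# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("autoscaling").put_scheduled_update_group_action(**params)
```

The lead time is the part worth discussing in an interview: it should cover instance launch time plus application warm-up, which is also why lightweight AMIs matter.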

Q: How do you achieve DR for your cloud application?
A: A basic answer is "I will replicate everything to another region." A good answer is that there are different options to choose from, depending on the RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

Recovery Time Objective (RTO) is defined by the organization. RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
Recovery Point Objective (RPO) is defined by the organization. RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
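To make the two definitions concrete, here is a small worked example. It is only an illustration: the timestamps and the one-hour objectives are made-up values, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def measure_recovery(last_recovery_point, outage_start, service_restored):
    """Return (data_loss_window, downtime) for one incident."""
    data_loss_window = outage_start - last_recovery_point  # compared against RPO
    downtime = service_restored - outage_start              # compared against RTO
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True only if both the data loss window and the downtime are within the objectives."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: last backup at 02:00, outage at 02:45, restored at 04:15.
loss, down = measure_recovery(
    last_recovery_point=datetime(2022, 1, 20, 2, 0),
    outage_start=datetime(2022, 1, 20, 2, 45),
    service_restored=datetime(2022, 1, 20, 4, 15),
)
ok = meets_objectives(loss, down, rpo=timedelta(hours=1), rto=timedelta(hours=1))
```

Here the data loss window is 45 minutes, which is within the one-hour RPO, but the downtime is 90 minutes, which misses the one-hour RTO, so `ok` is `False`.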

There are four ways to implement DR:
1. Backup & Restore: RTO/RPO in hours. Lower-priority use cases.
2. Pilot Light: RTO/RPO in tens of minutes. Core services; start and scale resources after the event.
3. Warm Standby: RTO/RPO in minutes. Business-critical services; scale resources after the event.
4. Multi-site Active/Active: RTO/RPO in real time. Zero downtime, near-zero data loss. Mission-critical services.
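The choice among the four options can be sketched as a simple lookup. Treat this as an illustration only: the thresholds below are assumptions picked to match the rough orders of magnitude in the list, not AWS-published limits, and a real decision also weighs cost and operational complexity.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive. Thresholds are illustrative assumptions.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=10)),
    ("Warm Standby", timedelta(minutes=1)),
    ("Multi-site Active/Active", timedelta(0)),
]


def choose_dr_strategy(rto, rpo):
    """Pick the cheapest strategy whose assumed best-case recovery fits the tighter objective."""
    tightest = min(rto, rpo)
    for name, best_case in STRATEGIES:
        if tightest >= best_case:
            return name
    return STRATEGIES[-1][0]


reporting = choose_dr_strategy(rto=timedelta(hours=4), rpo=timedelta(hours=24))
core = choose_dr_strategy(rto=timedelta(minutes=30), rpo=timedelta(minutes=15))
payments = choose_dr_strategy(rto=timedelta(seconds=30), rpo=timedelta(seconds=5))
```

With these assumed thresholds, the reporting workload lands on Backup & Restore, the core service on Pilot Light, and the payments workload on Multi-site Active/Active.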

Q: How do you secure your application on the cloud?
A: An average answer would be "use KMS, IAM and a firewall for security."
- Explain what each service does rather than just listing service names.
- Pick one application, such as a 3-tier app on EC2, microservices running on Kubernetes, or a serverless app, and explain its security in detail.

For example, suppose my application is serverless: all the REST APIs are handled by API Gateway, the backend runs on Lambda, and it talks to several databases that are kept in sync. I would handle login authentication and authorization with Amazon Cognito. Each Lambda function gets an IAM policy that only lets it talk to the databases it needs, with read and write operations only. CloudWatch and CloudTrail are enabled to catch suspicious requests so that we can take appropriate action on the client request. We can also put AWS WAF in front of API Gateway to inspect incoming requests and let only filtered requests through.
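The "read and write operations only" policy for Lambda might look like the sketch below. The answer above does not name a database, so DynamoDB is an assumption here, and the account ID and table name are placeholders.

```python
import json


def lambda_db_policy(table_arn):
    """Build an IAM policy document that allows only item-level reads and writes on one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # No table-level actions such as DeleteTable or CreateTable.
                "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:Query",
                    "dynamodb:PutItem",
                    "dynamodb:UpdateItem",
                ],
                # Scoped to a single table instead of "*".
                "Resource": table_arn,
            }
        ],
    }


# Hypothetical account ID and table name.
policy = lambda_db_policy("arn:aws:dynamodb:us-east-1:123456789012:table/orders")
print(json.dumps(policy, indent=2))
```

The point to make in the interview is least privilege: the function can read and write items in one table and nothing else.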

Another example is Kubernetes, where applications are deployed as pods in clusters that are generally accessed only by admins or power users. The pods read secrets, keys, or certificates from AWS Secrets Manager or AWS KMS, which are managed by AWS. Only the permitted IAM role's ARN is associated with the pods through a service account, so that only those pods can talk to the secret services. We can use separate namespaces to divide the cluster, with a resource quota enabled for each namespace, which matters most for multi-tenant clusters. Use network policies to control pod traffic. Implement RBAC for admins, developers, testers, and so on. Do not run containers as root or in privileged mode. Use OPA to enforce restrictions, for example images only from approved registries and namespaces with the correct labels.
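The kind of rule OPA would enforce can be illustrated in plain Python. This is a stand-in for illustration only: real OPA policies are written in Rego and run as an admission controller, the registry name is hypothetical, and the check below looks only at container-level `securityContext`.

```python
APPROVED_REGISTRIES = ("registry.example.com/",)  # hypothetical registry


def pod_violations(pod):
    """Return a list of rule violations for a pod manifest given as a dict."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        if not image.startswith(APPROVED_REGISTRIES):
            violations.append(f"{name}: image '{image}' is not from an approved registry")
        security = container.get("securityContext", {})
        if security.get("privileged", False):
            violations.append(f"{name}: privileged mode is not allowed")
        if not security.get("runAsNonRoot", False):
            violations.append(f"{name}: runAsNonRoot must be true")
    return violations


good_pod = {"spec": {"containers": [{
    "name": "api",
    "image": "registry.example.com/api:1.4",
    "securityContext": {"runAsNonRoot": True},
}]}}
bad_pod = {"spec": {"containers": [{
    "name": "debug",
    "image": "docker.io/library/busybox",
    "securityContext": {"privileged": True},
}]}}

print(pod_violations(good_pod))
print(pod_violations(bad_pod))
```

The first pod passes with no violations, while the second is flagged three times: unapproved registry, privileged mode, and missing `runAsNonRoot`.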

Q: Describe an architecture you designed.
A: General tips: pick one design, for example microservices behind a third-party API gateway with Lambda, and be ready to describe everything about it so that you can answer any question on the architecture. Be ready to discuss cost as well. Cost-optimizing applications such as those on EKS managed node groups is challenging, but there are ways to do it: AWS Cost Explorer, third-party tools such as Kubecost or CloudHealth by VMware, and CloudWatch Container Insights for more visibility into node and resource consumption. We can even replace nodes with AWS Spot Instances.

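Q: What is the difference between SQL and NoSQL databases?
A: SQL databases hold structured data with a fixed schema, and NoSQL databases hold semi-structured or unstructured data with a flexible schema. In SQL you define indexes and run queries against related tables. SQL is a good fit for banking and other transactional systems, and NoSQL is a good fit for workloads such as logging.
Go over the basic properties, ACID vs. CAP, and the different scaling behavior (SQL databases usually scale vertically, NoSQL databases usually scale horizontally).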
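A short demonstration can back up the banking vs. logging contrast. The sketch below uses Python's built-in `sqlite3` for the SQL side and plain JSON documents as a stand-in for a NoSQL store; the table, account names, and log fields are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))


overdraft_ok = transfer(conn, "alice", "bob", 500)  # violates CHECK, so it rolls back
after_overdraft = balances(conn)
valid_ok = transfer(conn, "alice", "bob", 30)
after_valid = balances(conn)

# NoSQL-style log documents: each record can carry different fields.
log_events = [
    json.dumps({"level": "INFO", "msg": "login", "user": "alice"}),
    json.dumps({"level": "ERROR", "msg": "timeout", "service": "otp", "retries": 3}),
]
```

The failed transfer leaves both balances untouched even though the credit to `bob` ran first, which is the atomicity a banking system needs. The log documents, by contrast, need no shared schema.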

Q: What is Cloud Computing?
A: Cloud computing is the on-demand delivery of IT resources over the internet with pay-as-you-go pricing. Instead of buying, owning and maintaining physical data centers and servers, you can access technology services such as computing power, storage and databases on an as-needed basis from a cloud provider like AWS.

STAR (Situation/Task/Action/Result) and SBI (Situation/Behavior/Impact)

Q: Tell me about a time when you were faced with a problem that had a number of possible solutions. What was the problem and how did you determine the course of action? What was the outcome of that choice?

Amazon Leadership Principles:

Situation: We had 20 microservices running on-prem on PCF. The PCF license needed to be renewed in 6 months. Leadership wanted the project migrated to AWS before then to save cost and increase agility.

Task: As the lead architect/DevOps engineer, I was tasked with finding a suitable AWS solution within the given timeframe.

Action: I researched possible ways to run microservices on AWS and narrowed them down to 3 options: run each microservice on vanilla EC2, run them on Kubernetes using EKS, or go serverless. I took one of the microservices and did a POC on vanilla EC2, on EKS, and on Lambda with API Gateway. While they all did the job, I found that with EC2 I would have to make it HA myself by spinning up multiple EC2 instances in multiple AZs, and there was the overhead of AMI rehydration.
EKS seemed to be a valid solution. However, given our traffic patterns, we would have paid more than necessary. There was also the overhead of training the team on Kubernetes.
Lambda with API Gateway is inherently HA and scalable, we pay only for what we use, and there are no servers to manage at all. This reduced our day 2 operational overhead and let us focus on delivering business value.

Result: Based on the POC data on performance, cost and time to deploy, I selected the serverless solution. We converted the rest of the microservices to Lambda and went to production within 3 months. It resulted in over 90% cost savings compared with EC2 and Kubernetes. I shared my project learnings with other teams and showed them how to write Lambda functions so they could use it as well. I was recognized by the CIO for this effort.

Another example in the STAR or SBI style:

Situation: I was on shift when I suddenly got an alert that the OTP service had gone down. For reference, the OTP service was used to generate authentication codes for customers.

Task: I identified the issue by checking the application logs and the failing API calls: the application was running but could not generate the MPIN/OTP, which led to an outage in which customers could not verify themselves. The business impact on the application was 100%.

Action: As the SPOC, I first went to ServiceNow and raised a Severity 1 incident. The ServiceNow team opened a bridge call with the owner, the dev team, the network team, the DBA team and me. I explained the actual problem and that it was causing a major outage. The solution I proposed was to route the traffic to the DR site in South Bend (Indiana), which was on standby, while we sorted out the issue at the primary site in Chicago (Illinois), raised a hotfix and deployed it to production. If that still did not work, we would roll the application back to the previous stable version and raise a bugfix.
Everyone agreed that this was a sound solution.

Result: The issue was fixed. We redeployed the application after full testing and moved the traffic back to the primary site.

Learnings:
1. Are we able to identify the issue promptly? If yes, good.
2. If not, identify the gaps, such as insufficient monitoring or logging.
3. Work out how to close those gaps so that we can identify issues at an early stage.

Blameless Postmortem:
1. This is something we might have missed during sign-offs, and we should have documented everything.
2. We need to put capacity planning in place.
3. Establish why the outage happened and how we can minimize the chance of it in future.

