- November 11, 2022
- Posted by: Indium
- Category: Data & Analytics
Databricks is a unified, open platform for all organizational data, built on the lakehouse architecture. It ensures speed, scalability, and reliability by combining the best of data warehouses and data lakes. At its core is the Databricks workspace, which stores all objects, assets, and computational resources, including clusters and jobs.
Over the years, simplifying Databricks deployment on AWS became a persistent customer demand because of the complexity involved. When deploying Databricks on AWS, customers had to switch constantly between consoles, following very detailed documentation. To deploy the workspace, customers had to:
- Configure a virtual private cloud (VPC)
- Set up security groups
- Create a cross-account AWS Identity and Access Management (IAM) role
- Add all AWS services used in the workspace
This could take more than an hour and required a Databricks solutions architect familiar with AWS to guide the process. A sketch of what one of these manual steps looked like appears below.
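As a rough illustration of the manual effort involved, the following is a minimal sketch of creating the cross-account IAM role with boto3. It is not the official procedure; the role name is illustrative, and the Databricks account ID and external ID are placeholders that a real deployment takes from the Databricks account console.

```python
import json
import boto3

iam = boto3.client("iam")

DATABRICKS_ACCOUNT_ID = "<databricks-aws-account-id>"  # placeholder
EXTERNAL_ID = "<your-databricks-external-id>"          # placeholder

# Trust policy allowing Databricks to assume the cross-account role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{DATABRICKS_ACCOUNT_ID}:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

role = iam.create_role(
    RoleName="databricks-cross-account-role",  # illustrative name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Cross-account role assumed by Databricks",
)
print(role["Role"]["Arn"])
```

This is just one of several such steps; the VPC, security groups, and remaining AWS services each required similar console or scripting work.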
To simplify this and enable self-service, Databricks offers Quick Start in collaboration with Amazon Web Services (AWS). Quick Start is an automated reference deployment that uses AWS CloudFormation templates and incorporates AWS best practices to deploy key technologies on AWS.
Incorporating AWS Best Practices
Best Practice #1 – Ready, Steady, Go
Make it easy even for non-technical customers to get Databricks up and running in minutes. With Quick Start, customers sign in to the AWS Management Console, select the CloudFormation template and Region, fill in the required parameter values, and deploy; the workspace is ready within minutes. Quick Start applies to several environments, and the architecture is designed so that customers in any of them can leverage it.
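The same stack launch can also be scripted. Below is a minimal sketch using boto3; the template URL and parameter names are placeholders, not the actual Quick Start values, which come from the Databricks deployment documentation.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

response = cfn.create_stack(
    StackName="databricks-workspace-quickstart",
    TemplateURL="https://<bucket>.s3.amazonaws.com/<quickstart-template>.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "AccountId", "ParameterValue": "<databricks-account-id>"},  # placeholder
        {"ParameterKey": "WorkspaceName", "ParameterValue": "analytics-workspace"},  # placeholder
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM resources
)
print(response["StackId"])

# Block until the stack, and therefore the workspace resources, finishes creating.
cfn.get_waiter("stack_create_complete").wait(StackName="databricks-workspace-quickstart")
```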
Best Practice #2 – Automating Installation
Earlier, deploying Databricks meant installing and configuring several components manually: a slow process prone to errors and rework, with customers relying on lengthy documentation to get it right. Automating the process makes AWS cloud deployments faster and more reliable.
Best Practice #3 – Security from the Word Go
One of the AWS best practices is the focus on security and availability, and when deploying Databricks this focus should be built in from the beginning. Align the deployment with AWS user management so that a one-time IAM setup grants access to the environment with appropriate controls, and supplement this with AWS Security Token Service (AWS STS) to authenticate user requests with temporary, limited-privilege credentials.
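The snippet below is a minimal sketch of requesting such temporary, limited-privilege credentials with AWS STS via boto3; the role ARN is a placeholder.

```python
import boto3

sts = boto3.client("sts")

creds = sts.assume_role(
    RoleArn="arn:aws:iam::<your-account-id>:role/databricks-cross-account-role",  # placeholder
    RoleSessionName="databricks-deployment",
    DurationSeconds=3600,  # credentials expire after one hour
)["Credentials"]

# Use the short-lived credentials for subsequent AWS calls instead of long-lived keys.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(session.client("sts").get_caller_identity()["Arn"])
```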
Best Practice #4 – High Availability
As the environment spans two Availability Zones, the architecture is highly available. Add a Databricks- or customer-managed virtual private cloud (VPC) to the customer’s AWS account and configure it with private subnets and a public subnet, giving customers access to their own virtual network on AWS. In the private subnets, Databricks clusters of Amazon Elastic Compute Cloud (Amazon EC2) instances can be added, along with additional security groups to ensure secure cluster connectivity. In the public subnet, outbound internet access can be provided through a network address translation (NAT) gateway. Use an Amazon Simple Storage Service (Amazon S3) bucket to store objects such as notebook revisions, cluster logs, and job results.
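For illustration, here is a minimal sketch of that layout in boto3, assuming a customer-managed VPC. The CIDR blocks, Availability Zone names, and bucket name are placeholders; in practice the Quick Start creates these resources for you via CloudFormation.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# VPC with two private subnets (for Databricks clusters) in separate AZs and
# one public subnet (for the NAT gateway that provides outbound internet access).
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
private_a = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]
private_b = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1b"
)["Subnet"]["SubnetId"]
public = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.3.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]

# NAT gateway in the public subnet, backed by an Elastic IP.
eip = ec2.allocate_address(Domain="vpc")
nat = ec2.create_nat_gateway(SubnetId=public, AllocationId=eip["AllocationId"])

# Root S3 bucket for workspace objects such as notebook revisions, cluster logs,
# and job results (bucket names must be globally unique).
s3.create_bucket(Bucket="my-databricks-workspace-root-bucket")  # placeholder name
```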
The benefit of following these best practices is that creating and configuring the AWS resources required to deploy and configure a Databricks workspace can be automated easily. Solutions architects no longer need extensive training in the configurations; the process becomes intuitive, and they stay current with the latest product enhancements, security upgrades, and user experience improvements without difficulty.
Since the launch of the Quick Start in September 2020, Databricks deployment on AWS has become much simpler, resulting in:
- Deployment time of about 5 minutes, down from roughly 1 hour
- 95% lower deployment errors
Because it incorporates AWS best practices and is co-developed by AWS and Databricks, the solution meets customers’ need to deploy Databricks on AWS quickly and effectively.
Indium – Combining Technology with Experience
Indium Software is an AWS and Databricks solution provider with a battalion of data experts who can help you deploy Databricks on AWS and set off on your cloud journey. We work closely with our customers to understand their business goals and smooth their digital transformation by designing solutions that cater to their goals and objectives.
While Quick Start is a handy tool that accelerates the deployment of Databricks on AWS, we help design the data lake architecture to optimize cost and resources and maximize benefits. Our expertise in DevSecOps ensures a secure and scalable solution that is highly available, with permission-based access to enable self-service with compliance.
Some of the key benefits of working with Indium on Databricks deployments include:
- More than 120 person-years of Spark expertise
- Dedicated Lab and COE for Databricks
- ibriX – Homegrown Databricks Accelerator for faster Time-to-market
- Cost Optimization Framework – Greenfield and Brownfield engagements
- E2E Data Expertise – Lakehouse, Data Products, Advanced Analytics, and ML Ops
- Wide Industry Experience – Healthcare, Financial Services, Manlog, Retail and Realty
FAQs
How do I create a Databricks workspace on AWS?
You can sign up for a free trial by clicking the Try Databricks button at the top of the page or through AWS Marketplace.
How can one store and access data on Databricks and AWS?
All data can be stored and managed on a simple, open lakehouse platform. Databricks on AWS allows the unification of all analytics and AI workloads by combining the best of data warehouses and data lakes.
How can Databricks connect to AWS?
Databricks integrates with AWS Glue, which allows Databricks table metadata to be shared from a centralized catalog across Databricks workspaces, AWS services, AWS accounts, and applications for easy access.
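As a minimal sketch, once a cluster is configured to use the Glue Data Catalog as its metastore (commonly via a cluster Spark configuration such as "spark.databricks.hive.metastore.glueCatalog.enabled true"; treat that key as an assumption and confirm it in the Databricks documentation), Glue-cataloged tables can be queried like ordinary metastore tables. The database and table names below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tables registered in the centralized Glue catalog appear like ordinary
# metastore tables to every workspace and service that shares the catalog.
df = spark.sql("SELECT * FROM sales_db.orders LIMIT 10")  # placeholder names
df.show()
```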