The amount of data a single person generates every day is huge. By 2020, the total volume of data worldwide was forecast to grow from 4.4 zettabytes to 44 zettabytes. Think of some of the world's biggest tech companies: a large part of the value they offer comes from their data, which they constantly analyze to become more efficient and to develop new products. This is where Databricks plays a big role.
What is Databricks?
It is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. It is a fast, easy, and collaborative Apache Spark-based analytics service for big data pipelines.
Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:
- Fully managed Spark clusters
- An interactive workspace for exploration and visualization
- A platform for powering your favorite Spark-based applications
Databricks is a premier alternative to Azure HDInsight and Azure Data Lake Analytics.
Programming languages
The most commonly used programming languages are Python, R, and SQL. Behind the scenes, these languages interact with Spark through APIs, which saves users from having to learn another programming language, such as Scala, for the sole purpose of distributed analytics.
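As a minimal sketch of this (assuming a Databricks notebook, where `spark` is already defined, and a hypothetical `sales` table), the same aggregation can be written in Python or in SQL, and both produce the same Spark execution plan:

```python
from pyspark.sql import functions as F

# Python (DataFrame API); the table name "sales" is hypothetical
sales = spark.table("sales")
by_region = sales.groupBy("region").agg(F.sum("amount").alias("total"))
by_region.show()

# SQL - no Scala required; Spark translates both into the same plan
spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""").show()
```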
Analytics platform
Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks includes the following components:
- Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
- Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka (see the streaming sketch after this list).
- MLlib: Machine Learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.
- Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
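As a hedged sketch of the Streaming component (assuming a notebook where `spark` is predefined; the broker address and topic name are placeholders, not real endpoints), Structured Streaming consumes a Kafka topic through the same DataFrame API:

```python
# Read a Kafka topic as an unbounded DataFrame; "broker:9092" and
# "events" are placeholder values.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Decode the message payload and print micro-batches to the console.
query = (
    stream.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("console")
    .start()
)
```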
Data sources
Some data sources are directly supported in Databricks Runtime. Others require you to create an Azure Databricks library and install it in a cluster. Below is a list of the supported data sources, followed by a short read sketch:
- Azure Blob storage
- Azure Data Lake Storage Gen2 (with authentication through Azure Active Directory credentials)
- Azure Data Lake Storage Gen1
- SQL Databases using JDBC
- Azure Cosmos DB
- SQL Databases using the Apache Spark connector
- Azure Synapse Analytics
- CSV files
- JSON files
- Image files
- Cassandra
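As a hedged sketch of reading two of these sources from a notebook (the storage account `myaccount`, container `mycontainer`, secret scope `demo`, and the SQL Server details below are all placeholder values, not real endpoints):

```python
# Placeholder credentials: in practice, store keys in a secret scope rather
# than in notebook code. The scope "demo" and its keys are hypothetical.
spark.conf.set(
    "fs.azure.account.key.myaccount.blob.core.windows.net",
    dbutils.secrets.get(scope="demo", key="storage-key"),
)

# CSV files in Azure Blob storage, addressed with a wasbs:// URL
df_csv = (
    spark.read
    .option("header", "true")
    .csv("wasbs://mycontainer@myaccount.blob.core.windows.net/data/sales.csv")
)

# A SQL database reached over JDBC
df_jdbc = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.orders")
    .option("user", "admin")
    .option("password", dbutils.secrets.get(scope="demo", key="sql-password"))
    .load()
)
```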
Workspace
An Azure Databricks Workspace is an environment for accessing all of your Azure Databricks assets. The Workspace organizes objects (notebooks, libraries, and experiments) into folders and provides access to data and computational resources such as clusters and jobs. Above all, it offers an interactive environment in which data scientists, data engineers, and business analysts can collaborate closely on notebooks and dashboards.
Databricks Runtimes
Runtimes are additional sets of components and updates that improve the performance and security of big data workloads and analytics. Azure Databricks offers several types of runtimes, including:
- Databricks Runtime: Includes Apache Spark but also adds several components and updates that substantially improve the usability, performance, and security of big data analytics.
- Databricks Runtime with Conda: An experimental version of Databricks Runtime based on Conda. It provides an updated and optimized list of default packages and a flexible Python environment for advanced users.
- Databricks Runtime for Machine Learning: Built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science.
- Databricks Runtime for Genomics: A version of Databricks Runtime optimized for working with genomic and biomedical data.
- Databricks Light: The Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime.
Databricks File System (DBFS)
The Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. It allows you to mount storage objects, such as Azure Blob storage containers, and access the data as if it were on the local file system. DBFS offers the following benefits:
- Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
- Allows you to interact with object storage using directory and file semantics instead of storage URLs.
- Persists files to object storage, so you won’t lose data after you terminate a cluster.
In addition, you can access DBFS objects using the DBFS CLI, DBFS API, Databricks file system utilities (dbutils.fs), Spark APIs, and local file APIs.
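As a minimal sketch using the `dbutils.fs` utilities (the container, storage account, secret scope, and mount point names below are placeholders):

```python
# Mount a Blob storage container into DBFS; all names here are hypothetical.
dbutils.fs.mount(
    source="wasbs://mycontainer@myaccount.blob.core.windows.net",
    mount_point="/mnt/demo",
    extra_configs={
        "fs.azure.account.key.myaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="demo", key="storage-key")
    },
)

# Once mounted, object storage behaves like a local directory tree.
display(dbutils.fs.ls("/mnt/demo"))

# Files written to DBFS persist after the cluster terminates.
dbutils.fs.put("/mnt/demo/notes.txt", "hello from DBFS", True)
```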
Security and privacy
Security is always top of mind when it comes to data management and access. Below you can find the tools for securing your network infrastructure and data:
- Access control: You can use access control lists (ACLs) to configure permissions for data tables, clusters, pools, jobs, and workspace objects like notebooks, experiments, and folders. Table, cluster, pool, job, and workspace access control are available only in the Azure Databricks Premium Plan.
- Control of data retention: In order to comply with privacy requirements for your organization, you may occasionally need to purge deleted objects like notebook cells, entire notebooks, experiments, or cluster logs.
- Securing access to data storage using Azure Active Directory credential passthrough: Azure Databricks supports a type of cluster configuration, called Azure Data Lake Storage credential passthrough, that allows users to authenticate to Azure Data Lake Storage from Azure Databricks clusters using the same Azure Active Directory identity that they use to log into Azure Databricks (a short sketch follows this list).
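As one short illustration of credential passthrough (assuming a cluster configured with Azure Data Lake Storage credential passthrough; the abfss:// path below is a placeholder), reads are authorized with your own Azure AD identity and no keys appear in the code:

```python
# On a passthrough-enabled cluster, no account keys or service principals are
# needed in the notebook; access runs as the signed-in Azure AD user.
df = spark.read.csv(
    "abfss://container@account.dfs.core.windows.net/folder/data.csv",
    header=True,
)
df.show()
```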
Also, the Data Integration Service uses token authentication to access Databricks; a sketch of token-based access follows the list below. You can apply the following security types to data in the Azure Databricks environment:
- Resources on ADLS are secured by OAuth credentials.
- Resources on WASB are secured by an account access key.
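As a hedged sketch of token authentication (the workspace URL and token below are placeholders; real tokens should come from a secret store, never from source code), a personal access token is passed as a bearer token to the Databricks REST API:

```python
import requests

# Placeholder workspace URL and personal access token
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapi..."  # hypothetical token; keep real tokens in a secret store

# List the workspace's clusters using the Clusters API 2.0
resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json())
```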
If you want to learn more about securing your network infrastructure and data, you can visit this blog here.
Reasons to choose Azure Databricks
If you are looking for a collaborative, high-performing, secure, and elastic data analytics platform, you may want to explore this tool.
Apart from the multi-language support this tool provides, it integrates easily with many Azure services, such as Blob storage, Data Lake Store, and SQL Database, as well as BI tools like Power BI. Because it is built on Spark, Azure Databricks can run workloads up to 100x faster than Hadoop MapReduce in memory, or up to 10x faster on disk. It is also a great collaborative platform, letting data professionals share clusters and workspaces, which leads to higher productivity.
If you want to hear more, stay tuned for more blogs here. To get the benefits of Azure Databricks, you can connect with our specialists by reaching out to us here.