
More Database Concepts

For a business analyst, the world revolves around data. But data doesn’t simply exist in a void; it must be stored, organized, protected, and made accessible. Databases are the engines that perform these critical tasks.

To use a common analogy, if data is the new oil, databases are the storage tanks, pipelines, and refineries. They are the essential infrastructure that allows us to turn raw data into the refined fuel - reports, dashboards, and predictive models - that powers modern business decisions.

The specific type of database used by a company fundamentally impacts an analyst’s work: it dictates the structure of the data, the speed of our queries, and even the types of questions we can feasibly ask. This chapter explores the different types of databases and the core concepts that every analyst must understand.


🧭 The Two Primary Roles of a Database: OLTP vs. OLAP

Databases are highly specialized tools. In a business context, they are typically optimized for one of two very different jobs: running the business or analyzing the business.

🔁 Online Transaction Processing (OLTP)

An Online Transaction Processing (OLTP) database is designed to run the daily operations of a business. Its primary job is to process a high volume of small, concurrent transactions very quickly.
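A minimal sketch of what an OLTP workload looks like, using Python's built-in `sqlite3` module; the `accounts` table and the transfer scenario are invented for illustration. The key property is that the small, multi-row update commits atomically:

```python
import sqlite3

# Toy OLTP-style transaction: table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# An OLTP transaction touches a few rows and must commit atomically:
# either both updates happen, or neither does.
with conn:  # opens a transaction; commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 75.0, 2: 75.0}
```

A production OLTP system runs thousands of such small transactions concurrently; the database's job is to keep them fast and isolated from one another.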

📊 Online Analytical Processing (OLAP)

An Online Analytical Processing (OLAP) database is designed specifically for business intelligence and analysis. Its purpose is to answer complex questions using vast amounts of historical data.

The critical takeaway is that a single database cannot excel at both jobs. This is why companies build data warehouses: separate, analysis-optimized (OLAP) systems that are periodically loaded with copies of data from the operational (OLTP) databases.

💾 Data Storage Architecture: Row-Oriented vs. Columnar

The performance difference between OLTP and OLAP systems stems from how they physically store data on disk.

Imagine a simple Sales table:

| TransactionID | Product | Region | SaleAmount |
| --- | --- | --- | --- |
| 1 | Apple | North | 1.00 |
| 2 | Berry | South | 0.50 |
| 3 | Apple | West | 1.25 |

📦 Row-Oriented Storage (Optimized for OLTP)

Traditional databases store this data one row at a time.

(1, 'Apple', 'North', 1.00)
(2, 'Berry', 'South', 0.50)
(3, 'Apple', 'West', 1.25)

This structure is perfect for OLTP tasks. To retrieve a single order (SELECT * FROM Sales WHERE TransactionID = 2;), the database finds the start of that row and reads all its data in one efficient operation.

However, it is very inefficient for an analytical query like SELECT SUM(SaleAmount) FROM Sales;. To get the SaleAmount for every row, the database is forced to read all the other data (TransactionID, Product, Region) along with it, which wastes time and resources.

📑 Columnar Storage (Optimized for OLAP)

Modern analytical databases (like those in a data warehouse) store data one column at a time.

(1, 2, 3)
('Apple', 'Berry', 'Apple')
('North', 'South', 'West')
(1.00, 0.50, 1.25)

This structure is terrible for OLTP. To retrieve TransactionID = 2, the database must jump to four different places on the disk and “stitch” the row back together.

But for our analytical query (SELECT SUM(SaleAmount) FROM Sales;), this design is revolutionary. The database only reads the SaleAmount column, dramatically reducing disk I/O and accelerating analytical queries.
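The two access patterns can be mimicked in a few lines of Python using the Sales table above. This is only an in-memory sketch - real engines work on disk pages - but it shows why the lookup favors rows and the aggregate favors columns:

```python
# Row-oriented layout: each record is stored together.
rows = [
    (1, "Apple", "North", 1.00),
    (2, "Berry", "South", 0.50),
    (3, "Apple", "West", 1.25),
]

# Columnar layout: each column is stored together.
columns = {
    "TransactionID": [1, 2, 3],
    "Product": ["Apple", "Berry", "Apple"],
    "Region": ["North", "South", "West"],
    "SaleAmount": [1.00, 0.50, 1.25],
}

# OLTP-style lookup: trivial on rows (one record read)...
order = next(r for r in rows if r[0] == 2)

# ...but a columnar store must "stitch" the row back together
# from four separate locations.
i = columns["TransactionID"].index(2)
stitched = (columns["TransactionID"][i], columns["Product"][i],
            columns["Region"][i], columns["SaleAmount"][i])

# OLAP-style aggregate: the columnar layout reads only SaleAmount,
# while the row layout would touch every field of every record.
total = sum(columns["SaleAmount"])
print(order == stitched, total)  # True 2.75
```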

Modern data warehouses like Snowflake, Amazon Redshift, and Google BigQuery all use columnar storage under the hood.


🗄️ The Relational Model: SQL Databases

For decades, the dominant database model has been the Relational Database Management System (RDBMS). The language used to communicate with these databases is SQL (Structured Query Language). This is still the most common type of database an analyst will encounter. Unless you have a specific reason to use something else, SQL databases are the default choice.

Model

Data is stored in highly structured tables (rows and columns) with predefined relationships between them (e.g., a CustomerID in the Customers table links to the Orders table).

🔗 Relationships and Joins

Relational databases are “relational” because they connect entities via keys:

- A primary key uniquely identifies each row in a table (e.g., CustomerID in the Customers table).
- A foreign key in one table references a primary key in another (e.g., CustomerID in the Orders table).

This structure enables JOINs, where data from multiple tables can be combined:

SELECT c.Name, SUM(o.Total)
FROM Customers c
JOIN Orders o ON c.CustomerID = o.CustomerID
GROUP BY c.Name;
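The query above can be run end-to-end against an in-memory SQLite database; the table contents here are invented for illustration, and an ORDER BY is added to make the result order deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY,
                         CustomerID INTEGER REFERENCES Customers(CustomerID),
                         Total REAL);
    INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO Orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# Join orders to customers via the CustomerID key, then aggregate.
result = conn.execute("""
    SELECT c.Name, SUM(o.Total)
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
    GROUP BY c.Name
    ORDER BY c.Name
""").fetchall()
print(result)  # [('Ada', 25.0), ('Grace', 7.5)]
```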

⚙️ Indexing and Performance

Indexes work like a book’s table of contents: they let the database find rows faster. Analysts should understand that indexing speeds up queries but slows down inserts/updates - a tradeoff especially relevant in OLTP systems.
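A small SQLite experiment makes the effect visible; the table and index names are made up. `EXPLAIN QUERY PLAN` shows the database switching from a full-table scan to an index seek once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (TransactionID INTEGER, Region TEXT, SaleAmount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(i, "North" if i % 2 else "South", i * 0.01) for i in range(10_000)],
)

# Without an index, the filter must scan every row of the table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Sales WHERE TransactionID = 4242"
).fetchall()

# The index lets the database seek straight to the matching rows --
# at the cost of extra work on every future INSERT and UPDATE.
conn.execute("CREATE INDEX idx_sales_txid ON Sales (TransactionID)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Sales WHERE TransactionID = 4242"
).fetchall()

print(plan_before[-1][-1])  # a full-table scan of Sales
print(plan_after[-1][-1])   # a search using idx_sales_txid
```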

🧩 Normalization vs. Denormalization

Normalization splits data into many small, related tables so that each fact is stored exactly once; this protects OLTP systems from update anomalies and wasted space. Denormalization deliberately duplicates data into wider tables so that analytical queries need fewer joins - a common tradeoff in data warehouses, where read speed matters more than storage efficiency.

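As a sketch (table and column names invented): in a normalized design each customer's city is stored once, while a denormalized design copies it onto every order row so an analytical query needs no join - but an update must then touch every copy.

```python
# Normalized: the customer's city is stored exactly once.
customers = {1: {"name": "Ada", "city": "London"}}
orders = [
    {"order_id": 10, "customer_id": 1, "total": 20.0},
    {"order_id": 11, "customer_id": 1, "total": 5.0},
]

# Denormalized: the city is duplicated onto every order row.
orders_wide = [
    {**o, "city": customers[o["customer_id"]]["city"]} for o in orders
]

# Join-free analytical query on the denormalized table:
london_total = sum(o["total"] for o in orders_wide if o["city"] == "London")
print(london_total)  # 25.0
```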
🌐 The Rise of NoSQL: Handling the 3 Vs

In the 2000s, the web exploded. Companies like Google, Amazon, and Facebook faced data challenges that the traditional relational model couldn’t handle, often defined by the “3 Vs”:

- Volume: far more data than a single machine could store.
- Velocity: data arriving continuously and at high speed.
- Variety: unstructured and semi-structured data (text, images, logs, JSON) that doesn’t fit neatly into tables.

This led to the creation of NoSQL (“Not Only SQL”) databases, a family of non-relational databases built for flexibility and massive scale.

🧩 Common NoSQL Types

- Document stores (e.g., MongoDB): store flexible, JSON-like documents.
- Key-value stores (e.g., Redis): map a key directly to a value for extremely fast lookups.
- Wide-column stores (e.g., Cassandra): organize data into column families designed to spread across many servers.
- Graph databases (e.g., Neo4j): store entities and the relationships between them.

🧮 SQL vs. NoSQL: A Comparison

NoSQL is not a replacement for SQL. They are different tools for different jobs. An analyst in a large company will almost certainly interact with both. Many modern systems are polyglot - organizations use SQL for transactions, NoSQL for scale, and even stream databases for real-time data (Kafka, Materialize, etc.).

| Feature | SQL (Relational) | NoSQL (Non-Relational) |
| --- | --- | --- |
| Data Model | Structured (Tables, Rows, Columns) | Flexible (Documents, Key-Value, Graph) |
| Schema | Fixed (Schema-on-Write) | Dynamic (Schema-on-Read) |
| Primary Use | OLTP, Data Warehousing (ACID) | Big Data, Web Apps, Unstructured Data |
| Consistency | Strong (ACID) | Eventual (BASE) |
| Query Language | SQL | Varies (e.g., MQL for Mongo) |
| Scaling | Vertical (Scale-Up) | Horizontal (Scale-Out) |

📈 Database Scaling Strategies

As data volume grows, a database must be able to scale to handle the load. There are two fundamental strategies for this.

⬆️ Vertical Scaling (Scale-Up)

Add more power to a single machine: a faster CPU, more RAM, bigger disks. It is simple - nothing about the application changes - but it eventually hits a hard physical and financial ceiling. This is the traditional path for relational databases.

➡️ Horizontal Scaling (Scale-Out)

Add more machines and spread the data and workload across them. Commodity servers make this far cheaper at large scale, and it is how NoSQL systems and modern data warehouses grow, but it requires a distributed approach to processing.


⚙️ Distributed Processing: How Horizontal Scaling Works

Horizontal scaling requires a new way of processing data called distributed processing. The old model was to “move all the data to the code” (i.e., pull 10TB of data from the database onto your single analysis server). This is no longer possible.

If your data cannot fit into the memory of a single machine, you need to rethink your approach.

The new model is to “move the code to the data.” This is best explained by the Split-Apply-Combine (or MapReduce) paradigm.

Imagine you need to count all mentions of the word “apple” in a 10TB dataset, and you have a 10-server cluster.

  1. SPLIT: The system automatically splits the 10TB file into 10 1TB chunks and sends one chunk to each of the 10 servers.
  2. APPLY (Map): The system sends your “count apples” code to all 10 servers. Each server then counts the apples only in its own 1TB chunk simultaneously (in parallel).
  3. COMBINE (Reduce): A central “master” server doesn’t do the hard work. It just asks each server for its final, small answer (Server 1: “I found 500,” Server 2: “I found 300,” etc.) and adds them up.
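The three steps above can be sketched on a single machine; in a real cluster each chunk would live on - and be counted by - a different server, so the Apply step would genuinely run in parallel. The data and chunk counts here are stand-ins:

```python
# Stand-in for the 10TB file: 4,000 words, 2,000 of them "apple".
words = ["apple", "pear", "apple", "plum"] * 1000

# 1. SPLIT: break the data into chunks, one per (simulated) server.
n_servers = 10
chunks = [words[i::n_servers] for i in range(n_servers)]

# 2. APPLY (Map): ship the counting code to every chunk.
def count_apples(chunk):
    return sum(1 for w in chunk if w == "apple")

partial_counts = [count_apples(c) for c in chunks]  # parallel on a real cluster

# 3. COMBINE (Reduce): the master only sums the small per-server answers.
total = sum(partial_counts)
print(total)  # 2000
```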

This ability to split a massive job into small parallel tasks is the foundation of modern big data frameworks such as Hadoop MapReduce and Apache Spark.


Modern analytics also pushes beyond batch processing: streaming systems such as Apache Kafka and Apache Flink apply the same distributed ideas to data that arrives continuously, enabling near-real-time analysis.


✅ Summary: The Right Tool for the Job

There is no “best” database. The right choice always depends on the problem you are trying to solve. As a business analyst, your environment will dictate the tools you use: you may query a row-oriented SQL database one day, a columnar data warehouse the next, and a NoSQL or streaming system after that. Understanding the strengths and tradeoffs of each will make you far more effective.