Understanding ACID Properties in Databases: A Comprehensive Guide

I am student currently pursuing my Bachelor's in computer science. I have a keen interest in DevOps and LINUX related technologies
CONTENTS:
What is a transaction?
Lifespan of a Transaction
Nature of transaction
Atomicity
Isolation
Dirty Reads
Non-Repeatable Reads
Phantom Reads
Lost Updates
Read Committed
Repeatable Reads
Snapshot
Serializable
Consistency
Consistency in Data
Consistency in Read
Durability
What is a transaction?
A transaction is a collection of queries that operates as one unit of work. For example to deposit money from one account to another there will be three queries (Select,Update,Update) which is counted as one unit of work that is depositing money from account A to account B. That is even if a single query fails and others are run successfully the entire transaction will fail.
Lifespan of a Transaction
A transaction begins with a BEGIN keyword. All the queries that are successfully executed are not written to disk until a COMMIT is made. Let's try to understand it with our previous example of depositing money,take a look at Account table below
| account_id | balance |
| 1 | 1000 |
| 2 | 1000 |
BEGIN tx1;
SELECT balance FROM ACCOUNT WHERE account_id=1;
UPDATE ACCOUNT SET balance=balance-100 WHERE ID=1;
UPDATE ACCOUNT SET balance=balance+100 Where ID=2;
COMMIT;
Now until the commit keyword is encountered the changes are not return. It allows for safe rollbacks in case of failures like what if the the account with id 1 doesn't have sufficient balance for the deposit. There comes another keyword called rollback it does what it says, it rolls back,if you realise something is wrong just rollback and start again.
BEGIN tx1;
SELECT balance FROM ACCOUNT WHERE account_id=1;
UPDATE ACCOUNT SET balance=balance-100 WHERE ID=1;
UPDATE ACCOUNT SET balance=balance+100 Where ID=2;
ROLLBACK;
Now here comes a trade off between the available databases, some databases are very fast with commits than others like Postgres. Postgres secretly writes the changes to disks so when you say commit it is like yeah done already. Postgres assumes that you will commit. Making Postgres heavy at rollbacks cause than it has to undo the entire thing on disc. On the other hand some databases are fast with rollbacks cause everything is in memory.
Nature of Transactions
Usually Transactions are used to change and modify data. However, it is perfectly normal to have a read only transaction. Example, you want to generate a report and you want to get consistent snapshot based at the time of transaction. We will learn more about this in the Isolation section.
Atomicity
As we discussed earlier in a transaction all queries must succeed and all the queries results into one unit of work,this property is called atomicity. If one query fails while others succeed the transaction will rollback that is already established, even if all the queries were successful and the database crashes prior to a COMMIT the transaction should rollback.
Isolation
Isolation as the name suggests refers to keep one transaction unaffected from other transactions in the database. To understand isolation we first need to talk about read phenomenons. The read phenomenons we will talk about include Dirty Reads, Non-Repeatable Reads, Phantom Reads and Lost Updates.
Dirty Reads
Let's assume a transaction T1 is started which looks like below
| PID | QNT | Price |
| 1 | 10 | $5 |
| 2 | 20 | $4 |
BEGIN t1;
SELECT PID,QNT*Price FROM SALES;
SELECT SUM(QNT*Price) FROM SALES;
COMMIT;
Now let us say that after the first query runs successfully another transaction T2 starts which looks like this. Note that the transaction T2 is not committed.
BEGIN t2;
UPDATE SALES SET QNT=QNT+5;
In a dirty read the aggregate query will read the change made by the transaction 2 which is NOT COMMITTED. This should be avoided in all cases. The result which should have been $130 is now $155 which is not correct in context of transaction 1.
Non-Repeatable Reads
Non repeatable read is still better than a dirty and by default they are permitted by database engines as the default isolation level is set to Read Committed we will talk about that in later section on isolation levels. A Non-Repeatable Read allows to read committed changes from other transactions. To get a better understanding let's take the above example as well
| PID | QNT | Price |
| 1 | 10 | $5 |
| 2 | 20 | $4 |
BEGIN t1;
SELECT PID,QNT*Price FROM SALES;
SELECT SUM(QNT*Price) FROM SALES;
COMMIT;
BEGIN t2;
UPDATE SALES SET QNT=QNT+5;
COMMIT;
We get $155 when it should be $130 We did read a committed value, but it gave us inconsistent results. You may or may not want to avoid these based on your requirements, all the other phenomenons are like that except dirty reads, dirty reads are just bad.
Phantom Reads
To understanding we will take the same example[we are going to understand everything with examples ;) ].
| PID | QNT | Price |
| 1 | 10 | $5 |
| 2 | 20 | $4 |
BEGIN t1;
SELECT PID,QNT*Price FROM SALES;
SELECT SUM(QNT*Price) FROM SALES;
COMMIT;
Now another transaction T2 inserts another row just after the SELECT PID,QNT*Price query.
BEGIN t2;
INSERT INTO SALES VALUES(3,10,"$1");
COMMIT;
| PID | QNT | Price |
| 1 | 10 | $5 |
| 2 | 20 | $4 |
| 3 | 10 | $1 |
We get $140 when it should be $130 We read a committed value that showed up in our range query.
The Aggregate function SUM reads that extra row this is called a phantom read. It is not the same as a non-repeatable read. A non-repeatable read is when the same rows are read with changed values while a phantom read is when an additional row is read in the transaction.
Lost Updates
Consider that two transactions begin at the same time and transaction 2 is committed first, overwriting the update thus the UPDATE from t1 is lost.
| ID | QNT | Price |
| 1 | 10 | $5 |
| 2 | 20 | $4 |
BEGIN t1;
UPDATE SALES SET QNT=QNT+10 WHERE ID=1;
SELECT SUM(QNT*Price) FROM SALES;
COMMIT;
BEGIN t2;
UPDATE SALES SET QNT=QNT+5 WHERE ID=1;
COMMIT;
| ID | QNT | Price |
| 1 | 15 | $5 |
| 2 | 20 | $4 |
We get $155 when it should be $180 Our update was overwritten another transaction and as a result “lost”.
Now let's talk about how to fix those issues by different isolation levels.
Read Committed
Each query in a transaction only sees committed changes by other transactions. Read committed is the default isolation level for most databases. It disables Dirty Reads but can not avoid Non-Repeatable reads and Phantom Reads.
Repeatable Read
The transaction will make sure that when a query reads a row, that row will remain unchanged while its running. We will take Postgres as an example for implementation of the isolation levels discussed. What happens is that when a transaction begins with Read committed, if an update in the row is committed by another transaction it creates a new version of that row. Now the transaction will read from the version it was using which does not have the changes.
| PID | QNT | Price |
| 1 | 10 | $5 |
| 2 | 20 | $4 |
BEGIN t1 ISOLATION LEVEL REPEATABLE READ;
SELECT PID,QNT*Price FROM SALES;
SELECT SUM(QNT*Price) FROM SALES;
COMMIT;
BEGIN t2;
UPDATE SALES SET QNT=QNT+5;
COMMIT;
This time the result of the sum query will be $130 which is correct. The change is committed in the database but the transaction t1 is not aware of it as the isolation level is set to repeatable read.
NOTE: Postgres implements the repeatable read isolation by snapshot mechanism which makes the transaction safe from phantom reads also. But in other database engines this may or may not be the same. We will discuss snapshot in brief below.
Snapshot
Each query in a transaction only sees changes that have been committed up to the start of the transaction. It's like a snapshot version of the database at that moment. This makes the transaction safe from phantom reads and repeatable reads as well.
Serializable
Serializable isolation level, transactions are executed in such a way that they appear to run one after the other, i.e., in a serial manner. This isolation level prevents all the common concurrency issues such as lost updates, dirty reads, non-repeatable reads, and phantom reads. However, it can significantly impact performance because it requires more locking and can lead to more contention between transactions.

Consistency
Now that we have understood what is meant by atomicity and isolation, we can talk about consistency. In most of the cases the inconsistency arise due to those two atomicity and isolation. Consistency has to be ensured at two places CONSISTENCY in Data and another is CONSISTENCY in Reads.
Consistency in Data
Consistency in data is responsibility of user for example referential consistency (foreign keys). Consistency in data can be ensured with atomicity and isolation of the transaction. Consider the below example, there are two tables a Pictures Table and a Likes Table. Spot the inconsistency in data, the number of likes on picture 1 in the Pictures Table does not amount to the number of users that have liked picture 1 in Likes Table. Another example would be if there is an entry of a user that has liked a picture that no longer exists.
| PICTURE_ID | BLOB | LIKES |
| 1 | XX | 5 |
| 2 | XX | 1 |
| USER(PK) | PICTURE_ID(PK)(FK) |
| Jon | 1 |
| Edmond | 1 |
| Jon | 2 |
Consistency in read

If a transaction committed a change will a new transaction immediately see the change? In the given figure, A transaction writes a value X to the database and another transaction reads the value X, which sounds pretty obvious.
Once you have multiple worker nodes or read replicas with your primary database, the change might not have been propagated to those read replicas immediately. Thus making the read inconsistent or the better term for that is Eventually Consistent. Eventual consistency is tolerable as after some time the changes will be propagated to all the replicas.
This phenomenon affects the system as a whole and both Relational and No-SQL databases suffer from this.
Durability
Changes made by committed transactions must be persisted in a durable non-volatile storage.
WAL(Write ahead log)
Writing a lot of data to disk is expensive (indexes, data files, columns, rows, etc..). That is why DBMSs persist a compressed version of the changes known as WAL (write-ahead-log segments).
AOF( Append Only File)
AOF is a type of persistence technique used in Redis, a popular in-memory data store. , which is a logging mechanism that writes to a log file on disk every write operation performed on a Redis database. The log file is used to reconstruct the database in the event of a crash or failure.
Durability and OS Cache
A write request in OS usually goes to the OS cache. When the writes go the OS cache, an OS crash, machine restart could lead to loss of data. fsync OS command forces writes to always go to disk. fsync can be expensive and slows down commits, anytime you have to perform I/O on disk it is expensive that is part of the reason why many databases compromise durability for performance.


