In this article, I give an overview of data integrity and the types of integrity that matter most in practice.
Data integrity is the state in which the data in a database remains accurate and consistent throughout the database's life cycle. It is a vital concept in database design and should be enforced in any system that stores, processes, and retrieves data. Without data integrity, data analysts and data scientists have no assurance that the data they use to build models and visualizations is up to date and accurate.
The most common types of data integrity are entity integrity, referential integrity, and domain integrity. Each is described in more detail below.
- Entity integrity is achieved when each row in a table has a unique identifier (a primary key), which ensures that no entity is recorded more than once in the table. An entity is whatever the table records information about. For example, a table can hold information about customer entities, with columns such as id, phone number, and email. In this example, we want each customer to have exactly one row in the table to avoid duplicated information. This matters because duplicated information can become contradictory when the table is updated. For example, if the user “Adam Scott” appeared in the table twice and we updated his phone number, only one of the two rows might receive the new number. We would then have a data integrity issue, because we could no longer tell which phone number is current.
- Referential integrity is achieved when the connections between related tables in the database are kept consistent through the use of primary and foreign keys. Take a customer table and a sales table as an example. Here the customer table is the parent table and the sales table is the child table, because a sale cannot exist without a corresponding customer. Therefore, each row in the sales table needs a foreign key that references an existing row in the customer table. When a customer is deleted from the customer table, we want all of that customer's sales to be removed from the sales table as well, because they should not exist without a customer.
- Domain integrity is achieved when the contents of each column satisfy the constraints defined for that column. For example, for a phone number column, we want a constraint that ensures every value consists only of digits. If we see a value of “my number” in the phone number column, we know that we have a domain integrity issue, because non-numeric text should not be allowed in this column. Domain integrity can be enforced with predefined rules and constraints, such as data types and check constraints, when the tables are created.
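
The three types of integrity described above can be sketched with Python's built-in `sqlite3` module and an in-memory database. This is a minimal illustration, not a recommended schema: the table names (`customer`, `sale`), the column names, the helper `try_execute`, and the sample customer "Jane Doe" are all assumptions made for the example, and SQLite is used only because it needs no setup. The primary key stands in for entity integrity, the foreign key with `ON DELETE CASCADE` for referential integrity, and the `CHECK` constraint on `phone` for domain integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,              -- entity integrity: one row per id
    name  TEXT NOT NULL,
    -- domain integrity: phone must be non-empty and contain digits only
    phone TEXT NOT NULL CHECK (phone <> '' AND phone NOT GLOB '*[^0-9]*')
);
CREATE TABLE sale (
    id          INTEGER PRIMARY KEY,
    -- referential integrity: every sale must point at an existing customer,
    -- and deleting the customer also deletes that customer's sales
    customer_id INTEGER NOT NULL REFERENCES customer(id) ON DELETE CASCADE,
    amount      REAL NOT NULL
);
""")


def try_execute(sql, params=()):
    """Hypothetical helper: True if the statement succeeds,
    False if the database rejects it with a constraint violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


insert_customer = "INSERT INTO customer (id, name, phone) VALUES (?, ?, ?)"
insert_sale = "INSERT INTO sale (customer_id, amount) VALUES (?, ?)"

added = try_execute(insert_customer, (1, "Adam Scott", "5551234567"))
# Entity integrity: a second row with the same primary key is rejected.
duplicate = try_execute(insert_customer, (1, "Adam Scott", "5559876543"))
# Domain integrity: a non-numeric phone value is rejected.
bad_phone = try_execute(insert_customer, (2, "Jane Doe", "my number"))
# Referential integrity: a sale for a customer that does not exist is rejected.
orphan_sale = try_execute(insert_sale, (99, 20.0))
valid_sale = try_execute(insert_sale, (1, 20.0))

# Deleting the customer cascades to the customer's sales.
try_execute("DELETE FROM customer WHERE id = ?", (1,))
remaining_sales = conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]

print(added, duplicate, bad_phone, orphan_sale, valid_sale, remaining_sales)
# prints: True False False False True 0
```

Cascading the delete mirrors the behavior described in the referential integrity bullet. A stricter design could instead refuse to delete a customer who still has sales, which in SQLite would be `ON DELETE RESTRICT`.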