Essential Excel Concepts: Data Cleaning
Sep 8, 2024
4 min read
1
7
1
Before we do any analysis in Excel, we need to ensure the data is “clean.” By clean, I mean update data that has errors. Some errors include having duplicate values, extra spaces, inconsistent data types, and incorrect spelling. There are many other ways to clean data. However, we will start by learning the most common ways to clean data you will encounter in the business world.
One of the most common ways to clean data is removing duplicate values. Sometimes you will not be able to tell if your dataset has duplicate values at first glance—there could be thousands of rows of data in your data set. However, in our examples, it is clear just by looking at the table. Keep in mind that these tricks and techniques I show are intended to large datasets.
Let us explore the following table for our data cleaning examples.
Quickly scanning the table, we can see there are duplicate values. If the dataset/table was larger, it might be difficult to see all the duplicates. I will show a neat trick that will help to show duplicate values that I use daily.
Since our data is in a table (we will discuss tables in more detail in the next blog post), we can use the built-in sorting capabilities to identify duplicates.
By clicking the yellow highlighted arrows, we can see the sorting options.
When we apply the “Sort Ascending” option, our table looks as follows.
We can see the highlighted rows are duplicate values since the row values are identical.
Be careful—sometimes most of the data in the rows of your tables will be the same and you will think it is a duplicate value. However, there may be one column or attribute that is different. For example, you could have a record or row in your table that has 99/100 columns that are the same. However, if there is one attribute that is different, record is not a duplicate. Moral of the story--be careful.
Now we will remove the duplicates from our table. Note, we will not just delete the rows line by line manually. That will take too much time—especially if the dataset is large. Instead, we will use built-in tools Excel provides for us.
Under the Data tab in Excel, there is a remove duplicate option. Let us select our table as the range and delete some duplicates.
A window will pop-up asking us to select items to delete.
I typically just keep the default “Select All” selected and click “OK.”
Here are the results.
As we can see, our table no longer has duplicates.
Now will use the “Find and Replace” technique to correct spelling. I made some names incorrect in the table to help with this example. We can see Paul Reverest as value in our table. However, the name should be Paul Revere.
Let us change this. Note, I will not update all the other names that are incorrect. But I hope you can see the obvious ones!
Under the Home tab, you will see Find & Select as an option.
When we click to expand, we will see Replace as a choice. Click Replace to open the following dialogue box with our values entered already.
After we click Replace All, here is our update table with the correct spelling for Paul Revere.
Now we will move on the last data cleaning technique we should have in our Excel toolkit—Trimming.
We use trimming to remove unwanted spaces before and after values in cells. Sometimes, when we pull data from various data sources, the values will have extra spaces, which could affect our functions we apply. To get around this (or step one in troubleshooting!), we use trimming.
We will trim our Employee ID column and use a power technique to help with analysis as well. That technique is call using a Helper Column.
A Helper Column is simply a column that we add to our data set or table to put functions in. We will add a new, helper column to our table by right-clicking on a column to insert our new column.
This will add a column as follows.
We will use the TRIM ( ) function now to trim the values from the Employee ID column by clicking in the cell in Column 1 (we will rename to TRIM Employee ID).
Type in =TRIM ( ), which will prompt criteria from Excel to input in the function.
You will notice the text criterion needed. Here we will just select the cell we want to trim by clicking in the cell or entering the value for the cell number.
After we click enter, the table will automatically be updated with autofill. The result will look the same. However, the extra space will be removed. Note, we updated the name to Trim Employee ID in the table output below.
A tip that I use every day is to just copy the values from the TRIM function and replace the values into the column since the values are conditional based off the function criteria. In other words, if we click in the cell like below, we will not see the value, we will see the function with the relative cell value inside.
Here is the option to select when you right-click to paste the values in the column.
COOOOOOOOOOL