Data Cleaning
Author: Hao Zheng
It is always amazing to see the insights that data scientists can generate from data. In real life, however, not every collected data set can be analyzed directly with existing tools, primarily because raw data tend to be messy. Data scientists are often said to spend around 80% of their time simply cleaning up data. It is therefore important to understand how to manage messy data and transform it into organized data that is ready for further processing.
In this Project, we will introduce six basic techniques that can be applied to most of the data sets you will encounter. Our goal is to transform extremely messy data into less messy data. Even after applying these techniques, your data set may still not be completely ready for further processing. This is because each data set has its own unique problems, and you may find yourself needing to adjust how you apply these data cleaning techniques based on your data set’s particular issues. However, having these techniques in your back pocket will certainly be an asset as you prepare and clean your data.
We encourage you to create your own Jupyter notebook and follow along. You can also download this notebook together with any affiliated data in the Notebooks and Data GitHub repository. Alternatively, if you do not have Python or Jupyter Notebook installed yet, you may experiment with a virtual notebook by launching Binder or Syzygy below (learn more about these two tools in the Resource tab).
Business Problem
Why do we need to clean the data before applying our desired statistical analyses, when we are really interested in generating business insights? In order to answer this question, we need to understand that there is a gap between raw data and the insights that business managers are interested in. Let’s use the retail industry as an example. Questions that are usually asked in the retail industry include:
- Q1: Which geographic location provides us with the most profit per person?
- Q2: What type of customers generate the most revenue for us?
However, the raw data set can only record the transactions from customers enrolled in the business’ loyalty program. In the process of recording these transactions, we may miss some of the information that is required to solve the aforementioned problems, such as:
- The recorded transaction only included the name of the store, but you do not know where some of these stores are located.
- The customers were not willing to share certain personal information due to privacy concerns.
In such scenarios, if we attempted to directly apply our statistical methods to this raw data set, we would likely have peculiar results on our hands:
- Q1: The stores in Happyland (perhaps a community with only 1,000 residents) provide us with the most profit per person.
- Q2: The customers who were not willing to share their gender identity generate the most revenue for us.
The above two insights are only “correct” in a statistical sense. In a business context, they provide little to no value. For example, it would not make sense for us to open up more stores in the small, but profitable, community of Happyland, and knowing that those customers who care about their privacy are the ones who generate the most revenue for us does not provide a good indication of who they truly are. Therefore, we use the data cleaning process to help close the gap between the messy, raw data set and the insights desired by business managers.
Let’s walk through the data cleaning process using an ongoing example for the remainder of the Project. Here, we have a data set that captures the construction developments that occur around different neighbourhoods. Can we determine which neighbourhood has the most “constr type 3” construction projects?
Data Set Check
import pandas as pd
import numpy as np
import warnings
# This step prevents warnings from showing up (for formatting purposes); you can ignore this line when practicing yourself
warnings.filterwarnings("ignore")
mydata = pd.read_csv("nb0007_data/Building_Permits.csv")
mydata.head(1)
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 rows × 43 columns
Here, we have our first glimpse of the data. Before moving into the data cleaning process, it is crucial to grasp what is happening in the data set so that we understand the ways in which it is problematic and thus needs to be cleaned. It is obvious that there are several “NaN” values in the data set above. We definitely do not want NA values in our data set, or else we may run into analytical issues later on when we try to generate business insights from it. Let’s start by tackling the extra spaces (Step 1) and the NA values (Step 2) in our example data set.
# get info about all columns
mydata.info()
# create some exploratory visualization
mydata.isna().sum().plot(kind='bar')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198900 entries, 0 to 198899
Data columns (total 43 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Permit Number 198900 non-null object
1 Permit Type 198900 non-null int64
2 Permit Type Definition 198900 non-null object
3 Permit Creation Date 198900 non-null object
4 Block 198900 non-null object
5 Lot 198900 non-null object
6 Street Number 198900 non-null int64
7 Street Number Suffix 2216 non-null object
8 Street Name 198900 non-null object
9 Street Suffix 196132 non-null object
10 Unit 29479 non-null float64
11 Unit Suffix 1961 non-null object
12 Description 198610 non-null object
13 Current Status 198900 non-null object
14 Current Status Date 198900 non-null object
15 Filed Date 198900 non-null object
16 Issued Date 183960 non-null object
17 Completed Date 97191 non-null object
18 First Construction Document Date 183954 non-null object
19 Structural Notification 6922 non-null object
20 Number of Existing Stories 156116 non-null float64
21 Number of Proposed Stories 156032 non-null float64
22 Voluntary Soft-Story Retrofit 35 non-null object
23 Fire Only Permit 18827 non-null object
24 Permit Expiration Date 147020 non-null object
25 Estimated Cost 160834 non-null float64
26 Revised Cost 192834 non-null float64
27 Existing Use 157786 non-null object
28 Existing Units 147362 non-null float64
29 Proposed Use 156461 non-null object
30 Proposed Units 147989 non-null float64
31 Plansets 161591 non-null float64
32 TIDF Compliance 2 non-null object
33 Existing Construction Type 155534 non-null float64
34 Existing Construction Type Description 155534 non-null object
35 Proposed Construction Type 155738 non-null float64
36 Proposed Construction Type Description 155738 non-null object
37 Site Permit 5359 non-null object
38 Supervisor District 197183 non-null float64
39 Neighborhoods - Analysis Boundaries 197175 non-null object
40 Zipcode 197184 non-null float64
41 Location 197200 non-null object
42 Record ID 198900 non-null int64
dtypes: float64(12), int64(3), object(28)
memory usage: 65.3+ MB
[Output: bar chart showing the number of NaN values in each column]
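Beyond the raw counts, it can also help to look at the share of missing values in each column before deciding how to handle them. The following is a small addition, not part of the original notebook, just a sketch:
# fraction of missing values per column, largest first
missing_share = mydata.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))
# columns that are almost entirely empty (e.g. TIDF Compliance) are often
# better candidates for dropping than for filling in
print(missing_share[missing_share > 0.95].index.tolist())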
Data Cleaning
Step 1 : Getting Rid of Extra Spaces
string = " hello world "
print(string)
print("="*100)
fresh_string = string.strip()
print(fresh_string)
hello world
====================================================================================================
hello world
This simple example illustrates that when there are extra spaces before and after a string, we want to make sure we get rid of them.
The primary reason for removing extra spaces is that they lead to both analytical and formatting issues down the road. For example, when performing a logistic regression with categorical predictors, a string with a leading space such as " hello" will be treated as a different factor level than the trimmed string ("hello"), even though they have the same meaning in real life.
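As a tiny illustration (not from the original notebook) of why this matters, pandas will happily treat the padded and trimmed versions of the same name as two different categories:
s = pd.Series(["Ellis", " Ellis", "Geary"])
print(s.value_counts())               # " Ellis" and "Ellis" are counted as two different values
print(s.str.strip().value_counts())   # after stripping, they are counted together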
We can apply the same methodology, as illustrated in the code above, to the entire data set.
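The notebook stops at the single-string example, but the same idea can be extended to every text column of the data set. A minimal sketch, assuming we only want to touch the object (string) columns:
# strip leading/trailing whitespace from every text (object) column
text_cols = mydata.select_dtypes(include="object").columns
for col in text_cols:
    # only strip actual strings; leave NaN and any non-string entries untouched
    mydata[col] = mydata[col].apply(lambda v: v.strip() if isinstance(v, str) else v)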
Step 2 : Replacing All NA Values
The second step is the most obvious one when it comes to cleaning our data set: dealing with the NA values it contains. We want to replace these NA values with values that will not influence the analyses we plan to run down the road. To do so, we can apply one of several techniques.
The first method is fast and easy: we can simply replace all the NA values with 0. This method is advantageous if, in our future analysis, we wish to sum the values down a column, as zeroes will not influence the resulting summation.
# work on a copy of the first five rows so that we do not modify mydata itself
test_row = mydata.iloc[0:5].copy()
test_row.fillna(0, inplace=True)
test_row.head(1)
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | 0 | Ellis | St | ... | 3.0 | constr type 3 | 0.0 | 0 | 0 | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 rows × 43 columns
However, this methodology has its limitations. For example, we can see here that the Proposed Construction Type Description cannot be 0, so it would be inappropriate to treat this NA value as a 0. Being a description, the value here should be a string rather than a number. So how do we handle scenarios where the data in a column are strings?
There is a simple method that can be applied: we can substitute the original NA values in the Proposed Construction Type Description column with the value that appears most frequently in that column. This method is particularly useful when future analyses involve searching for outliers in the dataframe: by filling in the value that already appears most frequently, we avoid introducing new, artificial outliers.
test_row = mydata.iloc[0:100].copy()
# value_counts()[:1].index.tolist()[0] returns the most frequent value in the column;
# fillna with inplace=True then replaces the NA values with that value
most_frequent = test_row['Proposed Construction Type Description'].value_counts()[:1].index.tolist()[0]
test_row['Proposed Construction Type Description'].fillna(most_frequent, inplace=True)
test_row.head(5)
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | 3.0 | constr type 3 | NaN | wood frame (5) | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | NaN | Geary | St | ... | 3.0 | constr type 3 | NaN | wood frame (5) | NaN | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 |
2 | 201605278609 | 3 | additions alterations or repairs | 05/27/2016 | 0595 | 203 | 1647 | NaN | Pacific | Av | ... | 1.0 | constr type 1 | 1.0 | constr type 1 | NaN | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 |
3 | 201611072166 | 8 | otc alterations permit | 11/07/2016 | 0156 | 011 | 1230 | NaN | Pacific | Av | ... | 5.0 | wood frame (5) | 5.0 | wood frame (5) | NaN | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 |
4 | 201611283529 | 6 | demolitions | 11/28/2016 | 0342 | 001 | 950 | NaN | Market | St | ... | 3.0 | constr type 3 | NaN | wood frame (5) | NaN | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 |
5 rows × 43 columns
Notice how some of the NAs in the Proposed Construction Type Description column have now been successfully changed to “wood frame (5)”, which is the most frequent value in that column.
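An equivalent, and arguably more readable, way to obtain the most frequent value is pandas’ .mode(); this alternative is not in the original notebook, just a sketch:
# .mode() returns the most frequent value(s); [0] picks the first one
most_frequent = test_row['Proposed Construction Type Description'].mode()[0]
test_row['Proposed Construction Type Description'] = (
    test_row['Proposed Construction Type Description'].fillna(most_frequent)
)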
Step 3 : Converting Strings into Numbers
When we are performing numerical analysis, such as finding a mean value, sometimes the system will return an error indicating that we cannot apply the mathematical command (finding the mean value) to a set of strings. In these cases, we want to make sure our data are in the correct format; that is, that our numbers are stored as numeric types rather than as strings, so that the data can be used for further numerical analysis.
Trying to convert strings into numbers in Excel can be a messy process; in Python, however, the conversion is straightforward. Here, we introduce two simple conversions: astype(float) and astype(int).
test_row = mydata.iloc[0:5]
test_row['Permit Type'].astype(float)
0 4.0
1 4.0
2 3.0
3 8.0
4 6.0
Name: Permit Type, dtype: float64
test_row = mydata.iloc[0:5]
test_row['Permit Type'].astype(int)
0 4
1 4
2 3
3 8
4 6
Name: Permit Type, dtype: int64
Converting with astype(float) stores the values as floating-point numbers (which can hold decimals), while astype(int) converts decimal numbers into integers by dropping the decimal part. These two conversions allow for flexibility in application.
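Note that astype(int) will fail if a column contains entries that cannot be parsed as numbers. The notebook does not cover this case, but a common workaround is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN instead of raising an error. A minimal sketch with a made-up column:
# hypothetical example: a column stored as strings, with one bad value
s = pd.Series(["140", "440", "1647", "n/a"])
# errors='coerce' turns values that cannot be parsed into NaN
numbers = pd.to_numeric(s, errors="coerce")
print(numbers)
# 0     140.0
# 1     440.0
# 2    1647.0
# 3       NaN
# dtype: float64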
Step 4 : Removing Unnecessary Duplicates
Sometimes, rows with the same content appear more than once in our data set, which is usually undesirable. How can we handle these unnecessary duplicates?
# create a sample df with 2 rows and fewer columns (for demonstration)
test_df = mydata[['Permit Number', 'Permit Type', 'Block']].iloc[0:2]
# create a df with double info
dup_df = pd.concat([test_df, test_df])
dup_df.head()
# just to demonstrate, see which rows have duplicates
dup_df['Duplicate'] = dup_df.duplicated()
dup_df
# or, if you have a unique identifier (like the permit number), you can drop duplicates based on it alone
# dup_df.drop_duplicates(subset=['Permit Number'])
Permit Number | Permit Type | Block | Duplicate | |
---|---|---|---|---|
0 | 201505065519 | 4 | 0326 | False |
1 | 201604195146 | 4 | 0306 | False |
0 | 201505065519 | 4 | 0326 | True |
1 | 201604195146 | 4 | 0306 | True |
Here, we see two rows that are duplicated, and we would like to remove these redundant rows. To do so, we use the “drop_duplicates” command to achieve our desired result: a data set that is free of unnecessary duplicates.
# drop the helper 'Duplicate' column first, then remove the repeated rows
dup_df.drop(columns='Duplicate').drop_duplicates()
Permit Number | Permit Type | Block | |
---|---|---|---|
0 | 201505065519 | 4 | 0326 |
1 | 201604195146 | 4 | 0306 |
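drop_duplicates also accepts a few useful parameters; the following sketch (not from the original notebook) shows the two most common ones:
# keep='last' keeps the last occurrence of each duplicate instead of the first
dup_df.drop(columns='Duplicate').drop_duplicates(keep='last')
# subset= considers only certain columns when deciding what counts as a duplicate,
# e.g. treat rows with the same permit number as duplicates
dup_df.drop(columns='Duplicate').drop_duplicates(subset=['Permit Number'])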
Step 5 : Finding Potential Mismatches
The kind of error we address in this step is not a typical Python error; it is a logical error in the data set itself, which is best illustrated with the example below.
test_df1 = mydata.iloc[0:5]
test_df1.head(1)
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | 0 | Ellis | St | ... | 3.0 | constr type 3 | 0.0 | 0 | 0 | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 rows × 43 columns
So, what can be a potential mismatch in the dataframe?
Within our data set, we see one column labeled “Existing Construction Type” and another labeled “Existing Construction Type Description”. If a row has an Existing Construction Type of 3.0, then we would expect the corresponding Existing Construction Type Description to be “constr type 3”. What if, instead, the Existing Construction Type Description contained “constr type 4”?
In our data set, there is a fixed relationship between the Existing Construction Type and the Existing Construction Type Description. Any rows with differing values can be considered as an error, and we would want to correct this mismatch in our data cleaning process.
We can check our data set to see if there is any mismatching occurring in the first place:
(mydata.groupby(['Existing Construction Type', 'Existing Construction Type Description'])
.size()
.reset_index()
.rename(columns={0:'count'}))
Existing Construction Type | Existing Construction Type Description | count | |
---|---|---|---|
0 | 1.0 | constr type 1 | 28072 |
1 | 2.0 | constr type 2 | 4068 |
2 | 3.0 | constr type 3 | 9663 |
3 | 4.0 | constr type 4 | 381 |
4 | 5.0 | wood frame (5) | 113350 |
Let’s assume some mismatching problems actually occur in our data set. We can apply the following technique to correct any mismatching errors between the Existing Construction Type data and the Existing Construction Type Description data:
test_df1["Existing Construction Type"].unique()
print("There are 3 Existing Construction Types, which are 1, 3, and 5")
print()
print("="*100)
print()
def errorcheck(row):
    # return the description that matches the numeric construction type
    if row["Existing Construction Type"] == 3.0:
        return "constr type 3"
    if row["Existing Construction Type"] == 1.0:
        return "constr type 1"
    if row["Existing Construction Type"] == 5.0:
        return "wood frame (5)"
    return row["Existing Construction Type Description"]
for index, row in test_row.iterrows():
    # rows returned by iterrows() are copies, so write the corrected value
    # back to the DataFrame with .loc
    test_row.loc[index, "Existing Construction Type Description"] = errorcheck(row)
test_row.head()
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | 0 | ellis | St | ... | 3.0 | constr type 3 | 0.0 | 0 | 0 | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | 0 | geary | St | ... | 3.0 | constr type 3 | 0.0 | 0 | 0 | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 |
2 | 201605278609 | 3 | additions alterations or repairs | 05/27/2016 | 0595 | 203 | 1647 | 0 | pacific | Av | ... | 1.0 | constr type 1 | 1.0 | constr type 1 | 0 | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 |
3 | 201611072166 | 8 | otc alterations permit | 11/07/2016 | 0156 | 011 | 1230 | 0 | pacific | Av | ... | 5.0 | wood frame (5) | 5.0 | wood frame (5) | 0 | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 |
4 | 201611283529 | 6 | demolitions | 11/28/2016 | 0342 | 001 | 950 | 0 | market | St | ... | 3.0 | constr type 3 | 0.0 | 0 | 0 | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 |
5 rows × 43 columns
# there are alternative methods for doing this cleaning,
# e.g. using apply with a lambda and a simplified function
def constr_type(x):
    if x == 3.0:
        descr = "constr type 3"
    elif x == 1.0:
        descr = "constr type 1"
    elif x == 5.0:
        descr = "wood frame (5)"
    else:
        descr = None
    return descr
new_df = test_df1.copy()
new_df['New Description'] = new_df['Existing Construction Type'].apply(lambda x: constr_type(x))
# or you can map a dictionary
constr_dict = {1.0: "constr type 1",
               3.0: "constr type 3",
               5.0: "wood frame (5)"}
new_df['New Description2'] = new_df['Existing Construction Type'].map(constr_dict)
new_df
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | New Description | New Description2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | NaN | wood frame (5) | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 | constr type 3 | constr type 3 |
1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | NaN | Geary | St | ... | NaN | wood frame (5) | NaN | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 | constr type 3 | constr type 3 |
2 rows × 45 columns
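With the mapped description in place, one possible way (not shown in the original notebook) to flag actual mismatches is to compare the recorded description against the expected one:
# rows where the recorded description disagrees with the description implied by the type code
# (rows where the type itself is NaN will also be flagged; drop them first if that is not desired)
mismatches = new_df[new_df['Existing Construction Type Description'] != new_df['New Description2']]
print(len(mismatches), "mismatched rows")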
Step 6 : Using Consistent Upper/Lower Cases
While this step may be less relevant to our construction data set, there are many reasons why upper and lower case should be considered during data cleaning. There may have been human error while inputting the data, and some natural language processing (NLP) algorithms treat upper-case text as a signal of emphasis or strong opinion. Therefore, we want to be deliberate about the use of upper and/or lower case in our data set.
One simple way to remove the influence of case is to standardize it: convert every value to the same case, so that all of our data are consistently in, say, lower case.
In the following example, we want to remove the capitalization at the beginning of each street name. To demonstrate, we will first change all the street names to upper case, and then change them all to lower case.
test_row = mydata.iloc[0:5].copy()
test_row["Street Name"] = test_row["Street Name"].str.upper()
print(test_row["Street Name"])
0 ELLIS
1 GEARY
2 PACIFIC
3 PACIFIC
4 MARKET
Name: Street Name, dtype: object
test_row["Street Name"] = test_row["Street Name"].str.lower();print(test_row["Street Name"] )
0 ellis
1 geary
2 pacific
3 pacific
4 market
Name: Street Name, dtype: object
Using this method, there are no more upper-case characters left in the street name data. In the future, NLP algorithms can be applied to this cleaned data set.
Next Step
Can you find out which neighborhood has the most “constr type 3” projects? If not, what more do we need to do?
References
https://www.digitalvidya.com/blog/data-cleaning-techniques/
https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values
https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/
https://stackabuse.com/removing-stop-words-from-strings-in-python/