Data Cleaning

Author: Hao Zheng

It is always amazing to see the insights that data scientists can generate. However, in real life, not all collected data sets can be directly analyzed with existing tools, primarily due to the chaotic nature of data. Data scientists reportedly spend around 80% of their time simply cleaning up data. Therefore, it is important to understand how to manage messy data in order to transform them into organized data that are ready for further processing.

In this Project, we will be introducing six basic techniques that can be applied to most of the data sets that you will encounter. Our goal is to transform extremely messy data into less messy data. It is likely that, after applying these techniques, your data set will still not be completely ready for further processing. This is because each data set has its unique problems, and you may find yourself needing to adjust the application of these data cleaning techniques based on your data set’s problems. However, having these techniques in your back pocket will certainly be an asset as you prepare and clean your data.

We encourage you to create your own Jupyter notebook and follow along. You can also download this notebook together with any associated data in the Notebooks and Data GitHub repository. Alternatively, if you do not have Python or Jupyter Notebook installed yet, you may experiment with a virtual notebook by launching Binder or Syzygy below (learn more about these two tools in the Resource tab).

Launch Syzygy (UBC)

Launch Syzygy (Google)

Launch Binder

Business Problem

Why do we need to clean the data before applying our desired statistical analyses, when we are really interested in generating business insights? In order to answer this question, we need to understand that there is a gap between raw data and the insights that business managers are interested in. Let’s use the retail industry as an example. Questions that are usually asked in the retail industry include:

  • Q1: Which geographic location provides us with the most profit per person?
  • Q2: What type of customers generate the most revenue for us?

However, the raw data set can only record the transactions from customers enrolled in the business’ loyalty program. In the process of recording these transactions, we may miss some of the information that is required to solve the aforementioned problems, such as:

  • The recorded transactions only include the name of the store, and we do not know where some of these stores are located.
  • The customers were not willing to share certain personal information due to privacy concerns.

In such scenarios, if we attempted to directly apply our statistical methods to this raw data set, we would likely have peculiar results on our hands:

  • Q1: The stores in Happyland (perhaps a community with only 1,000 residents) provide us with the most profit per person.
  • Q2: The customers who were not willing to share their gender identity generate the most revenue for us.

The above two insights are only “correct” in a statistical sense. In a business context, they provide little to no value. For example, it would not make sense for us to open up more stores in the small, but profitable, community of Happyland, and knowing that those customers who care about their privacy are the ones who generate the most revenue for us does not provide a good indication of who they truly are. Therefore, we use the data cleaning process to help close the gap between the messy, raw data set and the insights desired by business managers.

Let’s walk through the data cleaning process using an ongoing example for the remainder of the Project. Here, we have a data set that captures the construction developments that occur around different neighbourhoods. Can we determine which neighbourhood has the most “constr type 3” construction projects?

Data Set Check

import pandas as pd
import numpy as np
import warnings

# This step prevents warnings from showing up (for formatting purposes); you can skip this line when practicing on your own
warnings.filterwarnings("ignore")

mydata = pd.read_csv("nb0007_data/Building_Permits.csv")
mydata.head(1)

[Output: the first row of the data frame, 1 row × 43 columns. It includes, among others, Permit Number 201505065519, Permit Type 4 (“sign - erect”), Permit Creation Date 05/06/2015, Street Name Ellis, Existing Construction Type 3.0 (“constr type 3”), Neighborhood Tenderloin and Zipcode 94102.0; several fields, such as Street Number Suffix and the Proposed Construction Type columns, are NaN.]

Here, we have our first glimpse of the data. Before moving into the data cleaning process, it is crucial to grasp what is happening in the data set in order to understand the ways in which our data set is problematic and thus needs to be cleaned. It is obvious that there are several “NaN” values existing in the data set above. We definitely do not want to have NA values in our data set, or else we may run into analytical issues later on when we try and generate business insights from this data set. Let’s start by tackling the extra spaces (Step 1) and the NA values (Step 2) in our example data set.

# get info about all columns
mydata.info()

# create some exploratory visualization
mydata.isna().sum().plot(kind='bar')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198900 entries, 0 to 198899
Data columns (total 43 columns):
 #   Column                                  Non-Null Count   Dtype
---  ------                                  --------------   -----
 0   Permit Number                           198900 non-null  object
 1   Permit Type                             198900 non-null  int64
 2   Permit Type Definition                  198900 non-null  object
 3   Permit Creation Date                    198900 non-null  object
 4   Block                                   198900 non-null  object
 5   Lot                                     198900 non-null  object
 6   Street Number                           198900 non-null  int64
 7   Street Number Suffix                    2216 non-null    object
 8   Street Name                             198900 non-null  object
 9   Street Suffix                           196132 non-null  object
 10  Unit                                    29479 non-null   float64
 11  Unit Suffix                             1961 non-null    object
 12  Description                             198610 non-null  object
 13  Current Status                          198900 non-null  object
 14  Current Status Date                     198900 non-null  object
 15  Filed Date                              198900 non-null  object
 16  Issued Date                             183960 non-null  object
 17  Completed Date                          97191 non-null   object
 18  First Construction Document Date        183954 non-null  object
 19  Structural Notification                 6922 non-null    object
 20  Number of Existing Stories              156116 non-null  float64
 21  Number of Proposed Stories              156032 non-null  float64
 22  Voluntary Soft-Story Retrofit           35 non-null      object
 23  Fire Only Permit                        18827 non-null   object
 24  Permit Expiration Date                  147020 non-null  object
 25  Estimated Cost                          160834 non-null  float64
 26  Revised Cost                            192834 non-null  float64
 27  Existing Use                            157786 non-null  object
 28  Existing Units                          147362 non-null  float64
 29  Proposed Use                            156461 non-null  object
 30  Proposed Units                          147989 non-null  float64
 31  Plansets                                161591 non-null  float64
 32  TIDF Compliance                         2 non-null       object
 33  Existing Construction Type              155534 non-null  float64
 34  Existing Construction Type Description  155534 non-null  object
 35  Proposed Construction Type              155738 non-null  float64
 36  Proposed Construction Type Description  155738 non-null  object
 37  Site Permit                             5359 non-null    object
 38  Supervisor District                     197183 non-null  float64
 39  Neighborhoods - Analysis Boundaries     197175 non-null  object
 40  Zipcode                                 197184 non-null  float64
 41  Location                                197200 non-null  object
 42  Record ID                               198900 non-null  int64
dtypes: float64(12), int64(3), object(28)
memory usage: 65.3+ MB





<matplotlib.axes._subplots.AxesSubplot at 0x11a837550>

[Bar chart: number of missing (NaN) values in each column]

Data Cleaning

Step 1 : Getting Rid Of Extra Spaces


string = "   hello world     "

print(string)
print("="*100)

fresh_string = string.strip()
print(fresh_string)
   hello world
====================================================================================================
hello world

This simple example illustrates that when there are extra spaces before and after a string, we want to ensure that we get rid of them.

The primary reason for wanting to clean our data set by removing the extra spaces is that these extra spaces will lead to both analytical and formatting issues down the road. For example, when performing logistic regression analysis, a string with extra spaces at the beginning, such as " hello", will be treated as a different factor in the system than the trimmed string ("hello"), even though they have the same meaning in real life.

We can apply the same methodology, as illustrated in the code above, to the entire data set.
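For instance, the snippet below is a minimal sketch (not part of the original walkthrough) of how the same idea could be applied to every text column in the data frame at once; select_dtypes and the .str.strip() accessor are standard pandas features, but which columns actually need trimming will depend on your own data set.

# a minimal sketch: strip leading and trailing spaces from every text (object) column
text_cols = mydata.select_dtypes(include="object").columns

for col in text_cols:
    # .str.strip() removes spaces at both ends; missing values (NaN) are left untouched
    mydata[col] = mydata[col].str.strip()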

Step 2 : Replacing All NA Values


The second step is the most obvious step that we should take when it comes to cleaning our data set: dealing with the NA values that exist within the data set. We want to replace these NA values with some value that will not influence the analyses that we wish to perform down the road. To do so, we can apply one of several techniques.

The first method is fast and easy. We can simply substitute all the NA values for 0. This method is advantageous to employ if we wish to sum the values down a column in our future analysis, as zeroes will not influence the resulting summation.

test_row = mydata.iloc[0:5]
test_row.fillna(0, inplace=True)
test_row.head(1)

[Output: the same first row, 1 row × 43 columns, but with every NaN replaced by 0; for example, Street Number Suffix, Proposed Construction Type, Proposed Construction Type Description and Site Permit now contain 0.]

However, this methodology has its limitations. For example, we can see here that the Proposed Construction Type Description cannot be 0, so it would be inappropriate to treat this NA value as a 0. Being a description, the value here should be a string instead of a number. So how do we want to handle scenarios where the data in a column are strings?

There is a simple method that can be applied: we can simply substitute the original NA values in the Proposed Construction Type Description with the value that appears most frequently in that column. This method is particularly useful in future analyses where we will be searching for any outliers within the dataframe. By using the value that appears most frequently in the column, we will, again, not be interfering with future statistical analyses (searching for outliers).

test_row = mydata.iloc[0:100]

# value_counts()[:1].index.tolist()[0] returns the most frequent value appearing in the column
most_frequent = test_row['Proposed Construction Type Description'].value_counts()[:1].index.tolist()[0]

# use the fillna command with inplace=True to replace the NA values with that value
test_row['Proposed Construction Type Description'].fillna(most_frequent, inplace=True)
test_row.head(5)

[Output: the first five rows, 5 rows × 43 columns. The NaN values in Proposed Construction Type Description have been replaced with "wood frame (5)", the most frequent value in that column, while rows that already had a description (e.g. "constr type 1") are unchanged.]

Notice how some of the NAs in the Proposed Construction Type Description column have now been successfully changed to “wood frame (5)”, which is the most frequent value in that column.
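As a side note, pandas also provides Series.mode(), which returns the most frequent value(s) directly. The two lines below are an equivalent, arguably more readable, way to achieve the same fill (an alternative sketch, not part of the original walkthrough):

# alternative sketch: mode()[0] is the most frequent value in the column
most_frequent = test_row['Proposed Construction Type Description'].mode()[0]
test_row['Proposed Construction Type Description'] = test_row['Proposed Construction Type Description'].fillna(most_frequent)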

Step 3 : Converting Strings into Numbers


When we are performing numerical analysis, such as finding the mean value, sometimes the system will return an error indicating that we cannot apply the mathematical command (finding the mean value) to a set of strings. In these cases, we want to make sure our data are in the correct format; that is, that our numbers are recorded as numbers (integers or floats) as opposed to strings, so that the data can be used for further numerical analysis.

Trying to convert strings into numbers in Excel can be a messy process; using Python, however, the conversion is simple. Here, we introduce two simple commands: “float” and “int”.

test_row = mydata.iloc[0:5]
test_row['Permit Type'].astype(float)
0    4.0
1    4.0
2    3.0
3    8.0
4    6.0
Name: Permit Type, dtype: float64
test_row = mydata.iloc[0:5]
test_row['Permit Type'].astype(int)
0    4
1    4
2    3
3    8
4    6
Name: Permit Type, dtype: int64

The “float” command converts a value into a floating-point (decimal) number, while the “int” command converts a decimal number into an integer by dropping the decimal portion. These two commands allow for flexibility in application.
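Note that ‘Permit Type’ in our data set already happens to be stored as a number. When a column has genuinely been read in as strings, possibly containing entries that are not numbers at all, pd.to_numeric is a convenient option. The sketch below is purely illustrative (the values are made up, not taken from our data set); errors='coerce' turns anything that cannot be parsed into NaN instead of raising an error.

# illustrative sketch: converting a column of strings into numbers
cost_as_text = pd.Series(["1500", "2000", "unknown"])   # hypothetical values stored as strings

# "1500" and "2000" become 1500.0 and 2000.0; "unknown" becomes NaN because of errors='coerce'
cost_as_numbers = pd.to_numeric(cost_as_text, errors="coerce")
print(cost_as_numbers)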

Step 4 : Removing Unnecessary Duplicates


Sometimes, rows with the same content appear more than once in our data set, which is usually undesirable. How can we handle these unnecessary duplicates?

# create a sample df with 2 rows and fewer columns (for demonstration)
test_df = mydata[['Permit Number', 'Permit Type', 'Block']].iloc[0:2]
# create a df with double info
dup_df = pd.concat([test_df, test_df])
dup_df.head()

# just to demonstrate, see which rows have duplicates
dup_df['Duplicate'] = dup_df.duplicated()
dup_df

# or, if you have a unique identifier (like the permit number), you can also use it to drop duplicates
# dup_df.drop_duplicates(subset='Permit Number')

   Permit Number  Permit Type Block  Duplicate
0   201505065519            4  0326      False
1   201604195146            4  0306      False
0   201505065519            4  0326       True
1   201604195146            4  0306       True

Here, we see two rows that are duplicated, and we would like to remove these redundant rows. To do so, we use the “drop_duplicates” command to achieve our desired result: a data set that is free of unnecessary duplicates.

dup_df.drop(columns='Duplicate').drop_duplicates()

   Permit Number  Permit Type Block
0   201505065519            4  0326
1   201604195146            4  0306
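If only certain columns identify a record, drop_duplicates also accepts a subset argument, and a keep argument that controls which copy survives. A small sketch, assuming the permit number is a unique identifier:

# sketch: treat rows that share a Permit Number as duplicates and keep the first occurrence
dup_df.drop_duplicates(subset='Permit Number', keep='first')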

Step 5 : Finding Potential Mismatches


The error discussed in this step is not a typical Python error. It refers, rather, to a logical error in the data set, which we can better illustrate with the example below.

test_df1 = mydata.iloc[0:5]
test_df1.head(1)

[Output: the first row of test_df1, 1 row × 43 columns, with Existing Construction Type 3.0 and Existing Construction Type Description "constr type 3".]

So, what can be a potential mismatch in the dataframe?

Within our data set, we see one column labeled “Existing Construction Type” and another column labeled “Existing Construction Type Description”. If we have an Existing Construction Type of 3.0, then we would expect the corresponding Existing Construction Type Description to be “constr type 3”. What if, instead, the Existing Construction Type Description contained “constr type 4”?

In our data set, there is a fixed relationship between the Existing Construction Type and the Existing Construction Type Description. Any rows with differing values can be considered as an error, and we would want to correct this mismatch in our data cleaning process.

We can check our data set to see if there is any mismatching occurring in the first place:

(mydata.groupby(['Existing Construction Type', 'Existing Construction Type Description'])
    .size()
    .reset_index()
    .rename(columns={0:'count'}))

   Existing Construction Type Existing Construction Type Description   count
0                         1.0                           constr type 1   28072
1                         2.0                           constr type 2    4068
2                         3.0                           constr type 3    9663
3                         4.0                           constr type 4     381
4                         5.0                          wood frame (5)  113350
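The counts above suggest that each numeric type lines up with exactly one description. A complementary check, sketched below under the assumption that the mapping shown in the table is the correct one, is to flag the individual rows whose description disagrees with the expected value:

# sketch: flag rows whose description does not match the expected one for their type code
expected = {1.0: "constr type 1", 2.0: "constr type 2", 3.0: "constr type 3",
            4.0: "constr type 4", 5.0: "wood frame (5)"}

expected_descr = mydata["Existing Construction Type"].map(expected)
mismatched = mydata[
    mydata["Existing Construction Type"].notna()
    & (mydata["Existing Construction Type Description"] != expected_descr)
]
print(len(mismatched), "rows have a mismatched description")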

Let’s assume some mismatching problems actually occur in our data set. We can apply the following technique to correct for any mismatching errors between the Existing Construction Type data and the Existing Construction Type Description data:

test_df1["Existing Construction Type"].unique()
print("There are 3 Existing Construction Types, which are 1, 3, and 5")
print()
print("="*100)
print()

def errorcheck(a):
    if a["Existing Construction Type"] == 3.0:
        a["Existing Construction Type Description"] = "constr type 3"
    if a["Existing Construction Type"] == 1.0:
        a["Existing Construction Type Description"] = "constr type 1"
    if a["Existing Construction Type"] == 5.0:
        a["Existing Construction Type Description"] = "wood frame (5)"
# note: rows returned by iterrows() are copies, so we write the corrected
# description back into the DataFrame with .loc
for index, row in test_row.iterrows():
    errorcheck(row)
    test_row.loc[index, "Existing Construction Type Description"] = row["Existing Construction Type Description"]
test_row.head()

[Output: the first five rows of test_row, 5 rows × 43 columns. Each row's Existing Construction Type Description now agrees with its Existing Construction Type code: 3.0 with "constr type 3", 1.0 with "constr type 1" and 5.0 with "wood frame (5)".]

#there are alternative methods for doing this cleaning
#or use apply and lambda with a simplified function

# a simpler function

def constr_type(x):
    if x == 3.0:
        descr = "constr type 3"
    elif x == 1.0:
        descr = "constr type 1"
    elif x == 5.0:
        descr = "wood frame (5)"
    else:
        descr = None
    return(descr)

new_df = test_df1.copy()  # work on a copy so we do not modify test_df1 in place
new_df['New Description'] = new_df['Existing Construction Type'].apply(lambda x: constr_type(x))


# or you can map a dictionary
constr_dict = {1.0: "constr type 1",
              3.0: "constr type 3",
              5.0: "wood frame (5)"}

new_df['New Description2'] = new_df['Existing Construction Type'].map(constr_dict)
new_df

[Output: the data frame with the two added columns, 2 rows × 45 columns. For both rows shown, Existing Construction Type is 3.0, and the new columns New Description and New Description2 both contain "constr type 3".]

Step 6 : Using Consistent Upper/Lower Cases


While this step may be less relevant to our construction data set, there are many reasons why upper and lower case should be considered in data cleaning. There may have been human errors while inputting the data, and some NLP algorithms treat upper-case text as a signal of strong opinions. Therefore, we want to be cautious with the usage of upper and/or lower case in our data set.

One universal way to remove the case influence in data is to first change all values to upper case, and then change them all back to lower case again. This ensures that all our data are, in fact, consistently in lower case.

In the following example, we may want to remove the capitalization at the beginning of each street name. To remove this capitalization, we will first change all the street names into upper case, and then change them back to be in only lower case.

test_row = mydata.iloc[0:5]
test_row["Street Name"] = test_row["Street Name"].str.upper();print(test_row["Street Name"] )
0      ELLIS
1      GEARY
2    PACIFIC
3    PACIFIC
4     MARKET
Name: Street Name, dtype: object
test_row["Street Name"] = test_row["Street Name"].str.lower();print(test_row["Street Name"] )
0      ellis
1      geary
2    pacific
3    pacific
4     market
Name: Street Name, dtype: object

Using this universal method, there are no more upper-case letters left in the street name data. In the future, NLP algorithms can be applied to this cleaned data set.
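If we wanted to apply the same treatment beyond street names, a short sketch (an extension on our part, since the original example only cleans one column) would be to lower-case every text column at once:

# sketch: apply consistent lower casing to every text (object) column
for col in mydata.select_dtypes(include="object").columns:
    mydata[col] = mydata[col].str.lower()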

Next Step


Can you find out which neighborhood has the most “constr type 3” projects? If not, what more do we need to do?
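If you would like a starting point, the sketch below counts “constr type 3” projects per neighbourhood, assuming the question refers to the Existing Construction Type Description column; think about which of the cleaning steps above you would still need to apply before trusting the answer.

# a possible starting point, to be run on the cleaned data set
type3 = mydata[mydata["Existing Construction Type Description"] == "constr type 3"]
type3["Neighborhoods - Analysis Boundaries"].value_counts().head()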
