Monday, September 29, 2014

Data Cleaning and Data Preparation / Tidying - Weather data example

Data Cleaning

The code in R can be reviewed here.

For an effective Data Analysis, it is important that data is cleaned and it is in appropriate structure (tidying) for analysis. Often, this step takes as much as 80% of the effort.
Here are some of the work involved in Data Cleaning:
  • Review the structure and shape of the data.
  • Meta data cleaning
  • Variables and observations cleaning

Reviewing the structure and shape of the data

  • Get the count of Observations (rows) and Variables (columns)
  • Get the names of the variables
  • Get the data types and sample values for variables
  • Review few top and bottom values for the observations
  • Get the summary statistics and distributions of the observations

Meta Data Cleaning

  • Normalize Variable Names
  • Correct variable data formats
  • Identify the roles played by the variables within the data

Variables and observations cleaning

  • Ignore IDs, Outputs, and Missing observations
  • Ignore too many levels, constants
  • Identify and remove correlated variables
  • Identify the subsets of variables that are more effective for modeling
  • Remove observations having missing Target
  • Deal with Missing Values
  • Omitting Observations with missing values
  • Normalize Factors
  • Ensure Target is Categorical for classification models

Data Preparation and tidying

After the data is cleaned, it needs to be structured properly so that it is suitable for analysis. The following typical activities are done in data preparation and tidying:

  • Identify the variables that we will use to build the model
  • Identify the numeric and categorical variables
  • Save Dataset



Explanations:

Normalize Variable Names
Sometimes it is convenient to map all variable names to lower-case. R is case sensitive, so doing this does change the variable names. This can be useful when different upper/lower case conventions are intermixed in names like Incm_tax_PyBl and remembering how to capitalize when interactively exploring the data with 1,000 such variables is an annoyance. We often see such variable names arising when we import data from databases which are often case insensitive.

Identify the roles played by the variables within the data
We need to identify the variables that are not relevant for the data analysis, e.g. date (we could turn it into a seasonal variable which might be useful). Also we remove the second variable (Location) as in the data here it is a constant. We also identify the risk variable, if it is provided—it is a measure of the amount of risk or the importance of an observation with respect to the target variable. The risk is an output variable, and thus should not be used as an input to the modeling.

Ignore IDs, Outputs, Missing observations
Ignore some variables that are irrelevant or inappropriate for modeling.

IDs and Outputs: We start with the identifiers and the risk variable (which is an output variable). These should play no role in the modeling. Always watch out for including output variables as inputs to the modeling. This is one trap I regularly see from beginners.
We might also identify any variable that has a unique value for every observation. These are sometimes identifiers as well and if so are candidates for ignoring.

All Missing: We then remove any variables where all of the values are missing. There are none like this in the weather dataset, but in general across 1,000 variables, there may be some. We first count the number of missing values for each variable, and then list the names of those variables with only missing values.

Many Missing Perhaps we also want to ignore variables with more than 70% of the values missing.

Ignore too many levels, constants
Too Many Levels: We might also want to ignore variables with too many levels. Another approach is to group the levels into a smaller number of levels, but here we simply ignore them
Constants: Ignore variables with constant values.

Identify and remove correlated variables
Here we can identify pairs where we want to keep one but not the other, because they are highly correlated. We will select them manually since it is a judgment call. Normally we might limit the removals to those correlations that are 0.95 or more.

Remove observations having missing Target
Sometimes there may be further operations to perform on the dataset prior to modelling. This can include dealing with missing values, converting variables to their correct type, etc. Here, we remove observations with a missing target.

Deal with Missing Values
Missing values for the variables are an issue for some but not all model builders. For example, randomForest() has not been coded to handle missing values whilst rpart() has a particularly well developed approach to dealing with missing values.

We may want to impute missing values in the data (not always wise to do so). Here we do this using na.roughfix() from randomForest.

Omitting Observations with missing values
We might want to simply remove observations that have missing values. Here na.omit() identifies the rows to omit based on the vars to be included for modelling. This list of rows to omit is stored as the na.action attribute of the returned object. We then remove these observations from the dataset.

Normalize Factors
Some variables will have levels with spaces, and mixture of cases, etc. We may like to normalize the levels for each of the categoric variables. For very large datasets this can take some time and so we may want to be selective.

Ensure Target is Categorical for classification models
For classification models we want to ensure the target is categorical. Often it is 0/1 and hence is loaded as numeric. We could tell our modeler of choice to explicitly do classification, or set the target using as.factor() in the formula, but it is generally cleaner to do this here, and this is a no-op if the target is already categorical.

Identify the numeric and categorical variables
Many cluster analysis algorithms only deal with numeric variables, for example. Here we identify them both by name and by index. Note that when using the index we have to assume the variables always remain in the same order within the dataset and all variables are present. Otherwise the indicies will get out of sync.

No comments:

Post a Comment