Monday, September 29, 2014

Data Cleaning and Data Preparation / Tidying - Weather data example

Data Cleaning

The code in R can be reviewed here.

For effective data analysis, it is important that the data is cleaned and put into an appropriate structure (tidied) for analysis. Often, this step takes as much as 80% of the effort.
Data cleaning typically involves the following kinds of work:
  • Review the structure and shape of the data.
  • Meta data cleaning
  • Variables and observations cleaning

Reviewing the structure and shape of the data

  • Get the count of Observations (rows) and Variables (columns)
  • Get the names of the variables
  • Get the data types and sample values for variables
  • Review a few top and bottom values of the observations
  • Get the summary statistics and distributions of the observations
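
In R, each of these reviews is essentially a one-liner. A minimal sketch, assuming the data has been loaded into a data frame named ds (the name is illustrative):

    dim(ds)       # count of observations (rows) and variables (columns)
    names(ds)     # names of the variables
    str(ds)       # data types and sample values for each variable
    head(ds)      # a few top observations
    tail(ds)      # a few bottom observations
    summary(ds)   # summary statistics and distributions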

Meta Data Cleaning

  • Normalize Variable Names
  • Correct variable data formats
  • Identify the roles played by the variables within the data

Variables and observations cleaning

  • Ignore IDs, Outputs, and Missing observations
  • Ignore variables with too many levels or constant values
  • Identify and remove correlated variables
  • Identify the subsets of variables that are more effective for modeling
  • Remove observations having missing Target
  • Deal with Missing Values
  • Omitting Observations with missing values
  • Normalize Factors
  • Ensure Target is Categorical for classification models

Data Preparation and tidying

After the data is cleaned, it needs to be structured properly so that it is suitable for analysis. The following activities are typical of data preparation and tidying:

  • Identify the variables that we will use to build the model
  • Identify the numeric and categorical variables
  • Save Dataset
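
A minimal sketch of these steps in R, assuming the cleaned data lives in a data frame ds and that a character vector ignore of variables to exclude has been accumulated during cleaning (both names are illustrative; identifying the numeric and categorical variables is sketched in the Explanations below):

    # Identify the variables that we will use to build the model.
    vars <- setdiff(names(ds), ignore)

    # Save the dataset together with the metadata needed for modeling.
    save(ds, vars, file = "weather_clean.RData")
    # Later: load("weather_clean.RData")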



Explanations:

Normalize Variable Names
Sometimes it is convenient to map all variable names to lower case. R is case sensitive, so doing this does change the variable names. This can be useful when different upper/lower case conventions are intermixed in names like Incm_tax_PyBl, and remembering how to capitalize them while interactively exploring data with 1,000 such variables is an annoyance. We often see such names when importing data from databases, which are often case insensitive.
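
For example, a sketch using the assumed data frame ds:

    names(ds)                        # e.g. "Incm_tax_PyBl", ...
    names(ds) <- tolower(names(ds))
    names(ds)                        # e.g. "incm_tax_pybl", ...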

Identify the roles played by the variables within the data
We need to identify the variables that are not relevant for the analysis, e.g. the date (though we could turn it into a seasonal variable, which might be useful). We also remove the second variable (Location), since in the data used here it is a constant. We also identify the risk variable, if one is provided: it is a measure of the amount of risk, or the importance of an observation, with respect to the target variable. The risk is an output variable and thus should not be used as an input to the modeling.
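
A sketch of recording these roles, with hypothetical variable names (date, location, risk_mm):

    # The date is not directly useful for modeling (though it could be
    # turned into a seasonal variable), and location is a constant here.
    ignore <- c("date", "location")

    # The risk variable is an output, so it must not be used as an input.
    risk   <- "risk_mm"              # assumed name; use the one in your data
    ignore <- union(ignore, risk)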

Ignore IDs, Outputs, and Missing observations
Ignore some variables that are irrelevant or inappropriate for modeling.

IDs and Outputs: We start with the identifiers and the risk variable (which is an output variable). These should play no role in the modeling. Always watch out for including output variables as inputs to the modeling. This is one trap I regularly see from beginners.
We might also identify any variable that has a unique value for every observation. These are sometimes identifiers as well and if so are candidates for ignoring.
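
A sketch of spotting identifier-like variables and adding them to the ignore list built above:

    # Variables with a unique value for every observation are often IDs.
    ids    <- names(which(sapply(ds, function(x) length(unique(x)) == nrow(ds))))
    ignore <- union(ignore, ids)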

All Missing: We then remove any variables where all of the values are missing. There are none like this in the weather dataset, but in general across 1,000 variables, there may be some. We first count the number of missing values for each variable, and then list the names of those variables with only missing values.

Many Missing: Perhaps we also want to ignore variables with more than 70% of their values missing.
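
Both checks reduce to counting the missing values per variable. A sketch, continuing with the assumed ds and ignore:

    # Count the number of missing values for each variable.
    mvc <- sapply(ds, function(x) sum(is.na(x)))

    # Variables where every value is missing.
    all.missing  <- names(which(mvc == nrow(ds)))

    # Variables with more than 70% of their values missing.
    many.missing <- names(which(mvc / nrow(ds) > 0.7))

    ignore <- union(ignore, c(all.missing, many.missing))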

Ignore variables with too many levels or constant values
Too Many Levels: We might also want to ignore variables with too many levels. Another approach is to group the levels into a smaller number of levels, but here we simply ignore such variables.
Constants: Ignore variables with constant values.
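
A sketch of both checks; the 20-level threshold is an arbitrary illustration:

    # Factors with too many levels.
    too.many <- names(which(sapply(ds, function(x) is.factor(x) && nlevels(x) > 20)))

    # Variables with a constant value.
    constants <- names(which(sapply(ds, function(x) length(unique(x)) == 1)))

    ignore <- union(ignore, c(too.many, constants))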

Identify and remove correlated variables
Here we identify pairs of highly correlated variables where we want to keep one but not the other. We select them manually, since it is a judgment call. Normally we might limit removals to pairs with a correlation of 0.95 or more.
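
A sketch of listing the highly correlated pairs so they can be reviewed manually:

    # Correlations between the numeric variables (pairwise, ignoring NAs).
    numc <- which(sapply(ds, is.numeric))
    cc   <- cor(ds[numc], use = "pairwise.complete.obs")

    # Blank the lower triangle and diagonal so each pair appears once.
    cc[lower.tri(cc, diag = TRUE)] <- NA

    # Pairs with an absolute correlation of 0.95 or more.
    high <- which(abs(cc) >= 0.95, arr.ind = TRUE)
    data.frame(var1 = rownames(cc)[high[, 1]],
               var2 = colnames(cc)[high[, 2]],
               cor  = cc[high])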

Remove observations having missing Target
Sometimes there may be further operations to perform on the dataset prior to modeling. These can include dealing with missing values, converting variables to their correct types, etc. Here, we remove observations with a missing target.
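
A sketch, with the target name as an assumption:

    target <- "rain_tomorrow"        # assumed target name
    ds     <- ds[!is.na(ds[[target]]), ]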

Deal with Missing Values
Missing values in the variables are an issue for some but not all model builders. For example, randomForest() has not been coded to handle missing values, whilst rpart() has a particularly well-developed approach to dealing with them.

We may want to impute missing values in the data (though it is not always wise to do so). Here we do this using na.roughfix() from randomForest.
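
A sketch of the imputation; na.roughfix() replaces NAs with the column median for numeric variables and the most frequent level for factors:

    library(randomForest)

    # Impute missing values in the modeling variables only.
    ds[vars] <- na.roughfix(ds[vars])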

Omitting Observations with missing values
We might want to simply remove observations that have missing values. Here na.omit() identifies the rows to omit based on the variables to be included for modeling. The list of omitted rows is stored as the na.action attribute of the returned object. We then remove these observations from the dataset.
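
A sketch of that pattern:

    # na.omit() records the omitted row numbers in its na.action attribute.
    omit <- attr(na.omit(ds[vars]), "na.action")

    # Remove those observations from the dataset.
    if (length(omit) > 0) ds <- ds[-omit, ]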

Normalize Factors
Some variables will have levels with spaces, a mixture of cases, etc. We may like to normalize the levels of each categorical variable. For very large datasets this can take some time, so we may want to be selective.
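
A sketch of one way to do this, mapping levels to lower case and replacing runs of spaces with underscores:

    factors <- names(which(sapply(ds, is.factor)))
    for (v in factors)
      levels(ds[[v]]) <- tolower(gsub(" +", "_", levels(ds[[v]])))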

Ensure Target is Categorical for classification models
For classification models we want to ensure the target is categorical. Often it is 0/1 and hence loaded as numeric. We could tell our modeler of choice to explicitly do classification, or set the target using as.factor() in the formula, but it is generally cleaner to do this here; it is a no-op if the target is already categorical.
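
A sketch, with the target name again an assumption:

    # A no-op if the target is already a factor.
    ds[[target]] <- as.factor(ds[[target]])
    summary(ds[[target]])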

Identify the numeric and categorical variables
For example, many cluster analysis algorithms deal only with numeric variables. Here we identify them both by name and by index. Note that when using the index we have to assume the variables always remain in the same order within the dataset and that all variables are present; otherwise the indices will get out of sync.
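
A sketch of both forms, by name and by index:

    # By name.
    numerics     <- names(which(sapply(ds[vars], is.numeric)))
    categoricals <- names(which(sapply(ds[vars], is.factor)))

    # By index into vars (valid only while the variable order is unchanged).
    numi <- which(sapply(ds[vars], is.numeric))
    cati <- which(sapply(ds[vars], is.factor))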


Typical phases of a Data Analytics project

 

a. Discovery

What is the question to be answered, and does an adequate data set exist such that a meaningful analysis can be performed?

b. Data acquisition

Here you will be gathering and staging the data in your environment. You should always try to start with data that has not been processed in any way; that is, start with the raw data. If you cannot get the raw data, you need to understand how the data was transformed and/or processed.

c. Data Cleansing

Raw data will almost always need to be cleansed. Tidy Data best practices should always be applied.

d. Data Munging

Alter the names of variables, split variables into two or more, reshape the data, look for outliers and remove them if necessary, etc. During this phase, the data is split into “training” and “test” sets if applicable.

e. Data Exploration and Model Planning

Initial exploratory graphs, look for correlations in multivariate data, etc. This is the step where you begin to understand which types of models will be used to answer the questions asked of the data.

f. Data Analysis and Model Building

The training data is analyzed with the selected models, results are recorded, and patterns take shape. Final models are then applied (when applicable) to the “test” data.

g. Interpretation and scrutinization of results

During this time you ask yourself whether the results make sense, whether the correct models were applied, and whether the test data set was complete: in short, “are these results valid?” Always question the outcome of your analysis.

h. Analysis write-up

Write up a clear and methodical report of the sequence of events that culminated in the final analysis. Note that you do not need to include every single step of your analysis; include only those steps that tell a complete but compact story. It is critical that your write-up is written for the intended audience. If it is for non-technical people, ensure that the message is written in language that is easily digestible by them. If the audience is technical, such as data scientists, then ensure the details of your modeling are described. Know your audience and write to them in a way that ensures your work is understood.

i. Operationalize

Produce your report, scripts, code, and supporting technical documentation.  Run a pilot experiment and implement your models in a production environment.
