openSAP: Getting Started with Data Science (Edition 2021)
References:
Introduction
SAP HANA Predictive Analysis Library
- Association Analysis
- Classification Analysis
- Regression
- Cluster Analysis
- Time Series Analysis
- Probability Distribution
- Outlier Detection
- Link Prediction
- Data Preparation
- Statistic Functions (Univariate)
- Statistic Functions (Multivariate)
Introduction to Project Methodologies
Phase 1.1: Determine Business Objectives
▪ Task
− The first objective of the data analyst is to thoroughly understand, from a business perspective, what the client really wants to accomplish.
Phase 1.2: Assess Situation
▪ Task
− In the previous task, your objective was to get quickly to the crux of the situation. Here, you want to flesh out the details.
▪ Outputs
− Inventory of resources
− Requirements, assumptions, and constraints
− Risks and contingencies
− Terminology
− Costs and benefits
Phase 1.3: Determine Data Science Goals
▪ Task
− A business goal states objectives in business terminology, for example, "reduce customer churn by 10%".
− A data science goal states project objectives in technical terms, for example, "predict which customers are likely to churn, based on their transaction history".
▪ Outputs
− Describe data science goals
− Define data science success criteria
Phase 1.4: Produce Project Plan
▪ Task
− Describe the intended plan for achieving the data science goals and thereby achieving the business goals.
▪ Outputs
− Project plan with project stages, duration, resources, etc.
− Initial assessment of tools and techniques
Defining Project Success Criteria
▪ The accuracy and robustness of the model are two major factors that determine the quality of the prediction, which reflects how successful the model is.
▪ Accuracy is often the starting point for analyzing the quality of a predictive model, as well as an obvious criterion for prediction. Accuracy measures the ratio of correct predictions to the total number of cases evaluated.
▪ The robustness of a predictive model refers to how well the model works on alternative data. This might be hold-out data or new data that the model is to be applied to. Robustness enables you to assess how confident you can be in the prediction.
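A minimal sketch of both criteria in Python (the dataset, model, and split are illustrative assumptions, not part of the course material):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the data to check robustness on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

# Accuracy = correct predictions / total cases evaluated.
print("Training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

A large gap between training and hold-out accuracy is a warning sign that the model is not robust.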
Initial Data Analysis
Initial data analysis
▪ “Initial data analysis (IDA) is an essential part of nearly every analysis”
Problem Solving: A Statistician's Guide, Christopher Chatfield
▪ Chatfield defines the various steps in IDA. It includes analysis of:
− The structure of the data
− The quality of the data
• errors, outliers, and missing observations
− Descriptive statistics
− Graphs
▪ The data is modified according to the analysis:
− Adjust extreme observations, estimate missing observations, transform variables, bin data, form new variables
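A minimal sketch of these IDA steps in pandas (the columns and values are illustrative assumptions):

import pandas as pd

# Illustrative data; 230 looks like an entry error, and two values are missing.
df = pd.DataFrame({"age": [34, 29, None, 41, 230],
                   "income": [52000, 48000, 61000, None, 58000]})

print(df.describe())      # descriptive statistics
print(df.isna().sum())    # missing observations per column

# Adjust the extreme observation by capping at a plausible bound,
# and estimate missing observations with a simple median fill.
df["age"] = df["age"].clip(upper=100)
df = df.fillna(df.median(numeric_only=True))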
Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analyzing data for the purpose of formulating hypotheses that are worth testing.
▪ The objectives of EDA are to:
− Suggest hypotheses about the causes of observed
phenomena
− Assess assumptions on which the analysis and
statistical inference will be based
− Support the selection of appropriate statistical tools
and techniques
− Provide a basis for further data collection through
surveys or experiments
Data Preparation
Phase 3: Outputs
▪ Dataset
‒ This is the dataset (or datasets) produced by the Data Preparation phase, which will be used for modeling or the major analysis work of the project.
▪ Dataset description
‒ Describes the dataset (or datasets) that will be used for the modeling or the major analysis work of the project.
Phase 3.1: Select Data
▪ Task
‒ Decide on the data to be used for analysis.
‒ Selection criteria include relevance to the data science goals, data quality, and technical constraints such as limits on data volume or data types.
‒ Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.
▪ Output – Rationale for inclusion/exclusion
‒ List the data to be included/excluded and the reasons for these decisions.
Phase 3.2: Clean Data
▪ Task
‒ Raise the data quality to the level required by the selected analysis techniques.
‒ This may involve selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling.
▪ Output – Data cleaning report
‒ Describe what decisions and actions were taken to address the data quality problems reported during the Verify Data Quality task of the Data Understanding phase.
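A minimal sketch of the "suitable defaults" and the more ambitious model-based estimation mentioned above, assuming scikit-learn imputers and illustrative columns:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"tenure": [12, 5, np.nan, 30],
                   "spend": [200.0, np.nan, 150.0, 410.0]})

# Suitable default: replace missing values with the column median.
simple = SimpleImputer(strategy="median")
df_simple = pd.DataFrame(simple.fit_transform(df), columns=df.columns)

# More ambitious: estimate missing values from the k nearest records.
knn = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn.fit_transform(df), columns=df.columns)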
Phase 3.3: Construct Data
▪ Task
‒ This task includes constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes.
▪ Output – Derived attributes
‒ Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. Example: area = length * width (see the sketch after this list).
▪ Output – Generated records
‒ Describe the creation of completely new records.
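A minimal sketch of the derived-attribute example above in pandas (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({"length": [2.0, 3.5], "width": [1.0, 2.0]})

# Derived attribute constructed from existing attributes in the same record.
df["area"] = df["length"] * df["width"]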
Phase 3.4: Integrate Data
▪ Task
‒ These are methods whereby information is combined from multiple tables or records to create new records or values.
▪ Output – Merged data
‒ Merging tables refers to joining together two or more tables that have different information about the same objects.
‒ Merged data also covers aggregations.
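A minimal sketch of merging, assuming two illustrative pandas tables that hold different information about the same customers (an aggregation sketch follows in the Data Manipulation section):

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "region": ["EMEA", "APJ"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 50, 80]})

# Merging joins the tables on the shared identifier.
merged = customers.merge(orders, on="customer_id", how="left")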
Phase 3.5: Format Data
▪ Task
‒ Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning but might be required by the modeling tool.
▪ Output – Reformatted data
‒ Some tools have requirements on the order of the attributes, such as the first field being a unique identifier for each record or the last field being the outcome field the model is to predict.
Predictive Modeling Methodology – Overview
Sometimes when you design a model, you might want to build in a “latency” period.
Many datasets used for predictive modeling have the following structure:
▪ Historic Data: (in the past, compared to the reference date) with dynamic data computed in relation to the reference date. Usually short-term, mid-term, and long-term indicators.
▪ Latency Period: (starting after the reference date) a period where no data is collected. This is used to represent the time required by the business to collect new data, apply the model, produce the scores, and define the campaign. Not all predictive models require a latency period, although many churn models will.
▪ Target: (starting after the reference date + latency period) a period where the targeted behavior is observed.
Data Manipulation
The first step is to identify the “entity” for the analysis.
− An entity is the object targeted by the planned analytical task.
− It may be a customer, a product, a store, etc., and is usually identified by a unique identifier.
− The entity defines the “granularity” of the analysis.
The analytical record is a 360° view of each entity, collecting all of the static and dynamic data together that can be used to define the entity.
Binning is one of the fundamental feature engineering techniques.
▪ The original data values which fall into a given small interval, a bin, are replaced by a value representative of that interval, often the central value.
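A minimal sketch of binning with pandas (equal-width bins on illustrative ages, replacing each value by its bin's central value):

import pandas as pd

ages = pd.Series([22, 37, 45, 61, 29, 53])

# Assign each value to one of three equal-width bins.
bins = pd.cut(ages, bins=3)

# Replace each original value with the central value of its bin.
centers = bins.apply(lambda interval: interval.mid)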
When you know that there can be multiple entries for one individual in a transaction table, for example, you have to compute an aggregate to avoid creating duplicates in your dataset.
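A minimal sketch of such an aggregation, assuming an illustrative transactions table keyed by a customer_id entity:

import pandas as pd

tx = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2],
                   "amount": [100, 50, 80, 20, 30]})

# Aggregate to one row per entity, so joining this into the analytical
# record cannot create duplicates.
per_customer = tx.groupby("customer_id")["amount"].agg(
    total_amount="sum", n_transactions="count").reset_index()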
Data Encoding
The data encoding process handles missing values in the data, deals with outliers, and creates data bins or bands to transform raw data into a “mineable” source of information.
Selecting Data – Variable and Feature Selection
“Feature selection” is the process of selecting a subset of relevant explanatory variables or predictors for use in data science model construction.
▪ It is also known as variable selection, attribute selection, or variable subset selection.
▪ Often, data contains many features that are either redundant or irrelevant, and can be removed without incurring much loss of information.
▪ Remember that domain knowledge can be the best selection criterion of all!
Backward elimination
1. Backward elimination starts with all candidate features.
2. Test the deletion of each feature using the chosen
model comparison criterion, deleting the feature (if any)
that improves the model the most by being deleted.
3. Repeat this process until no further improvement is
possible.
Forward selection
1. Forward selection starts with no features in the
model.
2. Test the addition of each feature using the chosen
model comparison criterion.
3. Add the feature (if any) that improves the model the
most.
4. Repeat this process until no other feature additions
improve the model.
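A minimal sketch of both procedures, assuming scikit-learn's SequentialFeatureSelector as the implementation (the course does not prescribe a tool; note that this implementation stops at a fixed number of features rather than when no further improvement is possible):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Forward selection: start with no features, add the one that helps most.
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward").fit(X, y)

# Backward elimination: start with all features, delete the least useful.
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward").fit(X, y)

print("Forward picks:", forward.get_support(indices=True))
print("Backward keeps:", backward.get_support(indices=True))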
Classification Analysis with Decision Trees
▪ Strengths
– The tree-type output is very visual and easy to understand
– They are able to produce ‘understandable’ rules
– They can perform classification without requiring much computation
– They can handle both continuous and categorical variables
– They provide a clear indication of variable importance
▪ Weaknesses
– Clearly sensitive to the ‘first split’
– Some decision tree algorithms require binary target variables
– They can be computationally expensive
– They generally examine just a single field at a time
– They are prone to over-fitting
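A minimal sketch illustrating these strengths, assuming scikit-learn's DecisionTreeClassifier on an illustrative dataset: export_text prints the tree as understandable rules, feature_importances_ gives the variable importance, and capping max_depth is one simple guard against over-fitting.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# 'Understandable' rules: the tree printed as nested if/else conditions.
print(export_text(tree))

# Clear indication of variable importance.
print(tree.feature_importances_)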
Classification Analysis with KNN, NN, and SVM
▪ Strengths
‒ They can handle a wide range of problems
‒ They can produce good results even in complex non-linear
domains
‒ They can handle both categorical and continuous variables
▪ Weaknesses
‒ Black box – hard to explain results
‒ NNs need large amounts of data
‒ Computationally expensive
‒ Potential to over-fit
‒ No hard and fast rule to determine best network structure
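A minimal sketch comparing KNN and SVM on the same task, assuming scikit-learn (illustrative, not the course's tool); both methods are distance/kernel based, so features are scaled first:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale features, then fit each classifier.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

print("KNN:", cross_val_score(knn, X, y, cv=5).mean())
print("SVM:", cross_val_score(svm, X, y, cv=5).mean())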
Time Series Analysis
Stationarity, trend, and seasonality
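A minimal sketch of separating trend and seasonality, assuming statsmodels' seasonal_decompose on a synthetic monthly series (the data and tool choice are illustrative, not prescribed by the course):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with an upward trend and yearly seasonality.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
values = np.linspace(10, 30, 48) + 5 * np.sin(2 * np.pi * idx.month / 12)
series = pd.Series(values, index=idx)

# Split the series into trend, seasonal, and residual components.
result = seasonal_decompose(series, model="additive")
print(result.trend.dropna().head())
print(result.seasonal.head())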