Introduction) Human Resource Analytics
Questions:
- So why do our best and most experienced employees leave prematurely?
- How can we normalize effort levels based on employee salary?
- Would a better salary and work environment improve effort/satisfaction levels?
- Which sales industry performs better? What would happen if you were to introduce competition?
- Can you analyze the decisions or preferences made by your employees?
Variables (that I define for clarification):
- satisfaction_level: Satisfaction level from 0.09 to 1, probably filled in by the employee.
- last_evaluation: Last evaluation (performance) level from 0.36 to 1, probably filled in by the employee's manager.
- number_project: Number of projects the employee has worked on.
- average_montly_hours: The average number of hours the employee worked in a month (the dataset itself misspells "monthly").
- time_spend_company: Could this be the number of years an employee has worked for this company?
- Work_accident: Whether the employee has had an accident or not (0 = no, 1 = yes).
- left: Whether the employee has left or not (0 = no, 1 = yes).
- promotion_last_5years: Whether the employee was promoted in the last five years or not (0 = no, 1 = yes).
- sales: The different department sectors of this company.
- salary: Three-level salary (low, medium, high).
Descriptive Analytics) Exploring the Data
First I want to import and explore the Human Resource dataset.
library(dplyr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(randomForest)

HRanalytics <- read.csv("HR_analytics.csv", header = TRUE)
Always check for null values. In this example we have specifically chosen a clean dataset, so there are no null values. However, not all datasets are clean, and NA values can be a problem. When dealing with an NA value you are left with three options: keep it, remove it, or replace it. There are several techniques for handling null values, but we will cover them later down the line.
sum(is.na(HRanalytics))
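To make the three options concrete, here is a minimal sketch on a hypothetical toy vector (the HR dataset itself has no missing values, so this data is made up for illustration):

```r
# Toy vector with two missing values
x <- c(0.38, NA, 0.72, 0.64, NA)

kept     <- x                        # 1) keep: leave NAs, use na.rm where needed
removed  <- x[!is.na(x)]             # 2) remove: drop the missing entries
replaced <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # 3) replace: mean imputation

mean(kept, na.rm = TRUE)   # same mean either way with mean imputation
length(removed)            # 3 values survive
sum(is.na(replaced))       # 0 after imputation
```

Mean imputation is the simplest replacement strategy; it preserves the column mean but shrinks the variance, which is why fancier techniques exist.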
Let’s start making comparisons:
summary(HRanalytics)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
## There are several ways of looking at grouped data [fastest]
table(HRanalytics$sales)
##
## accounting hr IT management marketing product_mng
## 767 739 1227 630 858 902
## RandD sales support technical
## 787 4140 2229 2720
## dplyr
HRanalytics %>%
count(left)
## # A tibble: 2 × 2
## left n
## <int> <int>
## 1 0 11428
## 2 1 3571
## plotting
plot(HRanalytics$salary)
## pairs [Slowest]
pairs(HRanalytics)
Predictive Analytics) Employee Retention
Training and Testing
## 80/20 train/test split
train <- HRanalytics[1:12000, ]
test  <- HRanalytics[12001:nrow(HRanalytics), ]
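One caveat: a sequential split like the one above assumes the rows are already in random order; if the file happens to be sorted (say, by department), the test set will not be representative. A randomized split is a safer default. A sketch on a toy stand-in data frame (the seed value is arbitrary, chosen only for reproducibility):

```r
set.seed(42)                               # reproducible shuffle
df <- data.frame(id = 1:100)               # toy stand-in for HRanalytics

train_idx <- sample(nrow(df), 0.8 * nrow(df))  # draw 80% of row indices at random
train_df  <- df[train_idx, , drop = FALSE]
test_df   <- df[-train_idx, , drop = FALSE]

nrow(train_df)  # 80
nrow(test_df)   # 20
```

The negative-index trick `df[-train_idx, ]` guarantees the two sets are disjoint and together cover every row.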
Decision Trees
fit <- rpart(left ~ satisfaction_level + last_evaluation + number_project +
               average_montly_hours + time_spend_company +
               promotion_last_5years + sales + salary,
             data = train, method = "class")
pred <- predict(fit, test, type = "class")
summary(fit)
## 0 = stayed, 1 = left
rpart.plot(fit, extra = 2, shadow.col ="gray")
Now we need to test its accuracy
printcp(fit)
##
## Classification tree:
## rpart(formula = left ~ satisfaction_level + last_evaluation +
## number_project + average_montly_hours + time_spend_company +
## promotion_last_5years + sales + salary, data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] average_montly_hours last_evaluation number_project
## [4] satisfaction_level time_spend_company
##
## Root node error: 2000/12000 = 0.16667
##
## n= 12000
##
## CP nsplit rel error xerror xstd
## 1 0.16400 0 1.0000 1.0000 0.0204124
## 2 0.06375 3 0.4220 0.4220 0.0140057
## 3 0.05400 6 0.2270 0.2290 0.0104943
## 4 0.02000 7 0.1730 0.1750 0.0092167
## 5 0.01700 8 0.1530 0.1685 0.0090490
## 6 0.01050 9 0.1360 0.1395 0.0082540
## 7 0.01000 10 0.1255 0.1330 0.0080639
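printcp reports cross-validated error within the training set; a more direct accuracy check uses the held-out test set: tabulate `predict(fit, test, type = "class")` against `test$left`. A minimal sketch of that bookkeeping, with made-up prediction vectors standing in for the model output:

```r
# Hypothetical stand-ins for predict(fit, test, type = "class") and test$left
pred   <- factor(c(0, 0, 1, 1, 0, 1, 0, 0), levels = c(0, 1))
actual <- factor(c(0, 0, 1, 0, 0, 1, 1, 0), levels = c(0, 1))

confusion <- table(predicted = pred, actual = actual)
confusion                          # off-diagonal cells are the mistakes

accuracy <- mean(pred == actual)   # proportion of correct predictions
accuracy                           # 0.75 on these toy vectors
```

The confusion matrix is more informative than accuracy alone here, because `left` is imbalanced (roughly 24% of employees left), so a model that always predicts "stayed" already scores about 76%.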
Random Forest
fit2 <- randomForest(left~satisfaction_level+last_evaluation+number_project+average_montly_hours+time_spend_company+promotion_last_5years+sales+salary,data=train)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
print(fit2)
##
## Call:
## randomForest(formula = left ~ satisfaction_level + last_evaluation + number_project + average_montly_hours + time_spend_company + promotion_last_5years + sales + salary, data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.01532125
## % Var explained: 88.97
importance(fit2)
## IncNodePurity
## satisfaction_level 526.17718
## last_evaluation 194.39069
## number_project 295.31888
## average_montly_hours 246.58314
## time_spend_company 281.10927
## promotion_last_5years 1.69423
## sales 23.11345
## salary 13.82964
plot(fit2)
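The warning above appears because `left` is stored as numeric 0/1, so randomForest fits a regression forest. Converting the response to a factor first yields a true classification forest. A sketch of just the conversion on a toy vector (refitting is left as-is here, since the output above comes from the regression fit):

```r
left_numeric <- c(0, 1, 0, 0, 1)        # toy stand-in for train$left
left_factor  <- as.factor(left_numeric) # what randomForest needs for classification

is.factor(left_factor)   # TRUE
levels(left_factor)      # "0" "1"

# e.g. train$left <- as.factor(train$left) before calling randomForest()
```

With a factor response, randomForest reports an out-of-bag error rate and a confusion matrix instead of mean squared residuals.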
fit3 <- rpart(left ~ satisfaction_level + number_project + average_montly_hours +
                time_spend_company, data = train, method = "class")
pred <- predict(fit3, test, type = "class")
## 0 = stayed, 1 = left
rpart.plot(fit3, extra = 2, shadow.col ="gray")
Overall, what we have learned from our tree models is that satisfaction_level is the strongest predictor of whether an employee leaves. It is followed by the number of projects assigned, time spent at the company, and average monthly hours. Employees start to leave under these conditions: four or more projects combined with low satisfaction, or three projects with extremely low satisfaction. The other condition is people working on fewer projects who have incurred a lot of overtime.
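Those conditions can be written down as a simple at-risk filter. A sketch on toy rows; the cutoffs used below are illustrative assumptions, not the tree's fitted split points (those appear in the rpart.plot output):

```r
# Toy employee rows; thresholds (0.46, 0.11, 242) are hypothetical examples
emp <- data.frame(
  satisfaction_level   = c(0.10, 0.50, 0.85, 0.09),
  number_project       = c(6,    4,    2,    3),
  average_montly_hours = c(250,  180,  200,  140)
)

at_risk <- with(emp,
  (number_project >= 4 & satisfaction_level < 0.46) |   # overloaded and unhappy
  (number_project == 3 & satisfaction_level < 0.11) |   # extremely low satisfaction
  (number_project <= 3 & average_montly_hours > 242))   # few projects, heavy overtime

sum(at_risk)   # 2 of the 4 toy rows match
```

In practice you would read the real split points off the plotted tree and apply the same kind of filter to flag current employees.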
Data Visualization
GGPlot
ggplot(HRanalytics, aes(x = satisfaction_level)) +
  geom_density(aes(colour = salary, fill = salary), alpha = 0.5)
ggplot(HRanalytics, aes(satisfaction_level, average_montly_hours)) +
  geom_density2d(aes(color = factor(left)))