Description

Instructions: Write your R code in a new .ipynb file. For each problem below, create a new markdown cell and paste the problem text into it. Then add a new R code cell and write the code that the question asks for. When a question asks for analysis, add another markdown cell and write your response there.

1: (a) Select the following column vectors of interest: Age, Race1, Education, Poverty, BMI, Pulse, BPSysAve, PhysActive, Diabetes.

(b) Remove any NA values using na.omit, then report the number of valid rows of data for each gender (‘male’ and ‘female’) by plotting a histogram of the gender column vector.

(c) Is this distribution even? If not, why do you think this is the case?
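As a minimal sketch of the NA removal and per-gender count in part (b), the snippet below uses a small made-up data frame rather than the real data set; the column values are hypothetical. Because hist() expects numeric input, the sketch assumes a bar chart of table() counts is an acceptable way to plot a categorical column.

```r
# Toy data frame standing in for the real data set (hypothetical values).
df <- data.frame(
  Gender = factor(c("male", "female", "female", "male", "female", NA)),
  Age    = c(34, 51, NA, 45, 29, 60),
  BMI    = c(27.1, NA, 22.4, 30.2, 24.8, 26.0)
)

clean  <- na.omit(df)           # drop every row containing at least one NA
counts <- table(clean$Gender)   # number of valid rows per gender
print(counts)

# hist() needs numeric data, so the counts of the factor are drawn as bars.
barplot(counts, main = "Valid rows by gender", ylab = "Count")
```

Note that na.omit drops a row if any selected column is NA, so the per-gender counts depend on which columns were kept in part (a).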

2: (a) Build a multiple logistic regression classifier (using the glm function) on the Diabetes vector with a 10% test split.
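One possible shape for a logistic regression with a 10% hold-out is sketched below. It runs on synthetic data, not the assignment data: the two predictors, the coefficients used to generate the labels, the seed, and the 0.5 threshold are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the split is reproducible
n <- 500

# Synthetic stand-in data (hypothetical): outcome depends on Age and BMI.
toy <- data.frame(Age = runif(n, 20, 80), BMI = rnorm(n, 27, 5))
p <- plogis(-8 + 0.06 * toy$Age + 0.12 * toy$BMI)
toy$Diabetes <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))

# Hold out 10% of the rows for testing.
test_idx <- sample(seq_len(n), size = round(0.10 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# family = binomial gives logistic regression; with a factor response,
# glm models the probability of the second level ("Yes").
fit <- glm(Diabetes ~ Age + BMI, data = train, family = binomial)

# Predicted probabilities, thresholded at 0.5 to obtain class labels.
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

accuracy <- mean(pred == test$Diabetes)
print(table(Predicted = pred, Actual = test$Diabetes))
print(accuracy)
```

With an imbalanced outcome, a model can reach high accuracy while rarely predicting "Yes", so the confusion table is worth reading alongside the accuracy figure.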

(b) Build a random forest classifier (using the randomForest library) with a 10% test split. Which model has better accuracy?
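A random forest on a 10% hold-out might look like the sketch below. It assumes the randomForest package is installed and again uses synthetic data; the predictors, seed, and ntree = 200 are illustrative choices, not values taken from the assignment.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)  # arbitrary seed so the split and the forest are reproducible
n <- 500

# Synthetic stand-in data (hypothetical): outcome depends on Age and BMI.
toy <- data.frame(Age = runif(n, 20, 80), BMI = rnorm(n, 27, 5))
p <- plogis(-8 + 0.06 * toy$Age + 0.12 * toy$BMI)
toy$Diabetes <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))

# Hold out 10% of the rows for testing.
test_idx <- sample(seq_len(n), size = round(0.10 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# A factor response makes randomForest fit a classification forest.
rf <- randomForest(Diabetes ~ Age + BMI, data = train, ntree = 200)

# predict() on a classification forest returns predicted class labels.
pred <- predict(rf, newdata = test)
accuracy <- mean(pred == test$Diabetes)
print(accuracy)
```

For a fair comparison between two models, both should be evaluated on the same test rows, for example by reusing one set of test indices.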

3: (a) Now repeat part [2] but do it for each gender separately by partitioning on the gender column vector.

(b) Which is the better classifier now, and are the separate gender accuracy measures the same or is one better? (Make sure to use the baseline distribution of each gender to support your finding.)

(c) Why do you think that is the case?
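One way to read "baseline distribution" in part (b) is the majority-class baseline: the accuracy obtained by always predicting the most common Diabetes label within a gender. That reading is an assumption, and the labels below are made up purely to show the calculation.

```r
# Hypothetical toy labels; the real counts come from the cleaned data.
toy <- data.frame(
  Gender   = c("male", "male", "male", "male",
               "female", "female", "female", "female", "female", "female"),
  Diabetes = c("No", "No", "No", "Yes",
               "No", "No", "No", "No", "No", "Yes")
)

# Majority-class baseline per gender: share of the most frequent label.
baseline <- sapply(split(toy$Diabetes, toy$Gender), function(labels) {
  max(table(labels)) / length(labels)
})
print(baseline)
```

A classifier whose test accuracy does not exceed its gender's baseline has not learned more than the class proportions, which is why the two accuracies should be compared against their own baselines rather than against each other directly.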

4: (a) Now repeat part [3] using SVM methods (with the e1071 library).

(b) Were you able to improve upon any of the gender-specific models with SVM?

(c) Why do you think that is the case?
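A per-gender SVM fit could be sketched as follows. This assumes the e1071 package is installed and uses synthetic data; the helper svm_accuracy, the predictors, the seed, and the radial kernel are illustrative assumptions rather than requirements of the assignment.

```r
library(e1071)  # assumes the e1071 package is installed; provides svm()

set.seed(1)  # arbitrary seed for reproducibility
n <- 400

# Synthetic stand-in data (hypothetical) with a Gender column to partition on.
toy <- data.frame(
  Gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = runif(n, 20, 80),
  BMI    = rnorm(n, 27, 5)
)
p <- plogis(-8 + 0.06 * toy$Age + 0.12 * toy$BMI)
toy$Diabetes <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))

# Hypothetical helper: 10% hold-out accuracy of an SVM fitted on one subset.
svm_accuracy <- function(d) {
  test_idx <- sample(seq_len(nrow(d)), size = round(0.10 * nrow(d)))
  # A factor response makes svm() perform classification.
  fit  <- svm(Diabetes ~ Age + BMI, data = d[-test_idx, ], kernel = "radial")
  pred <- predict(fit, newdata = d[test_idx, ])
  mean(pred == d$Diabetes[test_idx])
}

# Fit and evaluate one model per gender.
acc <- sapply(split(toy, toy$Gender), svm_accuracy)
print(acc)
```

By default svm() scales the predictors before fitting, which matters when columns such as Age and BMI are on different ranges.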