Here are the links to the GitHub notebooks for this part of the assignment:
- "Anja_H1_Question_2_KNN_2Features.ipynb": the analysis and selection of the "k" hyperparameter for k-Nearest Neighbours on the Iris dataset from sklearn, using only two features (the sepal ones) to classify the three species.
- "Anja_H1_Question_2_KNN_4Features.ipynb": the same k-Nearest Neighbours analysis, but using all four features to predict the species.
Highlights of Learning
Steps
- Defining a data set from the Iris dataset
- Splitting training and test set
- Creating scatter plot to visualize
- Creating function that will print out the features from the integers in the dataset
- Creating Euclidean distance function
- Printing out nearest neighbours
- Creating a heatmap of distances between the test set and the training set
- Creating a line graph to show test accuracy based on k
Defining a data set from the Iris dataset (first with 2 features, then with 4):

Taking the first two features from the data matrix:

    X = iris.data[:, :2]
    y = iris.target  # The class labels

Extending the matrix to all four features:

    X = iris.data[:, :4]
    y = iris.target
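Before either snippet, the notebooks presumably load the bundled Iris data from sklearn; a minimal sketch of that step:

    from sklearn.datasets import load_iris

    iris = load_iris()
    print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']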
Splitting training and test set:

We set 20% of the dataset aside as the test set and use the remaining 80% as the training set:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
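As a quick sanity check (not part of the original write-up), the split sizes can be confirmed; Iris has 150 samples, so a 20% test split leaves 120 for training and 30 for testing:

    print(X_train.shape, X_test.shape)  # (120, 2) (30, 2) in the two-feature case
    print(y_train.shape, y_test.shape)  # (120,) (30,)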
Creating scatter plots to visualize (two different feature pairs):

For the first two features

Making a scatter plot of the two sepal features for the three species, and changing the legend labels from the encoded classes to the Iris species names (Setosa, Versicolour, and Virginica):

    import matplotlib.pyplot as plt

    f, axs = plt.subplots(figsize=(8, 6))
    the_scatter = axs.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
    lines, legend_names = the_scatter.legend_elements()
    legend1 = axs.legend(lines, ['Setosa', 'Versicolour', 'Virginica'], title="Classes")
    axs.add_artist(legend1)

For the last two features

Making a scatter plot of the two petal features for the three species:

    f, axs = plt.subplots(figsize=(8, 6))
    the_scatter = axs.scatter(X[:, 2], X[:, 3], c=y, cmap=plt.cm.Set1, edgecolor='k')

As can be seen, the two petal features give much more information for the decision boundaries, so we would expect predictions to improve when we include them.
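To put a number on that observation, one could fit a quick k-NN on the sepal pair and on the petal pair separately and compare test accuracy (a sketch using the same 80/20 split; the choice of k=5 here is an assumption):

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    for name, cols in [('sepal', slice(0, 2)), ('petal', slice(2, 4))]:
        Xf = iris.data[:, cols]
        Xf_train, Xf_test, yf_train, yf_test = train_test_split(
            Xf, iris.target, test_size=0.2, random_state=42)
        clf = KNeighborsClassifier(n_neighbors=5).fit(Xf_train, yf_train)
        print('%s features, test accuracy: %.2f' % (name, clf.score(Xf_test, yf_test)))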
Creating a function that will print out the features from the integers in the dataset:

Because the categorical variable (the species) is encoded as an integer, we need to map it back to its name so the values are interpretable. So we define a function print_features(x, y) that prints the Iris type and the sepal length/width in a clean format:

    def print_features(x, y):
        print('Iris type:', ['Setosa', 'Versicolour', 'Virginica'][y])  # sklearn encodes the classes as 0, 1, 2
        print('Sepal Length: %.1f \t Sepal Width: %.1f' % (x[0], x[1]))
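A usage example (not in the original notebook); iris.target_names[y] could also replace the hard-coded list:

    # Example: print the species and sepal measurements of the first training sample
    print_features(X_train[0], y_train[0])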
Creating Euclidean distance function:

This was done for fun, to check whether the same neighbours would be returned as by the k-nearest neighbours function from sklearn (and to further understand what is occurring "under the hood" in the algorithm):

    import numpy as np
    from math import sqrt

    def euclidean_distance(x1, x2):
        distance = sqrt(np.sum((x2 - x1) ** 2))
        return distance
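A quick check (not in the notebooks) against NumPy's built-in norm confirms the function returns the expected value:

    a = np.array([1.0, 2.0])
    b = np.array([4.0, 6.0])
    print(euclidean_distance(a, b))  # 5.0
    print(np.linalg.norm(b - a))     # 5.0, the same value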
Printing out nearest neighbours:

Using my Euclidean distance function

First, we randomly select a test example (#11) and print its features so we can compare it to its nearest neighbours:

    sample = X_test[10]
    print('Test Sample:')
    print_features(sample, y_test[10])

Next, we calculate the Euclidean distance from every training example to this test example:

    distances = []
    for i, row in enumerate(X_train):
        distance = euclidean_distance(sample, row)
        # print(f"{i}: {distance} from x1: {sample} and x2: {row}")  # Checking the calculations above for correctness
        distances.append((i, distance))
    distances.sort(key=lambda tup: tup[1])  # sorting by distance

Finally, we print the closest 2 neighbours with their features for comparison:

    k = 2  # Number of nearest neighbours
    print('\nTop %d Nearest Neighbors:' % k)
    for nn in range(k):
        print_features(X_train[distances[nn][0]], y_train[distances[nn][0]])
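The classification step itself would then be a majority vote over the labels of these k neighbours; a sketch of that final step (not in the original notebook):

    from collections import Counter

    neighbour_labels = [y_train[idx] for idx, _ in distances[:k]]
    predicted = Counter(neighbour_labels).most_common(1)[0][0]
    print('Majority-vote prediction:', ['Setosa', 'Versicolour', 'Virginica'][predicted])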
Using sklearn k-NN

Import the classifier from the library:

    from sklearn.neighbors import KNeighborsClassifier

Just like before, we select test example #11 and print its features so we can compare it to its nearest neighbours. Then we use sklearn's KNeighborsClassifier() and ask for the top 5 neighbours:

    k = 5
    neigh = KNeighborsClassifier(n_neighbors=k)
    neigh.fit(X_train, y_train)

Finally, we print the closest 5 neighbours with their features for comparison:

    dists, neighbor_ids = neigh.kneighbors(X=[sample], n_neighbors=k)
    print('\nClosest %d neighbors to this test sample:' % k)
    for knn in range(k):
        print('\nNeighbor %d ===> distance: %f' % (knn, dists[0][knn]))
        print_features(X_train[neighbor_ids[0][knn]], y_train[neighbor_ids[0][knn]])

From the results, both approaches return the same top 2 nearest neighbours. This makes sense: the documentation for KNeighborsClassifier() says the metric is technically Minkowski, but with power parameter p=2 it is equivalent to the Euclidean metric. Since p=2 is the default, the calculations should be the same.
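As a small follow-up (not shown in the original), the fitted classifier can also predict this test sample's class directly, which can then be compared with its true label:

    pred = neigh.predict([sample])[0]
    print('Predicted:', ['Setosa', 'Versicolour', 'Virginica'][pred])
    print('Actual:   ', ['Setosa', 'Versicolour', 'Virginica'][y_test[10]])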
Creating a heatmap of distances between the test set and the training set:

Creating a list to store the distances between the test set and the training set:

    distances = []
    for x_test in X_test:
        distance = np.sum((x_test[np.newaxis, ...] - X_train) ** 2, axis=1)  # squared Euclidean distance to every training example
        distances.append(distance)
    distances = np.array(distances)

Creating a "colorbar" graph to display the distance from each training example to each test example:

    plt.figure(figsize=(50, 20))
    plt.imshow(distances)
    plt.colorbar()
    plt.xlabel('Training example id', fontsize=40)
    plt.ylabel('Test example id', fontsize=40)
    plt.xticks(np.arange(0, 120, 2), fontsize=20)
    plt.yticks(np.arange(0, 30, 1), fontsize=20)
    plt.show()

Note: a good reason to check this is that k-Nearest Neighbours works best when every test example has training examples close to it. If the heatmap shows mostly large distances, the train/test split does not represent the data well.
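The same distance matrix can also be produced in a single call with sklearn's pairwise helper (true Euclidean rather than squared distances, which only rescales the colours); a sketch:

    from sklearn.metrics.pairwise import euclidean_distances

    distances = euclidean_distances(X_test, X_train)  # shape (n_test_examples, n_train_examples)
    print(distances.shape)                            # (30, 120)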
Creating a line graph to show test accuracy based on k:

Creating a list of test accuracies as k changes:

    test_accs = []
    for k in range(1, X_train.shape[0]):
        # Create k-nearest neighbours classifier
        neigh = KNeighborsClassifier(n_neighbors=k)
        neigh.fit(X_train, y_train)
        # Prediction
        y_pred = neigh.predict(X_test)
        # Calculate accuracy
        acc = (y_pred == y_test).mean()
        test_accs.append(acc)

Creating a line graph of accuracy versus k, to find the best hyperparameter (here, the k value) visually.

For 2 features:

    plt.figure(figsize=(30, 10))
    plt.plot(list(range(1, X_train.shape[0])), test_accs)
    plt.axhline(y=0.93, color='r', linestyle='-')
    plt.axvline(x=29, color='b', linestyle='--')
    plt.xlabel('Number of nearest neighbors (k)')
    plt.ylabel('Test set accuracy')
    plt.xticks(np.arange(0, 120, 2), fontsize=18)
    plt.yticks(np.arange(0, 1.0, 0.05), fontsize=18)
    plt.grid()
    plt.show()
We can see the test accuracy decreasing as k grows larger, because the model underfits the training set and loses the underlying relationship. At very low k, accuracy is also poorer, because the model overfits the training set and captures noise rather than the underlying relationship.
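Rather than reading the best k off the graph, it can also be extracted programmatically (a sketch; note that this still selects k on the test set, which the discussion at the end addresses):

    best_k = int(np.argmax(test_accs)) + 1  # +1 because k started at 1
    print('Best k: %d, test accuracy: %.3f' % (best_k, test_accs[best_k - 1]))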
For 4 features:
As can be seen, the more features we add, the more accurate the predictions become.
After this assignment, we discussed that the data should really be divided into three sections: a training set, a validation set, and a test set. That way the validation set is used to determine the best hyperparameters, and the test set is used only once to check accuracy.
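A minimal sketch of such a three-way split (the 60/20/20 proportions and random_state are assumptions, not from the assignment):

    # Carve off the test set first, then split the remainder into training and validation sets
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
    # Choose k by accuracy on (X_val, y_val); evaluate on (X_test, y_test) only once at the end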