BMI 534 - Introduction to Machine Learning

CS 534 - Machine Learning

The course is cross-listed as both BMI 534 and CS 534, so please register for whichever one has available seats.

Instructor:

Hyeokhyen Kwon, Ph.D.
Assistant Professor
Department of Biomedical Informatics
Office: Rm 4105, 4th Floor, Emory Woodruff Memorial Research Building (101 Woodruff Cir, Atlanta, GA 30322)

Teaching Assistant: Seyedeh Somayyeh Mousavi (mail)

Course Overview

Machine learning is transforming applications across society, from autonomous vehicles to biomedicine and science. In this course, students will learn the fundamental theories (optimization, probability, linear algebra, etc.) and algorithms of machine learning (supervised and unsupervised learning, etc.), and will gain practical experience applying machine learning techniques and analysis to real-world problems in biomedical informatics.

Learning Objectives

This course will introduce students to fundamental theory and algorithms in machine learning through lectures, homework, a midterm, and a semester-long project. After taking this course, students should be able to:

Prepare datasets for machine-learning experiments
Understand the basic building blocks and general principles that underlie machine learning algorithms
Be familiar with specific, widely used machine learning algorithms for classification and regression
Formulate rigorous validation protocols, and evaluate the rigor of published experiments
Understand the bias and variance tradeoffs and strategies to mitigate overfitting
Understand fundamental machine-learning algorithms presented in the second half of the course
Learn methodology and tools to apply machine learning algorithms to real data and evaluate their effectiveness and performance

Prerequisites:

Undergraduate-level linear algebra
Undergraduate-level multivariate calculus
Undergraduate-level statistics and probability theory
Coding experience:
- Python (required)
- Matlab (optional)
- C/C++ (optional)
- R (optional)
Permission of the instructor (send an email to the instructor)

Course Logistics

Communication & Course Materials:

Canvas:
The class syllabus, schedule, lecture slides, homework handouts/files, discussion, and grades are posted on this platform. Please ask all questions under the discussions section of Canvas. You are encouraged to answer other students’ questions when you know the answer. Do not ask or answer questions that are exactly homework questions.
Email:
If there are private matters specific to you (e.g., special accommodations, requesting alternative arrangements, etc.) or confidential matters you want to report, please email the instructor with a subject line starting with “[CS534]”. I probably will not respond immediately but will try to respond within 24 hours (during specific busy periods I may need 48 hours). It is best to avoid last-minute questions that require immediate attention (e.g., before a deadline!). Plan accordingly.
Office Hours:
- Instructor (Hyeokhyen Kwon): Mon. 1 - 3 pm @ Instructor’s office / Zoom
- TA (Seyedeh Somayyeh Mousavi): Thursdays (3-5 pm) @ BMI classroom / Zoom
Remote Attendance:
- You can attend remotely only if necessary. This must be communicated to the instructor and TA at least a week in advance.
- Zoom

Textbook(s):

Required
- “The Elements of Statistical Learning: Data mining, Inference, and Prediction, Second Edition”, by Trevor Hastie, Robert Tibshirani & Jerome Friedman (link)
Supplemental:
- “Machine Learning: a Probabilistic Perspective”, by Kevin Murphy
- “Pattern Recognition and Machine Learning”, by Christopher Bishop
- “A First Encounter with Machine Learning”, by Max Welling (link)
- “A Course in Machine Learning”, by Hal Daumé III (link)
- “Understanding Machine Learning: From Theory to Algorithms”, by Shai Shalev-Shwartz & Shai Ben-David (link)

Expectations & Grading:

The final grade will be determined by a weighted average of all the graded items.

Component	Weight
Participation	10%
Homeworks	35%
Midterm	15%
Project	40%
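As an illustration of how the weights above combine into a final grade, here is a minimal Python sketch. The component scores are made-up numbers and the function name `final_grade` is hypothetical; any curve is applied separately and is not modeled here.

```python
# Weights taken from the grading table above (fractions of the final grade).
WEIGHTS = {
    "participation": 0.10,
    "homeworks": 0.35,
    "midterm": 0.15,
    "project": 0.40,
}


def final_grade(scores):
    """Return the weighted average of component scores (each on a 0-100 scale)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must contain exactly the graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"participation": 90, "homeworks": 80, "midterm": 70, "project": 85}
print(round(final_grade(example), 2))
```

Because the project carries the largest weight, a change of a few points there moves the final grade more than the same change on the midterm.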

Final grades may be curved up so that the class mean falls at least in a B range. The class median, mean, and standard deviation will be announced for each assignment and exam so that you have an idea of where you stand.

Class participation, punctuality, and attendance (10%):
This is not based on attendance: even if you show up for all classes, you can still get a 0. The goal is to encourage class participation and active dialogue.
This can be achieved in several different fashions:
- Asking questions in class, during office hours (TAs or instructor), or on Canvas discussion
- Answering my questions in class
- Participating in in-class exercises and group projects
- Reviewing the lecture slides and corresponding readings/references after each class
- Attending office hours and contributing to the discussion, i.e., asking and answering other students’ questions
Homework (35%):
Homework assignments will help students develop analytical and programming skills. A homework assignment testing prerequisite material will be assigned on the first day to identify students who are not prepared for the course.

There will be 5 graded homework assignments spaced over the first 2/3 of the semester and 1 final project (more details in later sections). Each homework is typically due 10-14 days after it is released.

Homework is due electronically at 11:59 PM ET on the date specified. Each student receives six 24-hour “late days” that can be used on any of the 5 homework assignments throughout the semester. No more than 3 late days may be used on any single homework. If you have no more late days remaining, you will receive zero credit for any late homework. Additional extensions on homework will be granted with appropriate documentation.
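The late-day rules above can be sketched as a small bookkeeping routine. This is only an illustration under stated assumptions, not the official accounting: partial days are assumed to round up to a whole late day, and a submission that would exceed either limit is assumed to receive zero credit without consuming any late days. The function names are hypothetical.

```python
import math

TOTAL_LATE_DAYS = 6    # per student, per semester (from the policy above)
MAX_PER_HOMEWORK = 3   # cap on late days for any single homework


def late_days_needed(hours_late):
    """Whole 24-hour late days consumed; partial days round up (assumption)."""
    return max(0, math.ceil(hours_late / 24))


def apply_late_policy(hours_late_per_homework):
    """Return (accepted flags, late days remaining) for a sequence of submissions."""
    remaining = TOTAL_LATE_DAYS
    accepted = []
    for hours_late in hours_late_per_homework:
        needed = late_days_needed(hours_late)
        if needed <= MAX_PER_HOMEWORK and needed <= remaining:
            remaining -= needed
            accepted.append(True)
        else:
            # Assumption: exceeding either limit means zero credit.
            accepted.append(False)
    return accepted, remaining


# Hypothetical lateness (in hours) for the 5 homework assignments.
accepted, remaining = apply_late_policy([0, 30, 72, 50, 1])
print(accepted, remaining)
```

In this example the fourth submission needs 3 late days when only 1 remains, so it is the one that would receive zero credit.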
Midterm (15%):
The midterm must be taken at the required time. Rescheduling the midterm is only permitted in emergencies. There will be no final exam, so you can focus on the final project.
Project (40%):
This is a critical part of the course. The semester-long team project is designed to provide you with practical experience in data handling, implementation, designing validation experiments, and teamwork. The goal of the project is to apply machine learning algorithms to real-world tasks or to prepare you for machine learning-related research and project management.

Topic:
The project topic is open-ended, but needs to be within the scope of machine learning.
It can be an application project or an algorithmic project (developing a new learning algorithm or a novel variant of an existing state-of-the-art).
Each team can select any problem and dataset, provided that the problem and tasks haven’t been solved in the latest papers for the dataset. This means that for your proposed topic, you will need to study state-of-the-art papers and provide a justification for why it is interesting.
There is a strong preference for using a publicly available dataset. If you intend to collect the needed data yourself, keep in mind this is only one part of the expected project work and can often take considerable time.
Also, the topic can be related to your research interest, but DO NOT select graduate research mentored by your advisor as your topic.

Team size:
You will work in groups of 2-3 for the final project.

Evaluation:
Each team has four deliverables:
- Proposal (15%)
- Spotlight: a short “madness” presentation (10%)
- Final presentation (25%)
- Final report (50%)
The team size will also be taken into consideration when evaluating the scope of the project in breadth and depth, meaning that a 3-person team is expected to accomplish more than a 2-person team.
Teammates will score each other’s contributions, and each student’s final project grade will be weighted using this feedback. A fair distribution of work is achieved when each team member has an equal amount of work.

Project Proposal:
For the proposal, your group will pick a topic and receive feedback from the instructor. Your proposal should be a typeset PDF document that contains approximately 1-2 pages of material. It should have the following parts:
- Format & Contents
  - Title of the project.
  - Full names of all team members
  - Description of the problem and the data.
    Explain the aim of your project. What will be the criteria for the success of your project? For example, it can be achieving a certain level of a performance metric (accuracy, F1 score, ROC curve) for your problem. If you are using an existing benchmark dataset (e.g., a Kaggle competition), it is not sufficient to provide just the link. To ensure that the proposal is self-contained, provide a summary of the dataset description that is easily understandable by others.
  - Description of what you plan to do and how your work differs from existing work on this dataset/domain.
    - What are the machine learning methods you plan to apply or improve on?
    - What are the experiments you plan to run and how do you plan to evaluate the algorithms?
    You are not obligated to stick to what you propose, but this is to ensure you’ve done the appropriate literature search and have an initial plan of attack.
  - Potential Pitfalls and Mitigation Strategies.
    What potential challenges do you foresee in achieving your aims? Provide a short list of the challenges and mitigation plans to overcome them. This is to help you think through backup plans.
  - Short list of references.
    The relevant articles need to be cited in the main text accordingly.
- Grading
  The project proposal is mainly intended to make sure you decide on a project topic, think about the initial steps to take, and get feedback early. As long as your proposal follows the instructions above and the project seems reasonably well-thought-out, you should do well on the proposal.
Spotlight:
For the spotlight, each team will give a 1 to 2-slide, 90-second talk to convey an overview of their project. This gives each team a chance to experience talking to a large audience and get some presentation feedback while also getting an idea of what all the other teams are working on.
- Format
  1 PowerPoint (PPT) presentation that will auto-advance to the next slide.
- Grading
  Each team will be scored by the other teams and the instructor based on clarity, problem motivation, quality of the content, and overall presentation quality. The scoring rubric where the full score is 10 will be based on the following:
  - Overview + motivation (3):
    - Is the problem well described and motivated?
    - Is the problem novel and challenging?
  - Methodology (5):
    - Is the approach well described and justified?
    - What do the dataset/features look like?
    - What are the preprocessing, models, metrics, and evaluation methodology?
    - How does it compare to existing work?
    - Are the potential pitfalls and mitigation strategies justified?
  - Presentation/slide quality and clarity (2):
    - Is the presentation clear, coherent, and compelling?
Presentation:
For the presentation, each team will give a ~10-15 minute talk (depending on the number of groups and group size).
- Format
  Slides with content that motivate the problem, the approach, and the experimental results. A PDF version of the slides will be submitted to Canvas.
- Grading
  Each team will be scored by the other teams and the instructor based on clarity, problem motivation, quality of the content, and overall presentation quality. The scoring rubric will be based on the following:
  - Problem overview and motivation (20%):
    - Is the problem well described and motivated?
    - Is the problem novel and challenging?
  - Methodology (40%):
    - Is the approach well described and justified?
    - What do the dataset/features look like?
    - What are the preprocessing, models, metrics, and evaluation methodology?
    - How does it compare to existing work?
  - Preliminary results (20%):
    - Are there any preliminary results or findings for preprocessing and the selected models?
    - How does the result compare to existing work (if any)?
    - Is the project on track to deliver?
    - Is the future work clear and justified?
  - Presentation/slide quality and clarity (20%):
    - Is the presentation clear, coherent, and compelling?
Report:
Each team will submit a final project report. If you did this work in collaboration with someone else, or if someone else (such as another professor) advised you on this work, your write-up must fully acknowledge their contributions.
- Format
  A typeset PDF document of 6-12 pages (single column, no smaller than 11-point font), with unlimited pages for references. Suggested length (not required): 6-8 pages for 2-person teams, 10-12 pages for 3-person teams. The typical (but not always obligatory) ingredients of a project report are:
  - Abstract:
    - 1 paragraph
  - Introduction:
    - What is the problem being addressed?
    - Why is it important?
    - Overview and high-level rationale of your approach.
    - Highlight the novelty, key contributions, or significance of your project.
  - Background:
    - Past related work on the problem done by others.
  - Methods:
    - Describe your learning algorithms or proposed algorithm(s).
    - Introduce any relevant mathematical notation if needed.
    - For each algorithm, give a short description of how it works.
    - For algorithms covered in class, this can be 1-3 sentences to convey why it might be useful for this problem.
    - If you are using a niche or cutting-edge algorithm, you may want to explain your algorithm in 1-2 paragraphs
  - Experiments:
    - Data description
    - Plan for exploratory data analysis, preprocessing, feature extraction, and feature selection:
      - Enough to ensure your experiments can be replicated by others
    - Modeling choices:
      - What models did you select? Why?
      - How were parameters determined?
      - Other design choices such as subsampling, oversampling, etc.
    - Evaluation plan:
      - What is the cross-validation approach and metric you used to evaluate the model? Why?
      - If any, what is your baseline method to compare with the proposed method?
  - Results:
    - Empirical results and comparisons (figures and tables)
  - Discussion:
    - What are the key findings and lessons learned from the experiment?
    - What are the limitations of this work?
    - What future work or applications follow from the findings?
  - Conclusion:
    - Summary of motivation, contribution, experiment result, findings, and future work
    - 1 paragraph
  - Team contributions
    - What did each team member work on for the project?
    - What are the percentages (%) of the contributions of each team member?
  - Code/Dataset:
    - Include the link to a GitHub repository, Google Drive folder, or Dropbox folder with the code and dataset.
    - The code should run standalone with the provided dataset, so provide an execution script and a description that let the instructor and TA run the code and check the results efficiently.
    - Include comments in the code so that the instructor and TA can follow what it does.
    - The results in the report should be replicable. This means that if you use any random function in your model or experiments, the final version should fix the random seed so the results can be replicated.
- Grading
  The final report will be judged based on the clarity of the report, the novelty of the problem, the technical quality, the analysis, and the significance of the work. The major components include a clear motivation and description of the project goals, the chosen methodologies (preprocessing and machine learning models), the presentation and evaluation of the results, replicability of the results (is the description such that someone well-versed in machine learning could obtain similar results on the same dataset), insights from the project and analysis of the results, appropriate/relevant reference list, and the quality of the writing (grammar and style).
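The seed-fixing requirement in the Code/Dataset items above can be illustrated with a short sketch. It uses only Python's standard `random` module; the seed value and the helper `shuffled_split` are hypothetical, and libraries such as NumPy or deep learning frameworks keep their own random state, which would need to be seeded separately.

```python
import random

SEED = 534  # arbitrary fixed value; any constant works


def shuffled_split(n_samples, test_fraction, seed):
    """Shuffle sample indices with a seeded generator and split into train/test."""
    rng = random.Random(seed)  # local generator, independent of global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Two runs with the same seed produce the same split.
train_a, test_a = shuffled_split(10, 0.2, SEED)
train_b, test_b = shuffled_split(10, 0.2, SEED)
print(train_a == train_b and test_a == test_b)
```

Passing the seed explicitly, rather than relying on global state, makes it easier to record the exact value used for each experiment in the report.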

University Policies and Academic Integrity

Any suspected violations of course rules or Emory’s Honor Codes will be referred to the honor council for a hearing.
This includes, but is not limited to, consulting electronic or printed materials during the midterm and plagiarism on homework or class projects.
It is your responsibility to understand the Laney Graduate School Honor Code, the Emory College Honor Code, and the Department Statement of Policy on Computer Assignments.

ChatGPT (or Any Other Large Language Model and Internet Resources):

The Internet and ChatGPT can also be useful resources for learning.

DISCLAIMER: You are responsible for discerning whether the resource is reliable and correct (i.e., caveat emptor). While it is okay to look at resources for the broad topic (e.g., how decision trees work), it is not okay to look for solutions to a specific homework problem (e.g., how to implement a decision tree in Python). Please ask if you are unsure whether something is allowed. You must cite all online sources used while working on homework and final projects. It is always your responsibility to learn if a source is allowed.

Apparent copies from any source, including your colleagues and internet sites, will be referred to the appropriate honor council. Every homework submission must have a README file with the following comments:

/* THIS CODE IS MY OWN WORK, IT WAS WRITTEN WITHOUT CONSULTING CODE WRITTEN BY OTHER STUDENTS OR LARGE LANGUAGE MODELS LIKE CHATGPT. Your_Name_Here */
Homework:

Homework assignments should be completed independently. All submissions must contain only your original work and reflect your understanding of the assignment. A signed honor pledge must be submitted with each homework assignment. Assignments will not be accepted without a pledge. Discussing homework assignments is not expressly forbidden, but code should not be communicated under any circumstances. Sharing or receiving code related to homework assignments will be considered a violation of the honor code.
Midterm:

The use of electronic devices during the midterm is forbidden and any incidents will be reported as honor code violations.
Accommodations:

The Departments of Computer Science and Biomedical Informatics at Emory support equal access for students with disabilities. Any students needing special accommodations due to a disability should contact the Department of Accessibility Services (DAS) and appropriate arrangements will be made.

If you are a student who is currently registered with DAS and have not received a copy of your accommodation notification letter within the first week of class, please notify DAS immediately. Students who have accommodations in place are encouraged to contact the instructor during the first week of the semester to communicate their specific needs for the course as they relate to their approved accommodations. All discussions with DAS and faculty concerning the nature of your disability remain confidential.

(Tentative) Course Schedule

Topics may change but the homework, midterm, and project deliverables are fixed. The reading material listed below is optional and the lecture plan may deviate over the course of the semester.

#	Date	Theme	Topic	Reference (Chapter)	Assignment
1	1/17	Intro + Course Logistics	Review syllabus, Overview of course topics	Ch. 1 (Hastie et al.) Ch. 1 (Murphy) Ch. 3 (Welling)	Homework #0 out (Due 1/30)
2	1/22	Intro to Optimization		Convex optimization notes Part I and II from Stanford’s machine learning class Rosenberg’s abridged notes
3	1/24	Intro to Statistics, Probability, and Random Variables	Random variables, probability density functions, conditional and joint distributions, Bayes rule	Handouts
4	1/29	Statistical Decision Theory + Linear Regression	Mapping machine learning problems to statistical concepts, Regression, ridge regression	Ch 1 -2; Ch 3.1 - 3.4 (Hastie et al.) Ch. 17.1 - 17.2 (Barber) Prof. Carlos Carvalho’s MLR Slides
5	1/31	Linear Regression + Naive Bayes	LASSO regression, elastic net regression		Homework #1 out (Due 2/13)
6	2/5	Linear Classification	logistic regression, LDA, QDA	Ch 2.1 - 2.4; Ch 4.1 - 4.4 (Hastie et al.)
7	2/7	Linear Classification + Bias-Variance Tradeoff	Training & test error, conditional and expected test error, bias-variance decomposition and tradeoff, training error optimism	Ch 7.2 - 7.3 (Hastie et al.) Ch. 5.9 (Daumé III)
8	2/12	Model Assessment + Error Measures	Validation as an estimation problem, cross validation, bias and variance of cross validation schemes, Error measures, class imbalance, ROC analysis, precision-recall	Ch. 7.10 (Hastie et al.) Ch. 2.5 - 2.6 (Daumé III)
9	2/14	Model Selection	Effective number of parameters, Akaike and Bayes information criterion	Ch. 7 (Hastie et al.) Ch. 5.5 - 5.6 (Daumé III)	Homework #2 out (Due 2/27)
10	2/19	Practical Issues	Preparing data, labeling issues, interpretation	Ch. 9 -10 (Hastie et al.)
11	2/21	Decision Trees	Decision trees, boosting	Ch. 9.2 (Hastie et al.) Ch. 1.3 (Daumé III)
12	2/26	Perceptron + Support Vector Machines	Perceptron, SVM, kernel SVM	Ch. 12 (Hastie et al.) Ch. 4; Ch. 11 (Daumé III) Ch. 7 - 9 (Welling) Ch. 15 (Shalev-Shwartz & Ben-David) Stanford SVM notes NYU SVM notes
13	2/28	Neural Networks	Architectures, gradient optimization, back propagation	Ch. 11 (Hastie et al.) Ch. 1-3 (Nielsen) Ch. 20.1 - 20.3 (Shalev-Shwartz & Ben-David)	Homework #3 out (Due 3/14)
14	3/4	Neural Networks			Project Proposal due 3/5
	3/6	Spring Break
	3/11	Spring Break
15	3/13	Additive Models + Bootstrap	AdaBoost, gradient boosting	Ch. 7.11; Ch. 9.1 (Hastie et al.)
16	3/18	Boosting		Ch. 10 (Hastie et al.)	Homework #4 out (Due 4/2)
17	3/20	Random Forest	Ensemble methods, random forests	Ch. 15 - 16 (Hastie et al.) Breiman’s paper	Project Spotlight Slides Due 3/24
18	3/25	Project Spotlight + Ensembles
19	3/27	Prototype methods + Challenges with High-dimensional Data + Dimensionality Reduction	KNN, Curse of dimensionality, sparse representation	Ch. 13 - 14; Ch. 18 (Hastie et al.) Ch. 3.2 - 3.3 (Daumé III) Ch. 5 (Welling) Ch. 19.1 - 19.2; Ch. 23 (Shalev-Shwartz & Ben-David) Stanford PCA notes
20	4/1	Dimensionality Reduction	Principal component analysis, locally-linear embedding, manifold learning	Ch. 14 (Hastie et al.)
21	4/3	Clustering + Mixture modeling	K-means, spectral clustering, expectation maximization	Ch. 14 (Hastie et al.)	Homework #5 out (Due 4/16)
22	4/8	Reinforcement Learning	Markov Decision Process
23	4/10	Reinforcement Learning	Q-Learning
24	4/15	Bayesian Network	Probabilistic Graphical Model
25	4/17	Filtering + Time-series Analysis	Kalman Filter, Hidden Markov Model
26	4/22	Midterm Exam
27	4/24	Ethics in AI
28	4/29	Project Presentations			Final Report Due 5/10