Telling Stories from Numbers

By Son Tung (Bill) Do, FCLC 2023

Figure: Course sequence flow diagram for required CS courses [1]

I’ve been involved in Data Mining and Machine Learning since the end of my sophomore year, when I answered an ad posted on Blackboard. Two professors from the Computer Science (CS) department were looking for students to work in their EDM Lab. Electronic Dance Music, you think? No, it’s actually Educational Data Mining. Data mining is the practice of finding patterns and correlations in large amounts of data, and it is how you can tell a story through numbers instead of words or music.

At EDM Lab, I explore multiple sources of educational data, such as applicants’ information or college course grades, and my task is to find the trends and patterns in those huge datasets. For instance, do two given courses make a good sequence, or should they be taken in parallel? Why do some prerequisite courses have no effect at all on students’ performance in the courses that follow them? Should they be redesigned? How does gender affect the wording of recommendation letters? Can applicants’ information be used to predict their test scores? Those are some of the interesting questions we explore in EDM Lab. Our raw data can contain half a million entries, and data mining techniques help tell stories from this seemingly boring ocean of numbers. The figure above shows the frequency of some CS course sequences that Fordham students take. My personal sequence is DISC -> CS2 -> DS -> DB/DM/ALG -> TOC/ORG -> OS/NET, and thousands of other students take courses in different sequences for different reasons. Each thick edge reflects either a required sequence or some general consensus among CS students that we can speculate on. Every thick edge is a potential story of its own.
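To give a rough sense of how the thickness of an edge in a diagram like this could be computed, here is a minimal Python sketch that counts how often one course is taken in the term right before another. It is a simplified illustration, not the Generalized Sequential Pattern mining used in [1]. The function name `count_transitions` and the three student histories are made up for the example; only the course abbreviations come from the sequence above.

```python
from collections import Counter


def count_transitions(histories):
    """Count (earlier, later) course pairs taken in consecutive terms.

    `histories` holds one entry per student. Each entry is a list of terms,
    and each term is a list of the course codes taken in that term.
    """
    counts = Counter()
    for terms in histories:
        # Pair every term with the term that immediately follows it.
        for earlier, later in zip(terms, terms[1:]):
            for a in earlier:
                for b in later:
                    counts[(a, b)] += 1
    return counts


# Hypothetical toy data, NOT real student records.
histories = [
    [["DISC"], ["CS2"], ["DS"], ["DB", "ALG"]],
    [["DISC"], ["CS2"], ["DS"], ["DM"]],
    [["CS2"], ["DISC"], ["DS"], ["ALG"]],
]

edges = count_transitions(histories)
for (a, b), n in edges.most_common(3):
    print(f"{a} -> {b}: {n}")
```

The larger a pair's count, the thicker its edge would be drawn. A real analysis would also need to handle gaps between terms, repeated courses, and far more students.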

Data Mining delivers the end result, so what techniques are needed to get there? This is where Machine Learning (ML) comes into play. Many modern Data Mining techniques are based on ML. Instead of telling computers exactly how to do a task, I can teach them to learn, so they work out how to do the task themselves. In class, I study the mathematical models behind these ML techniques, and I apply them to my work at EDM Lab. One learning model I’ve created together with some graduate students can predict applicants’ standardized test scores from their other information. It shows a correlation between some geographical regions and high test scores, a story worth exploring. It’s satisfying to watch theoretical math models turn into solutions to practical problems.
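As a toy illustration of what "learning from data" means, the sketch below fits a straight line to a few points by ordinary least squares and uses it to predict a new value. This is not the model we built at EDM Lab. The single input feature (a GPA), the numbers, and the name `fit_line` are all assumptions chosen to keep the example small.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, made-up numbers, NOT real applicant data.
gpas = [2.0, 3.0, 4.0]
scores = [1000.0, 1200.0, 1400.0]

slope, intercept = fit_line(gpas, scores)

# Use the learned line to predict the score for an unseen GPA.
predicted = slope * 3.5 + intercept
print(predicted)
```

Nobody writes the slope and intercept by hand here: the program derives them from the examples. Real models do the same with many more features and much messier data.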

My experience in EDM Lab has taught me valuable skills for my career. I have learned to process and analyze large datasets and to implement ML systems. The research has reinforced my interest in Machine Learning and Artificial Intelligence, which I will pursue further in my senior year at Fordham and later in my career.

[1] Daniel D. Leeds, Cody Chen, Yijun Zhao, Fiza Metla, James Guest, and Gary M. Weiss. Generalized Sequential Pattern Mining of Undergraduate Courses. In Proceedings of the 15th International Conference on Educational Data Mining (EDM22), International Educational Data Mining Society, Durham, UK, July 24-27, 2022.