Developing diverse coding projects blending data analytics and interactive experiences through Python, Pandas, and Java

ROLE
Programmer

TIMELINE
January 2023 - December 2024



TOOLS
Java
Python
Jupyter Notebook
Pandas
NumPy

SKILLS 
Data Analytics
Game Development
Feature Engineering
EDA

Project Overview
This Java-based 2D game immerses players in a procedurally generated world composed of interconnected rooms and tiles. Players control an animated avatar that can move seamlessly across the environment while a custom rendering engine updates the game state in real time. A save/load feature preserves player progress, and an integrated point system tracks achievements. By blending object-oriented principles with dynamic world generation and graphical updates, the game demonstrates robust problem-solving skills and end-to-end game design expertise.

GitHub Repository

Project Overview
In this Python-based data science project, I leveraged Jupyter Notebooks to analyze an extensive housing dataset. By conducting Exploratory Data Analysis (EDA), I identified meaningful patterns, handled missing values, and performed feature engineering to refine predictive capabilities. Leveraging scikit-learn, I built a linear regression model tailored for housing price prediction. This approach demonstrated proficiency in data cleaning, feature selection, and model evaluation, yielding insights into which variables significantly influence market values.


I began by examining the distribution of the target variable, `Sale Price`. A provided helper method, `plot_distribution`, visualizes the distribution of `Sale Price` as a histogram and a box plot at the same time.


To zoom in on the bulk of households, I focused on a subset of `Sale Price` and applied a log transformation. I reassigned `training_data` to a new dataframe that keeps only households whose price is at least $500 and adds a `Log Sale Price` column containing the log-transformed sale prices.
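The filtering and transformation step can be sketched as follows. This is a minimal sketch rather than the original notebook code: the function name `filter_and_log_transform` is hypothetical, the toy prices are invented, and the natural logarithm is assumed because the text does not state the base.

```python
import numpy as np
import pandas as pd


def filter_and_log_transform(data: pd.DataFrame) -> pd.DataFrame:
    """Keep sales of at least $500 and add a log-transformed price column."""
    # Copy the filtered rows so the original dataframe is left untouched.
    filtered = data[data["Sale Price"] >= 500].copy()
    # Assumption: natural log; the write-up does not specify the base.
    filtered["Log Sale Price"] = np.log(filtered["Sale Price"])
    return filtered


# Toy stand-in for the housing data (values invented for illustration).
training_data = pd.DataFrame({"Sale Price": [1, 499, 500, 250_000]})
training_data = filter_and_log_transform(training_data)
```

Returning a new dataframe instead of mutating the input keeps the raw data available for later comparison.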


Next, I created a visualization showing whether there is an association between `Bedrooms` and `Log Sale Price`. The plot had to meet three requirements:
- It avoids overplotting.
- It has clearly labeled axes and a succinct title.
- It conveys the strength of the correlation between sale price and the number of bedrooms, so that a reader can look at the plot and describe the general relationship between `Log Sale Price` and `Bedrooms`.
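One way to meet these requirements is a box plot per bedroom count, which summarizes each group instead of drawing every point. The sketch below is an assumption about how such a plot could be built, not the plot from the original notebook; the function name `plot_log_price_by_bedrooms` and the synthetic data are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_log_price_by_bedrooms(data: pd.DataFrame):
    """Draw one box of Log Sale Price for each distinct bedroom count."""
    counts = sorted(data["Bedrooms"].unique())
    groups = [
        data.loc[data["Bedrooms"] == b, "Log Sale Price"].to_numpy()
        for b in counts
    ]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.boxplot(groups)  # boxes are placed at x = 1, 2, ..., len(groups)
    ax.set_xticks(range(1, len(counts) + 1))
    ax.set_xticklabels([str(b) for b in counts])
    ax.set_xlabel("Bedrooms")
    ax.set_ylabel("Log Sale Price")
    ax.set_title("Log Sale Price by Number of Bedrooms")
    return fig, ax


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
bedrooms = np.repeat([1, 2, 3], 50)
toy = pd.DataFrame({
    "Bedrooms": bedrooms,
    "Log Sale Price": 11 + 0.2 * bedrooms + rng.normal(0, 0.5, bedrooms.size),
})
fig, ax = plot_log_price_by_bedrooms(toy)
```

Because each box shows the median and spread for one bedroom count, a trend in the medians can be read directly from the plot.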


Finally, using `plt.scatter`, I plotted the residuals from predicting `Log Sale Price` with the second model against the original `Log Sale Price` for the validation data. With such a large dataset it is difficult to avoid overplotting entirely, so I set the dot size and opacity to reduce its impact as much as possible.
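A residual plot of this kind can be sketched as below. The function name `plot_residuals`, the specific `s` and `alpha` values, and the sign convention (actual minus predicted) are assumptions for illustration; the original notebook may differ.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_residuals(y_true, y_pred):
    """Scatter residuals against the actual values and return the residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumed convention: residual = actual - predicted.
    residuals = y_true - y_pred
    plt.figure(figsize=(8, 5))
    # Small, semi-transparent dots reduce the impact of overplotting.
    plt.scatter(y_true, residuals, s=2, alpha=0.2)
    plt.axhline(0, color="black", linewidth=1)  # reference line at zero error
    plt.xlabel("Log Sale Price (validation)")
    plt.ylabel("Residual (actual - predicted)")
    plt.title("Residuals vs. Log Sale Price")
    return residuals


# Toy values for illustration only.
residuals = plot_residuals([1.0, 2.0, 3.0], [0.5, 2.0, 3.5])
```

A well-behaved model shows residuals scattered evenly around the zero line, while a visible slope or funnel shape hints at structure the model has missed.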

Project Overview
In this project, I developed a binary classification model to distinguish between spam (unwanted or commercial) and ham (legitimate) emails, building upon foundational work from a previous assignment. The dataset, sourced from SpamAssassin, contains a total of 8,348 labeled examples and 1,000 unlabeled examples, reflecting a diverse cross-section of real-world emails. Using feature engineering techniques specific to text data—such as tokenization, stop-word removal, and TF-IDF transformations—I constructed a Logistic Regression classifier via scikit-learn to achieve robust performance while minimizing overfitting. Additional steps included model validation and the generation of ROC curves to assess overall predictive accuracy. This realistic dataset introduced challenges around handling offensive or inappropriate spam content; however, it also offered the benefit of a genuine, unfiltered look into modern email filtering requirements.
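The general shape of such a classifier can be sketched with a scikit-learn pipeline. This is an illustrative sketch only: the function name `build_spam_classifier`, the hyperparameters, and the four toy emails are assumptions and do not reproduce the features or settings of the actual project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_spam_classifier():
    """TF-IDF features (lowercased, English stop words removed) + logistic regression."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        LogisticRegression(max_iter=1000),
    )


# Toy corpus invented for illustration; 1 = spam, 0 = ham.
emails = [
    "Win a free prize now, click here",
    "Cheap loans, limited offer, click now",
    "Meeting moved to Tuesday afternoon",
    "Please review the attached project report",
]
labels = [1, 1, 0, 0]

model = build_spam_classifier()
model.fit(emails, labels)
# Probability of the spam class for a new message.
spam_probability = model.predict_proba(["free prize, claim now"])[:, 1]
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned only from the training split, which helps avoid leaking validation data into the features.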

GitHub Repository


I created a bar chart comparing the proportion of spam and ham emails that contain specific words. The chosen words occur at noticeably different rates in the two classes (i.e., the bar heights differ clearly between spam and ham).
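The proportions behind such a bar chart can be computed as in the sketch below. The column names `email` and `spam`, the function name `word_proportions`, and the toy messages are assumptions for illustration. Matching is by plain substring, so "free" would also match "freedom".

```python
import pandas as pd


def word_proportions(emails: pd.DataFrame, words: list[str]) -> pd.DataFrame:
    """Proportion of emails in each class that contain each word."""
    text = emails["email"].str.lower()
    # One boolean column per word: does the email contain it?
    indicators = pd.DataFrame(
        {w: text.str.contains(w, regex=False) for w in words}
    )
    indicators["type"] = emails["spam"].map({0: "ham", 1: "spam"})
    # The mean of a boolean column is the fraction of True values.
    return indicators.groupby("type").mean()


# Toy data invented for illustration; 1 = spam, 0 = ham.
emails = pd.DataFrame({
    "email": [
        "Free money inside",
        "free offer, click now",
        "Lunch at noon?",
        "Notes from the free seminar",
    ],
    "spam": [1, 1, 0, 0],
})
props = word_proportions(emails, ["free", "click"])
```

Transposing the result (`props.T`) puts the words on one axis and the two classes side by side, which is the layout a grouped bar chart needs.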


Next, I made a box plot that showed something meaningful about the data and helped during feature selection and model selection. The box plot represents the lengths of spam and ham emails. The median (the central line in each box) marks the typical length of emails of each type, and the median of spam emails is noticeably higher than that of ham emails. The IQR shows the variability of email lengths; here, the wider IQR for spam emails might imply that spammers vary their email lengths to evade detection. The large number of outliers among spam emails likewise suggests attempts to disguise spam by drastically varying email length. If the length of spam emails is highly variable, features that capture this variability (such as standard deviation) might also be helpful. Graphs like this help differentiate the two types of emails, so the model can classify an email as spam or ham more effectively.
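The statistics that a box plot displays can also be computed directly, as in the sketch below. The column names `email` and `spam`, the function name `length_summary`, and the toy messages are assumptions, and length is measured in characters because the text does not say how it was measured.

```python
import pandas as pd


def length_summary(emails: pd.DataFrame) -> pd.DataFrame:
    """Median, quartiles, IQR, and standard deviation of email length per class."""
    lengths = emails["email"].str.len()  # assumption: length in characters
    labels = emails["spam"].map({0: "ham", 1: "spam"})
    grouped = lengths.groupby(labels)
    summary = pd.DataFrame({
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
        "std": grouped.std(),
    })
    # The IQR is the height of the box in a box plot.
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary


# Toy data invented for illustration; 1 = spam, 0 = ham.
emails = pd.DataFrame({
    "email": ["x" * 10, "x" * 20, "x" * 30, "y" * 4, "y" * 6],
    "spam": [1, 1, 1, 0, 0],
})
summary = length_summary(emails)
```

Comparing the `median` and `iqr` columns between the two rows gives the same reading as comparing the two boxes by eye.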

