Assignment 1: Colab and ML basics#
Name: [Double click to edit and add Name]
Team Members (if any): [Add names here]
Instructions#
IMPORTANT: Make sure to
Save a Copy in your Drive or download this notebook to your local machine. Don’t Panic: We will walk through this step-by-step.
Run in Order: Make sure you run the cells from top to bottom.
Look for the Comments: Wherever you see ### YOUR CODE HERE ###, that is where you need to type.
Attribution: This notebook was created by Michael Roman for use in the CYBERSEC 520 course at Duke University. It has been adapted from lecture notes. Gemini 3-Pro was used to format and clean up this notebook.
Step 0: Notebook Setup#
We need to bring in our tools. These are the standard libraries used by Data Scientists everywhere.
Pandas: For handling data like Excel sheets.
Seaborn/Matplotlib: For drawing graphs.
Sklearn: The Machine Learning library.
Action: Click the “Play” button on the cell below.
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Import Machine Learning tools
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn.svm import SVC
# Make the graphs look nice
plt.style.use('seaborn-v0_8-whitegrid')
print("Setup Complete! You are ready to start.")
Setup Complete! You are ready to start.
Step 1: Testing with Penguins#
Before we tackle a Cyber dataset, let’s make sure everything works with a simple dataset: The Palmer Penguins.
Action: Run the cell below to load the data.
# Load the data from a URL
url = "https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv"
penguins = pd.read_csv(url)
# Drop missing values to prevent errors
penguins = penguins.dropna()
# We will just use two math features for the warm up: Bill Length and Flipper Length
# We are trying to predict the 'species'
X_warmup = penguins[['culmen_length_mm', 'flipper_length_mm']]
y_warmup = penguins['species_short']
# Split into Train and Test (80% training, 20% testing)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_warmup, y_warmup, test_size=0.2, random_state=42)
print("Penguin data loaded and split successfully.")
X_warmup.head()
Penguin data loaded and split successfully.
|   | culmen_length_mm | flipper_length_mm |
|---|---|---|
| 0 | 39.1 | 181.0 |
| 1 | 39.5 | 186.0 |
| 2 | 40.3 | 195.0 |
| 4 | 36.7 | 193.0 |
| 5 | 39.3 | 190.0 |
1.1 Train a Simple Model#
Now, we will train a K-Nearest Neighbors (k-NN) model.
Action: Change n_neighbors=5 to a different number (like 1, 3, or 7) if you want, and run the cell.
# 1. Create the model
# You can change n_neighbors to see what happens
model = KNeighborsClassifier(n_neighbors=5)
# 2. Train the model (The "Learning" part)
model.fit(X_train_w, y_train_w)
# 3. Predict on the test set
predictions = model.predict(X_test_w)
# 4. Check the score
score = accuracy_score(y_test_w, predictions)
print(f"Warmup Accuracy: {score:.2f} (or {score*100:.1f}%)")
Warmup Accuracy: 1.00 (or 100.0%)
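The "change n_neighbors and re-run" experiment above can also be done in a single loop. A minimal sketch, using scikit-learn's built-in iris dataset as a stand-in (the penguin CSV needs a download, but the pattern is identical):

```python
# Sketch: sweep n_neighbors and compare test accuracy on a built-in dataset.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

scores = {}
for k in [1, 3, 7]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, knn.predict(X_te))
    print(f"k={k}: accuracy={scores[k]:.2f}")
```

A very small k can memorize noise; a very large k blurs class boundaries. Recording the score for each k makes that trade-off visible.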
Step 2: Cyber Data#
Now that we know the code works, let’s use the real assignment data.
Instructions:
Download the dataset from Canvas or import directly from Github.
If using Google Colab: Click the folder icon on the left, then the upload icon (page with an arrow), and upload your .csv file. If you want to save the file in your Drive, make sure you mount your Drive by running the cell below.
Action: Update the path variable below to match your file’s location.
from google.colab import drive
drive.mount('/content/drive')
# Update this string to match your filename!
path = ""  ### YOUR PATH HERE ###
#note - if you upload to google drive it may read something like this:
#path = "drive/My Drive/CYBERSEC520/MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"
# Load the data
# We use 'try' and 'except' in case the file isn't found
try:
df = pd.read_csv(path)
print("Data loaded successfully!")
except FileNotFoundError:
print("ERROR: File not found. Did you upload it? Did you spell the name right?")
Data loaded successfully!
2.1 Clean and Subsample#
Cyber datasets are huge. To speed up your runs, we will take a random sample (a smaller slice) of the data.
# 1. Clean up weird values (Infinite numbers and empty spots)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
# 2. Subsample
# n=2000 means you will take a random selection of 2000. random_state=42 fixes the random seed for repeatability
df_small = df.sample(n=2000, random_state=42)
print(f"Original size: {len(df)} rows")
print(f"New size: {len(df_small)} rows")
Original size: 225711 rows
New size: 2000 rows
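After subsampling, it is worth checking the class balance of the label column: a heavily imbalanced sample can make accuracy misleading, which is one reason the reflection also asks for F1. A minimal sketch, assuming the class dataset's ' Label' column (note the leading space), with a tiny synthetic frame standing in for df_small:

```python
# Sketch: check the class balance of the label column after subsampling.
import pandas as pd

# Synthetic stand-in for df_small (the real frame comes from pd.read_csv).
df_small = pd.DataFrame({' Label': ['BENIGN'] * 6 + ['DDoS'] * 4})

counts = df_small[' Label'].value_counts()
print(counts)
```

On the real data, the same one-liner (`df_small[' Label'].value_counts()`) shows how many BENIGN vs. attack rows survived the sampling.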
2.2 Prepare the Data#
We need to separate the Features (the math/stats columns) from the Label (Benign vs. Attack).
# For the in-class dataset, the target column is ' Label' (with a leading space).
# Make sure this matches the label column name in your dataset.
# Let's check the columns first
print(df_small.columns)
# Define X (Features) and y (Target)
# We drop the label from X
X = df_small.drop(columns=[' Label'])
y = df_small[' Label']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Cyber data is ready for training.")
Index([' Destination Port', ' Flow Duration', ' Total Fwd Packets',
' Total Backward Packets', 'Total Length of Fwd Packets',
' Total Length of Bwd Packets', ' Fwd Packet Length Max',
' Fwd Packet Length Min', ' Fwd Packet Length Mean',
' Fwd Packet Length Std', 'Bwd Packet Length Max',
' Bwd Packet Length Min', ' Bwd Packet Length Mean',
' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s',
' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max', ' Flow IAT Min',
'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std', ' Fwd IAT Max',
' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean', ' Bwd IAT Std',
' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags', ' Bwd PSH Flags',
' Fwd URG Flags', ' Bwd URG Flags', ' Fwd Header Length',
' Bwd Header Length', 'Fwd Packets/s', ' Bwd Packets/s',
' Min Packet Length', ' Max Packet Length', ' Packet Length Mean',
' Packet Length Std', ' Packet Length Variance', 'FIN Flag Count',
' SYN Flag Count', ' RST Flag Count', ' PSH Flag Count',
' ACK Flag Count', ' URG Flag Count', ' CWE Flag Count',
' ECE Flag Count', ' Down/Up Ratio', ' Average Packet Size',
' Avg Fwd Segment Size', ' Avg Bwd Segment Size',
' Fwd Header Length.1', 'Fwd Avg Bytes/Bulk', ' Fwd Avg Packets/Bulk',
' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk', ' Bwd Avg Packets/Bulk',
'Bwd Avg Bulk Rate', 'Subflow Fwd Packets', ' Subflow Fwd Bytes',
' Subflow Bwd Packets', ' Subflow Bwd Bytes', 'Init_Win_bytes_forward',
' Init_Win_bytes_backward', ' act_data_pkt_fwd',
' min_seg_size_forward', 'Active Mean', ' Active Std', ' Active Max',
' Active Min', 'Idle Mean', ' Idle Std', ' Idle Max', ' Idle Min',
' Label'],
dtype='object')
Cyber data is ready for training.
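Those leading spaces in the column names are a common stumbling block with the CICIDS2017 CSVs. One option (an assumption, not required by the assignment) is to strip the whitespace once, right after loading, so every later lookup can use the clean name. A tiny synthetic frame stands in for the real df here:

```python
# Sketch: strip whitespace from padded column names to avoid KeyErrors later.
import pandas as pd

# Synthetic stand-in with the padded names the CSV actually has.
df = pd.DataFrame({' Flow Duration': [1, 2], ' Label': ['BENIGN', 'DDoS']})

df.columns = df.columns.str.strip()
print(df.columns.tolist())  # ['Flow Duration', 'Label']
```

If you do this, remember to reference 'Label' (no space) everywhere afterwards.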
Step 3: Baseline Model#
We will create a “Baseline”. This is the model with default settings. We need this number to see if we can improve it later.
Action: Fill in the code below. Use KNeighborsClassifier, SVC, or another model if you learned one you like better.
# --- STUDENT AREA ---
# 1. Initialize the model (Use default settings, leave the parentheses empty)
# Hint: clf = KNeighborsClassifier() or clf = SVC()
clf_baseline = SVC() ### YOUR CODE HERE ###
# 2. Fit the model on X_train and y_train
clf_baseline.fit(X_train, y_train) ### YOUR CODE HERE ###
# 3. Predict on X_test
y_pred_base = clf_baseline.predict(X_test) ### YOUR CODE HERE ###
# --- END STUDENT AREA ---
# Calculate Accuracy
baseline_acc = accuracy_score(y_test, y_pred_base)
print(f"Baseline Accuracy: {baseline_acc:.4f}")
Baseline Accuracy: 0.9600
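One detail worth noticing: StandardScaler was imported in Step 0 but never used, and SVC is sensitive to feature scale (flow durations in the millions sit next to flag counts of 0 or 1). A minimal sketch of scaling before fitting, with synthetic data standing in for the cyber features; fitting the scaler on the training split only avoids leaking test-set statistics:

```python
# Sketch: scale features before SVC. Fit the scaler on the training
# split only, then transform both splits with it.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Three synthetic features on wildly different scales.
X_train = rng.normal(size=(80, 3)) * [1, 100, 10000]
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 3)) * [1, 100, 10000]
y_test = (X_test[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X_train)   # learn mean/std from train only
clf = SVC().fit(scaler.transform(X_train), y_train)
acc = clf.score(scaler.transform(X_test), y_test)
print(f"Scaled SVC accuracy: {acc:.2f}")
```

Try the same baseline with and without scaling on the real data and compare the scores.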
Step 4: Hyperparameter Tuning#
Now, play the role of the Scientist. Change the settings (Hyperparameters) to try and beat your baseline score.
If you are using k-NN, try changing:
n_neighbors (try 1, 3, 10, 50)
weights (try 'uniform' vs 'distance')
If you are using SVM, try changing:
C (try 0.1, 1.0, 10, 100, etc.)
kernel (try linear, poly, or rbf)
gamma (try 1.0, 0.1, 0.01, etc.)
Note: gamma is only used for the rbf, poly, and sigmoid kernels.
# --- EXPERIMENT 1 ---
# Change the parameters inside the parentheses
# If you are sticking with k-NN change n_neighbors (try 1, 3, 11) or weights ('uniform', 'distance')
# Example: clf_experiment = KNeighborsClassifier(n_neighbors=11, weights='distance')
#If you are using the SVC (Support Vector Machine Classifier) try changing C (0.1, 1, 10), kernel ('linear', 'poly', 'rbf'), or gamma (0.1, 0.01)
# Note: gamma is only used for 'rbf', 'poly', and 'sigmoid' kernels.
# Example: clf_experiment = SVC(C=1.0, kernel='rbf', gamma=0.1)
clf_experiment = ### YOUR CODE HERE ###
# Train
clf_experiment.fit(X_train, y_train)
# Predict
y_pred_exp = clf_experiment.predict(X_test)
# Score
exp_acc = accuracy_score(y_test, y_pred_exp)
print(f"Experiment Accuracy: {exp_acc:.4f}")
# Compare
if exp_acc > baseline_acc:
print("🎉 Success! You improved the model!")
else:
print("🤔 Hmmm, the baseline was better. Try changing the number again.")
Experiment Accuracy: 0.9650
🎉 Success! You improved the model!
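The reflection in Step 5 asks for an F1 score, and with string labels f1_score needs to be told which class counts as "positive"; for the class dataset that is 'DDoS'. A minimal sketch with toy labels standing in for your real y_test and predictions:

```python
# Sketch: F1 with string labels requires an explicit positive class.
from sklearn.metrics import f1_score

y_true = ['BENIGN', 'DDoS', 'DDoS', 'BENIGN', 'DDoS']
y_pred = ['BENIGN', 'DDoS', 'BENIGN', 'BENIGN', 'DDoS']

f1 = f1_score(y_true, y_pred, pos_label='DDoS')
print(f"F1 (DDoS as positive): {f1:.2f}")  # 0.80
```

On your own data, swap in y_test and y_pred_exp; classification_report (imported in Step 0) prints per-class precision, recall, and F1 in one call.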
Step 5: Documentation & Reflection#
Double click this cell to edit it and answer the questions.
1. What model and dataset did you choose?
Dataset: [e.g., CICIDS2017 DDoS]
Model: [e.g., k-Nearest Neighbors]
2. What was your baseline performance?
Accuracy: [Enter number here]
F1_score: [Enter number here]
3. What hyperparameter(s) did you change?
[e.g., I tried the following combinations of hyperparameters]

| Trial | hyperparameter 1 | hyperparameter 2 | … | F1_score |
|---|---|---|---|---|
| Baseline | default | default | … | performance |
| Trial 1 | 10 | default | … | performance |
| … | | | | |
| Trial n | | | | |
4. What setting worked the best?
The best setting was:
5. Hypothesize why you think that is.
[Write 1-2 sentences. e.g., “I think a higher K worked better because …”]
6. What issues did you run into? Were you able to solve them?
[e.g., My model threw an error when running F1_score, so I had to explicitly map the positive label to DDoS.]
Submission Checklist#
Did you include your name?
Did you list your teammates (if any)?
Did you run all cells?
Did you answer the reflection questions in Step 5?
To Submit: File -> Download -> Download .ipynb