General Steps For Implementation Of A Machine Learning Project
· Define the objective: What is the business or research goal? What are you trying to predict, classify, or optimize?
· Determine the type of problem: Classification, regression, clustering, etc.
· Identify success criteria: What metrics (e.g., accuracy, MSE, precision) will define success?
· Gather data: Collect data from sources such as databases, APIs, sensors, or public datasets.
· Assess data quality: Ensure the data is reliable, representative, and sufficient for the task.
· Understand the dataset: Know what each feature (column) and data point (row) represents.
· Handle missing values: Remove, impute, or flag them.
· Remove duplicates: Drop any duplicate rows if necessary.
· Handle outliers: Identify and decide whether to keep, remove, or transform outliers.
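The cleaning steps above can be sketched with pandas. The toy DataFrame, the median imputation choice, and the 1.5×IQR outlier rule are all illustrative assumptions, not part of the original notes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, one duplicate row, one implausible outlier.
df = pd.DataFrame({
    "age":    [25, 32, 32, np.nan, 41, 250],   # 250 is an obvious outlier
    "income": [40_000, 52_000, 52_000, 48_000, 61_000, 58_000],
})

# 1. Handle missing values: impute "age" with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Remove duplicates: drop fully identical rows.
df = df.drop_duplicates()

# 3. Handle outliers: keep only values within 1.5 * IQR of the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

Whether to impute, drop, or flag depends on the task; median imputation is just one common default for skewed numeric columns.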
3.2. Exploratory Data Analysis (EDA)
· Understand the data types: Determine if the features are categorical, numerical, or text.
· Summary statistics: Use descriptive statistics (mean, median, standard deviation) to summarize the dataset.
· Visualizations:
- Univariate analysis: Histograms, box plots (for understanding distributions).
- Bivariate/multivariate analysis: Correlation heatmaps, scatter plots (for relationships between variables).
- Outlier detection: Use box plots or z-scores to spot outliers.
- Class distribution analysis: Particularly important for classification tasks to check for class imbalances.
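The numeric side of these EDA checks (summary statistics, correlations, z-score outliers, class balance) can be sketched as follows; the synthetic height/weight data and the 3-standard-deviation cutoff are assumptions for illustration, and the plotting itself (histograms, box plots, heatmaps) would typically be added with matplotlib or seaborn:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 8, 200),
    "label":  rng.choice(["pos", "neg"], size=200, p=[0.1, 0.9]),  # imbalanced classes
})

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
summary = df.describe()

# Bivariate analysis: correlation matrix of the numeric features.
corr = df[["height", "weight"]].corr()

# Outlier detection via z-scores: flag points more than 3 std devs from the mean.
z = (df["height"] - df["height"].mean()) / df["height"].std()
outliers = df[z.abs() > 3]

# Class distribution: reveals the roughly 9:1 imbalance in "label".
dist = df["label"].value_counts(normalize=True)
```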
· Hypothesis testing:
- One-sample tests: Compare the sample mean to a known value.
- ANOVA: Compare means across three or more groups.
- Chi-squared tests: Examine relationships between categorical variables.
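All three tests are available in scipy.stats. A minimal sketch, with made-up sample data and a hypothetical 2×2 contingency table:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# One-sample t-test: does the sample mean differ from a known value (here, 100)?
sample = rng.normal(103, 10, 50)
t_stat, p_one = stats.ttest_1samp(sample, popmean=100)

# ANOVA: compare means across three groups.
g1, g2, g3 = (rng.normal(m, 5, 30) for m in (50, 52, 60))
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

# Chi-squared test of independence between two categorical variables,
# given as a contingency table (rows: group, cols: outcome yes/no).
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

In each case a small p-value (commonly below 0.05) is taken as evidence against the null hypothesis.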
3.3. Train-Test Split
· Split the dataset into training and test sets (typically 70-80% training, 20-30% testing).
· Optionally, create a validation set or use cross-validation for robust performance evaluation.
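With scikit-learn, the split and cross-validation are one call each. The synthetic dataset and logistic-regression baseline here are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 80/20 split; stratify=y keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold cross-validation on the training portion only,
# leaving the test set untouched for the final evaluation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
```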
3.4. Feature selection techniques
· Exhaustive Search: Test all possible feature combinations to find the best subset.
· Forward Selection: Start with no features and iteratively add the most important ones.
· Backward Elimination: Start with all features and iteratively remove the least important ones.
· Stepwise Selection: Combine forward and backward selection methods.
· Multicollinearity Detection: Identify and remove highly correlated features using Variance Inflation Factor (VIF) or correlation matrices.
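A sketch of forward selection and VIF-based multicollinearity detection. scikit-learn's SequentialFeatureSelector covers forward/backward selection; the VIF helper below is computed by hand (statsmodels also provides one), and the deliberately collinear "f5_dup" column is a constructed example:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Toy regression data; "f5_dup" is deliberately near-collinear with "f0".
X_arr, y = make_regression(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
X["f5_dup"] = 0.9 * X["f0"] + np.random.default_rng(0).normal(0, 0.1, 200)

# Forward selection: start empty, greedily add whichever feature most
# improves the cross-validated score of the base model.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward")  # direction="backward" for elimination
sfs.fit(X, y)
chosen = X.columns[sfs.get_support()].tolist()

# Multicollinearity via VIF: regress each feature on all the others;
# VIF_i = 1 / (1 - R_i^2). A common rule of thumb flags VIF above 5-10.
def vif(df):
    scores = {}
    for col in df.columns:
        rest = df.drop(columns=col)
        r2 = LinearRegression().fit(rest, df[col]).score(rest, df[col])
        scores[col] = 1.0 / (1.0 - r2)
    return scores

vifs = vif(X)
```

Exhaustive search is rarely practical beyond a handful of features, since it requires fitting 2^n subsets.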
3.5. Handling Imbalanced Data (For Classification Problems)
· Oversampling the minority class: Use techniques like SMOTE (Synthetic Minority Over-sampling Technique).
· Undersampling the majority class: Reduce samples from the majority class.
· Class weighting: Assign higher weights to the minority class to penalize misclassifications more.
· Anomaly detection: For extreme imbalances, treat the minority class as anomalies.
(Apply imbalance handling to the training data only, never to the test set.)
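Two of these options sketched with scikit-learn. Note that SMOTE itself lives in the third-party imbalanced-learn package; plain random oversampling is shown here instead, and the roughly 95:5 toy dataset is an assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Roughly 95:5 imbalanced toy problem (class 1 is the minority).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1 -- class weighting: "balanced" weights classes inversely to their
# frequency, so minority misclassifications are penalized more heavily.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Option 2 -- random oversampling of the minority class, applied to the
# TRAINING split only (never the test set).
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
n_extra = (y_tr == 0).sum() - len(minority)   # samples needed to reach balance
extra = rng.choice(minority, size=n_extra)    # drawn with replacement
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])
```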
· Choose the appropriate algorithms for the problem type (classification, regression, etc.).
· Examples:
- Classification: Logistic Regression, Decision Trees, Random Forest, XGBoost, Support Vector Machines (SVM), Neural Networks.
- Regression: Linear Regression, Random Forest Regressor, XGBoost, Gradient Boosting.
- Clustering: K-Means, DBSCAN, Hierarchical Clustering.
· Train the model: Fit the model to the training dataset.
· Ensemble methods: Use techniques like Voting, Bagging, or Boosting if needed.
- Voting Classifier/Regressor:
1. Hard Voting: Takes the majority vote for classification.
2. Soft Voting: Averages the predicted probabilities.
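Hard versus soft voting in scikit-learn; the three base estimators and the synthetic dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("dt", DecisionTreeClassifier(random_state=0)),
              ("rf", RandomForestClassifier(n_estimators=50, random_state=0))]

# Hard voting: each fitted model casts one vote; the majority class wins.
hard = VotingClassifier(estimators, voting="hard").fit(X_tr, y_tr)

# Soft voting: the models' predicted class probabilities are averaged,
# so it requires every base estimator to support predict_proba.
soft = VotingClassifier(estimators, voting="soft").fit(X_tr, y_tr)

acc_hard = hard.score(X_te, y_te)
acc_soft = soft.score(X_te, y_te)
```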
· Cross-validation: Use k-fold cross-validation to ensure the model generalizes well.
· Grid Search: Try all possible combinations of hyperparameters to find the optimal ones.
· Random Search: Randomly sample hyperparameter combinations from a distribution.
· Bayesian Optimization: A more advanced search strategy to find the best hyperparameters based on previous results.
· Examples of hyperparameters: Learning rate, tree depth, number of estimators, regularization parameters (L1, L2).
· Retrain the model: Refit the model on the training data using the best hyperparameters found.
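Grid and random search as implemented in scikit-learn; the random-forest estimator and the small parameter grid are placeholder choices (Bayesian optimization typically requires a third-party library such as Optuna or scikit-optimize):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Grid search: exhaustively tries every combination (2 x 2 = 4 settings,
# each evaluated with 3-fold cross-validation).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
).fit(X, y)

# Random search: samples a fixed number of combinations from the space,
# which scales much better when the grid is large.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": range(10, 200), "max_depth": [3, 5, None]},
    n_iter=5, cv=3, random_state=0,
).fit(X, y)

best = grid.best_params_  # by default, the searcher refits on all data with these settings
```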
· Use relevant evaluation metrics based on the problem:
- Classification: Accuracy, Precision, Recall, F1-Score, AUC-ROC.
- Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R² score.
· Confusion Matrix: Understand the true positives, false positives, true negatives, and false negatives for classification models.
· ROC and Precision-Recall Curves: Assess the performance of classifiers, especially when dealing with imbalanced data.
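The classification metrics and confusion matrix above, computed with scikit-learn on a placeholder model and dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

acc  = accuracy_score(y_te, pred)
prec = precision_score(y_te, pred)
rec  = recall_score(y_te, pred)
f1   = f1_score(y_te, pred)
auc  = roc_auc_score(y_te, proba)   # AUC-ROC needs probabilities, not hard labels

# Confusion matrix layout for binary labels {0, 1}: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
```

For imbalanced problems, precision-recall curves (sklearn.metrics.precision_recall_curve) are usually more informative than accuracy.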
· Ensemble methods:
- Bagging: Train multiple models on random bootstrap subsets of the data and aggregate their predictions (e.g., Random Forest) to reduce variance.
- Boosting: Sequentially train models (e.g., XGBoost, LightGBM) to focus on correcting errors made by previous models.
· Feature selection: Eliminate irrelevant or redundant features to improve performance.
· Regularization: Apply L1 (Lasso) or L2 (Ridge) regularization to reduce model complexity and prevent overfitting.
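The L1/L2 contrast can be seen directly in the fitted coefficients. A sketch on synthetic data where only 5 of 20 features carry signal (the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 5 are actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# L1 (Lasso) drives irrelevant coefficients exactly to zero,
# which acts as implicit feature selection.
lasso = Lasso(alpha=1.0).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))

# L2 (Ridge) shrinks coefficients toward zero but keeps all of them nonzero.
ridge = Ridge(alpha=1.0).fit(X, y)
```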
· Save the model: Export the trained model using libraries like joblib or pickle.
· Deploy the model: Deploy in a production environment, typically as an API (e.g., using Flask, FastAPI, or Django).
· Deploy to cloud: Deploy on cloud services such as AWS, GCP, or Azure, or use containerization with Docker for portability.
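Saving and restoring a trained model with the standard-library pickle module (joblib works the same way and is often preferred for large NumPy-heavy models). In a real deployment, the load step would typically run once at API startup, with `restored.predict` called inside each request handler:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model to disk...
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and load it back, as a serving process would at startup.
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model reproduces the original's predictions exactly.
same = bool((restored.predict(X) == model.predict(X)).all())
```

Only unpickle files from trusted sources; pickle can execute arbitrary code on load.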
10. Model Monitoring And Maintenance
· Monitor performance: Track the model's accuracy, recall, or other key metrics over time.
· Detect data drift: If the input data distribution changes, retrain or fine-tune the model to ensure continued accuracy.
· Retraining: Periodically retrain the model on new data or when performance degrades.
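One simple way to detect data drift on a single numeric feature is a two-sample Kolmogorov-Smirnov test comparing the training-time distribution to recent production data. The synthetic "live" samples and the 0.01 threshold below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Feature distribution at training time vs. two production snapshots.
train_feature = rng.normal(0.0, 1.0, 1000)
live_same     = rng.normal(0.0, 1.0, 1000)   # no drift
live_shifted  = rng.normal(0.8, 1.0, 1000)   # the mean has drifted

# Two-sample KS test: a small p-value means the distributions differ,
# which would trigger investigation or retraining.
_, p_ok    = stats.ks_2samp(train_feature, live_same)
_, p_drift = stats.ks_2samp(train_feature, live_shifted)

drift_detected = p_drift < 0.01
```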
11. Documentation And Reporting
· Document the process: Include a clear explanation of every step (data preprocessing, model selection, training, and evaluation) for reproducibility.
· Generate reports: Present results, insights, and model performance to stakeholders using Jupyter Notebooks, dashboards, or reporting tools like Tableau or Power BI.
12. Data Visualization And Reporting (Using Power BI)
· Create Dashboards: Build interactive dashboards in Power BI to visualize model performance and key metrics.
· Present Insights: Use Power BI to visualize trends and correlations derived from EDA and model predictions.
· Stakeholder Communication: Share findings with stakeholders using Power BI reports for easier understanding.
· Update Reports: Regularly refresh Power BI reports to include new data or model updates.