get_data(Dataset)
This function loads a sample dataset from the project's git repository. A list of available datasets can be found here.
Parameters
Dataset: string
Name of the dataset.
from basata.datasets import get_data
df = get_data("titanic")
null(DataFrame)
The null function gives you insight into the number of null values, shown as both a table and a graph.
Parameters
DataFrame: {array-like, sparse matrix}
A pandas DataFrame.
from basata.datasets import get_data
from basata.supervised import *
df = get_data("titanic")
null(df)
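To make the table part of this output concrete, here is a minimal pandas-only sketch of a null-count summary. It is not basata's implementation: the helper name `null_table`, its column names, and the toy data are assumptions made purely for illustration.

```python
import pandas as pd


def null_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of null values per column (hypothetical helper)."""
    counts = df.isnull().sum()
    percent = counts / len(df) * 100
    table = pd.DataFrame({"null_count": counts, "null_percent": percent})
    # List the columns with the most missing values first
    return table.sort_values("null_count", ascending=False)


# Toy data standing in for a real dataset
toy = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, None],
        "fare": [7.25, 71.3, None, 8.05],
        "sex": ["m", "f", "f", "m"],
    }
)
summary = null_table(toy)
print(summary)
```

A bar chart of the same counts could be drawn with `summary["null_count"].plot.bar()`, which requires matplotlib.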
correlation(DataFrame)
The correlation function gives you insight into the correlation between numerical features.
Parameters
DataFrame: {array-like, sparse matrix}
A pandas DataFrame.
from basata.datasets import get_data
from basata.supervised import *
df = get_data("titanic")
correlation(df)
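As a rough picture of what a correlation between numerical features looks like, the sketch below computes a pairwise Pearson correlation matrix with pandas alone. Whether basata uses Pearson correlation, and how it draws the result, is not stated above, so treat the method and the toy data as assumptions.

```python
import pandas as pd

# Toy data standing in for a real dataset
toy = pd.DataFrame(
    {
        "age": [22, 38, 26, 35, 54],
        "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
        "sex": ["m", "f", "f", "f", "m"],  # non-numeric, excluded below
    }
)

# Keep only the numerical columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The result is a symmetric matrix with 1.0 on the diagonal, which is the usual input for a heatmap.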
eda(DataFrame)
The Exploratory Data Analysis (EDA) function is used to explore the data.
Using the sweetviz library, the data is visualized in a variety of ways in an HTML file.
Parameters
DataFrame: {array-like, sparse matrix}
A pandas DataFrame.
from basata.datasets import get_data
from basata.supervised import *
df = get_data("titanic")
eda(df)
FID(DataFrame, ML_model, plot=True, length=5, height=5)
Function for creating a Feature Importance DataFrame (FID) and, optionally, plotting it.
Only models with a feature_importances_ attribute are supported.
Parameters
DataFrame: {array-like, sparse matrix}
A pandas DataFrame.
ML_model: {object}
A machine learning model.
plot: {boolean}
Whether to plot the FID or not.
length: {int}
The length of the plot.
height: {int}
The height of the plot.
from basata.datasets import get_data
from basata.supervised import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
df = get_data("titanic")
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
FID(X_train, model)
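For orientation, a feature importance DataFrame can be assembled directly from a fitted model's feature_importances_ attribute. The sketch below does this with scikit-learn and pandas on synthetic data; the column names `feature` and `importance` are assumptions, and the frame that FID actually returns may be laid out differently.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data so the sketch runs without any external dataset
X_arr, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Pair each column name with its importance and sort, most important first
fid = (
    pd.DataFrame({"feature": X.columns, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(fid)
```

Models without a feature_importances_ attribute (for example k-nearest neighbours) cannot be summarized this way, which matches the restriction noted above.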
compare_models(X, y, random_seed=None, classification=True)
Function for comparing feature importance of different ML models.
Only models with a feature_importances_ attribute are supported (Random Forest, Gradient Boosting, AdaBoost, Decision Tree, XGBoost, CatBoost).
Parameters
X: {array-like, sparse matrix}
The training input samples.
y: array-like of shape (n_samples,)
The target values (class labels in classification, real numbers in regression).
random_seed: {int}
The random seed for the models.
classification: {boolean}
Whether to use classification models (True) or regression models (False).
Work in progress...
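Until an official example is available, the idea of comparing feature importance across models can be sketched with scikit-learn alone. This is an illustration under assumptions, not the compare_models implementation: it covers only four of the listed model families, uses synthetic data, and the table layout is made up for this sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with four named features
X_arr, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=["a", "b", "c", "d"])

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Fit every model on the same data; one column per model, one row per feature
comparison = pd.DataFrame(
    {name: m.fit(X, y).feature_importances_ for name, m in models.items()},
    index=X.columns,
)
print(comparison.round(3))
```

Reading across a row shows whether the models agree on how much a given feature matters.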
automl(X_test, X_train, y_train, y_test, random_seed=None, classification=True)
Function for evaluating classification and regression models.
Parameters
X_test/X_train: {array-like, sparse matrix}
The input samples from the test/training set.
y_test/y_train: array-like of shape (n_samples,)
The target values in the test/training set (class labels in classification, real numbers in regression).
random_seed: {int}
The random seed for the models.
classification: {boolean}
Whether to use classification models (True) or regression models (False).
Work in progress...
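While this function is still in progress, the general pattern of evaluating several models on one train/test split can be sketched as follows. The candidate models, the accuracy metric, and the synthetic data are all assumptions; which models and metrics automl actually uses is not documented here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Hypothetical candidate list, chosen only for this sketch
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Fit each model on the training set and score it on the held-out test set
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Report the models from best to worst
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

For a regression task the same loop would use regressors and a metric such as r2_score instead.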
tuning(X, y, model='rf', GridSearch=False, scoring=accuracy_score, random_seed=None, classification=True)
This function tunes the hyperparameters of a machine learning model using GridSearchCV or RandomizedSearchCV. It takes as input the training data, the target variable, the model to be tuned, the search strategy, the scoring function, the random seed and the classification flag.
The function returns the best model and the best hyperparameters.
Parameters
X: {array-like, sparse matrix}
The training input samples.
y: array-like of shape (n_samples,)
The target values (class labels in classification, real numbers in regression).
model: string
The model to be tuned. Models: 'rf', 'gbdt', 'ab', 'dt', 'knn', 'svm', 'mlp', 'xgb', 'cb', 'lgbm'.
GridSearch: boolean
If True, GridSearchCV is used to tune the hyperparameters, else RandomizedSearchCV is used.
scoring: callable
The scoring function to be used. Scoring: accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score.
random_seed: int
The random seed to be used.
classification: boolean
If True, classification models are used; otherwise regression models are used.
from basata.datasets import get_data
from basata.supervised import *
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
df = get_data("titanic")
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
tuning(X_train, y_train, model='rf', GridSearch=True, scoring=accuracy_score)
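The switch between the two search strategies can be illustrated directly with scikit-learn. The sketch below is an assumption-laden stand-in, not basata's code: the helper `tune`, the small parameter grid, the 3-fold cross-validation, and the synthetic data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Hypothetical search space; the grid basata uses is not documented here
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5, None]}
# Wrap a metric function so the search classes can use it as a scorer
scorer = make_scorer(accuracy_score)


def tune(X, y, grid_search=False, random_seed=None):
    """Return the best estimator and its parameters (hypothetical helper)."""
    estimator = RandomForestClassifier(random_state=random_seed)
    if grid_search:
        # Exhaustive search: tries every combination in param_grid
        search = GridSearchCV(estimator, param_grid, scoring=scorer, cv=3)
    else:
        # Randomized search: samples n_iter combinations from param_grid
        search = RandomizedSearchCV(
            estimator,
            param_grid,
            n_iter=3,
            scoring=scorer,
            cv=3,
            random_state=random_seed,
        )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_


best_model, best_params = tune(X, y, grid_search=True, random_seed=42)
print(best_params)
```

Grid search is exhaustive and therefore slower on large grids; randomized search trades completeness for a fixed budget of trials.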
xai(X_train, X_test, model=None, mode='classification', observation=0, feature_selection='none')
The eXplainable AI (XAI) function is used to explain a model's predictions. It uses LIME to explain the prediction for a selected observation.
Parameters
X_train: {array-like, sparse matrix}
The training input samples.
X_test: {array-like, sparse matrix}
The test input samples.
model: {model}
The model to be explained.
mode: {string}
The mode of the model. Can be 'classification' or 'regression'.
observation: {int}
The observation to be explained.
feature_selection: {string}
The feature selection method. Can be 'none', 'lasso_path' or 'auto'.
from basata.datasets import get_data
from basata.supervised import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
df = get_data("titanic")
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
xai(X_train, X_test, model=model)