
BASATA is the Arabic word for simplicity. The purpose of the BASATA library is to simplify coding while providing profound insights. You can use BASATA to write code that is easy to read, understand, and maintain.

The BASATA library is used to automate machine learning tasks. Building on the spirit of AutoML, BASATA is a framework for analysing data and building machine learning models.

The library has been tested on real use cases and aims to become a go-to tool for professional Data Scientists.

The current version supports supervised machine learning and will be expanded to unsupervised learning, followed by time series analysis. The project is open source and will grow through contributions from its users. Visit the GitHub page for further insights and examples.

Datasets

Get data

get_data(Dataset)

This function loads sample datasets from the library's GitHub repository, where a list of available datasets can also be found.

Parameters

Dataset: string
Name of the dataset.


Example

    from basata.datasets import get_data
    df = get_data("titanic")

Supervised ML

Null

null(DataFrame)

The null function gives you insight into the number of null values, presented as both a table and a graph.

Parameters

DataFrame: pandas.DataFrame
The input DataFrame.


Example

    from basata.datasets import get_data
    from basata.supervised import *

    df = get_data("titanic")
    null(df)

Correlation

correlation(DataFrame)

The correlation function gives you insight into the correlations between numerical features.

Parameters

DataFrame: pandas.DataFrame
The input DataFrame.


Example

    from basata.datasets import get_data
    from basata.supervised import *

    df = get_data("titanic")
    correlation(df)

EDA

eda(DataFrame)

The Exploratory Data Analysis (EDA) function is used to explore the data.
Using the sweetviz library, it visualizes the data in a variety of ways in an HTML report.

Parameters

DataFrame: pandas.DataFrame
The input DataFrame.


Example

    from basata.datasets import get_data
    from basata.supervised import *

    df = get_data("titanic")
    eda(df)

FID

FID(DataFrame, ML_model, plot=True, length=5, height=5)

Function for creating a Feature Importance Dataframe (FID) and plotting it.
Only models with a feature_importances_ attribute are supported.

Parameters

DataFrame: pandas.DataFrame
The input DataFrame.

ML_model: object
A fitted machine learning model.

plot: boolean
Whether to plot the FID.

length: int
The length of the plot.

height: int
The height of the plot.


Example

    from basata.datasets import get_data
    from basata.supervised import *
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = get_data("titanic")

    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    FID(X_train, model)

Compare Models

compare_models(X, y, random_seed=None, classification=True)

Function for comparing the feature importances of different ML models.
Only models with a feature_importances_ attribute are supported (Random Forest, Gradient Boosting, AdaBoost, Decision Tree, XGBoost, CatBoost).

Parameters

X: {array-like, sparse matrix}
The training input samples.

y: array-like of shape (n_samples,)
The target values (class labels in classification, real numbers in regression).

random_seed: int
The random seed for the models.

classification: boolean
Whether the models are classifiers or regressors.


Example
                  
                  Work in progress...
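
Until an official example is added, here is a minimal sketch that follows the signature above and reuses the dataset and split from the other examples:

    from basata.datasets import get_data
    from basata.supervised import *
    from sklearn.model_selection import train_test_split

    df = get_data("titanic")

    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Compare the feature importances of the supported models on the training data
    compare_models(X_train, y_train, random_seed=42, classification=True)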
                  
                

AutoML

automl(X_test, X_train, y_train, y_test, random_seed=None, classification=True)

Function for evaluating classification and regression models.

Parameters

X_test/X_train: {array-like, sparse matrix}
The input samples from the test/training set.

y_train/y_test: array-like of shape (n_samples,)
The target values from the training/test set (class labels in classification, real numbers in regression).

random_seed: int
The random seed for the models.

classification: boolean
Whether the models are classifiers or regressors.


Example
                  
                  Work in progress...
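
Until an official example is added, here is a minimal sketch that follows the signature above (note the documented argument order: X_test before X_train) and reuses the dataset and split from the other examples:

    from basata.datasets import get_data
    from basata.supervised import *
    from sklearn.model_selection import train_test_split

    df = get_data("titanic")

    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Evaluate a suite of classification models on the train/test split
    automl(X_test, X_train, y_train, y_test, random_seed=42, classification=True)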
                  
                

Tuning

tuning(X, y, model='rf', GridSearch=False, scoring=accuracy_score, random_seed=None, classification=True)

This function is used to tune the hyperparameters of a machine learning model using GridSearchCV or RandomizedSearchCV. It takes as input the training data, the target variable, the model to be tuned, the scoring function, the random seed and the classification flag.

The function returns the best model and the best hyperparameters.

Parameters

X: {array-like, sparse matrix}
The training input samples.

y: array-like of shape (n_samples,)
The target values (class labels in classification, real numbers in regression).

model: string
The model to be tuned. Models: 'rf', 'gbdt', 'ab', 'dt', 'knn', 'svm', 'mlp', 'xgb', 'cb', 'lgbm'.

GridSearch: boolean
If True, GridSearchCV is used to tune the hyperparameters; otherwise RandomizedSearchCV is used.

scoring: callable
The scoring function to be used. Options: accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score.

random_seed: int
The random seed to be used.

classification: boolean
If True, classification models are tuned; otherwise regression models.


Example

    from basata.datasets import get_data
    from basata.supervised import *
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = get_data("titanic")

    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Tune a random forest with an exhaustive grid search;
    # returns the best model and its hyperparameters
    tuning(X_train, y_train, model='rf', GridSearch=True, scoring=accuracy_score)

XAI

xai(X_train, X_test, model=None, mode='classification', observation=0, feature_selection='none')

The eXplainable AI (XAI) function is used to explain the model's predictions. It uses LIME to explain the prediction for a single observation.

Parameters

X_train: {array-like, sparse matrix}
The training input samples.

X_test: {array-like, sparse matrix}
The test input samples.

model: object
The model to be explained.

mode: string
The mode of the model. Can be 'classification' or 'regression'.

observation: int
The index of the observation to be explained.

feature_selection: string
The feature selection method. Can be 'none', 'lasso_path', or 'auto'.


Example

    from basata.datasets import get_data
    from basata.supervised import *
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = get_data("titanic")

    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    xai(X_train, X_test, model=model)

Time Series

Diff A

...

Diff B

...

Diff C