Cross-sectional tabular data modelling with CNNs

This project is a major part of my Master's thesis at WNE UW. You can expect:
- data preprocessing (missing-value detection and imputation with a previously developed RF-based algorithm, data manipulation, image to vector conversion using Monte Carlo simulation)
- data analysis and transforms (balance check, distribution analysis with box plots and histograms, outlier reduction using quantile clipping, Yeo-Johnson power transform, normalization, a special normalization variant for CNNs)
- n-fold cross-validation study on the training set and testing on the test set
- several algorithms (Logits, Random Forests, XGBoosts, Feedforward Neural Networks, Convolutional Neural Networks)
- hyperparameter optimization for Logits (regularization only), RFs and XGBs
- network building (experiments based on results and learning curves)
- optimized Inception modules for CNNs
- an automatic feature generation method for enlarging CNN inputs, based on repeatedly composing features and arithmetic operations sampled from constructed discrete probability distributions
- comparison of CV results (Wilcoxon testing)
- model training on the entire training set (specifications chosen in CV)
- model testing and comparison
- study on 5 datasets: company bankruptcy prediction for 1-5 year forecast horizons (in progress)
- Python notebooks in English (a ton of Python code)
- GitHub repository
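The automatic feature generation idea mentioned above can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `generate_features`, the operation set `OPS`, and the weights are all hypothetical, and the sketch assumes that each new feature combines two sampled columns with one sampled arithmetic operation.

```python
import random

# Hypothetical operation set; the actual project may use different operations.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def generate_features(rows, n_new, feature_weights, op_weights, seed=0):
    """Append n_new composed features to every row.

    rows: list of equal-length lists of floats (one list per observation).
    feature_weights: discrete distribution over the original columns.
    op_weights: discrete distribution over the operations in OPS.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    op_names = list(OPS)
    recipes = []
    for _ in range(n_new):
        # Draw two source columns and one operation from the discrete distributions.
        i, j = rng.choices(range(n_features), weights=feature_weights, k=2)
        op = rng.choices(op_names, weights=op_weights, k=1)[0]
        recipes.append((i, op, j))
    # Apply the same recipes to every row so the new columns are consistent.
    extended = [
        row + [OPS[op](row[i], row[j]) for i, op, j in recipes] for row in rows
    ]
    return extended, recipes


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
extended, recipes = generate_features(
    rows, n_new=4, feature_weights=[0.5, 0.3, 0.2], op_weights=[1, 1, 1]
)
```

Returning `recipes` alongside the data makes it possible to apply exactly the same generated columns to a test set later.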

Post

Can PCA extract important information from non-significant features? Neural Network case

This project is about boosting Neural Networks with PCA, with other ML algorithms as benchmarks. You can expect:
- data preparation (renaming labels, balance check, standardization)
- development of a Random Forest based data imputation algorithm
- n-fold cross-validation study
- Machine Learning algorithms (Random Forest, XGBoost with hyperparameter optimization)
- 6 feature selection methods to spot non-significant features (RF importance, Mutual Information, Spearman correlation between features and with the target, General to Specific econometric procedure, Lasso logistic regression)
- Neural Network development (architecture, optimizers, activations, regularization, dropout, batch norm, hyperparameters)
- Principal Component Analysis of the dataset
- PCA integration with the networks in CV
- hypothesis verification using the Wilcoxon test for equality of medians
- model comparison
- Python notebook in English (a ton of Python code)
- quality paper-style report in English
- project presentation in English
- GitHub repository
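To make the hypothesis-verification step concrete, here is a minimal, standard-library sketch of an exact two-sided Wilcoxon signed-rank test on paired per-fold scores. It is an illustration only: the function name `wilcoxon_signed_rank` and the fold scores are made up, and in practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
import itertools


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value). Zero differences are dropped. All 2**n sign
    patterns are enumerated, so this is only practical for small n.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = avg_rank
        start = end + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in itertools.product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Made-up per-fold scores for two hypothetical models (not results from the project).
scores_with_pca = [0.81, 0.84, 0.79, 0.86, 0.83]
scores_without_pca = [0.80, 0.80, 0.78, 0.81, 0.85]
stat, p_value = wilcoxon_signed_rank(scores_with_pca, scores_without_pca)
```

With only a handful of folds the smallest attainable two-sided p-value is 2 / 2**n, which is one reason to run enough folds or repetitions before testing.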

Post

Churn Modelling

This project is about churn modelling (a binary classification task) with Machine Learning and Neural Networks in Python. You can expect:
- statistical analysis (Student's t-test and Jarque-Bera test, correlations, Mutual Information)
- feature selection (feature importances, sensitivity, statistical methods)
- feature engineering (monotonic transforms, binarization, label encoding, interactions)
- lots of ML algorithms (Logits, Random Forests, XGBoosts, SVMs, kNNs, Neural Nets, ensembling)
- n-fold cross-validation study
- bootstrap simulation
- several notebooks in Polish (lots of Python code)
- GitHub repository
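As an illustration of what a bootstrap simulation can look like for a classifier, here is a minimal percentile-bootstrap sketch for the accuracy of a set of predictions. The project description does not say which quantity was bootstrapped, so accuracy, the function name `bootstrap_accuracy_ci`, and the default parameters are assumptions.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample observation indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the resampled scores.
    lo_idx = int(alpha / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return scores[lo_idx], scores[hi_idx]


# Made-up labels and predictions, for illustration only.
y_true = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
```

Fixing the seed keeps the interval reproducible between runs.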