Cross-sectional tabular data modelling with CNNs
This project is a major part of my Master's thesis at WNE UW. You can expect:
- data preprocessing (missing-value detection and imputation with a previously developed RF-based algorithm, data manipulation, image to vector conversion using Monte Carlo simulation)
- data analysis and transforms (balance check, distribution analysis with box plots and histograms, outlier reduction using quantile clipping, Yeo-Johnson power transform, normalization, a special normalization variant for CNNs)
- n-fold cross-validation study on the training set and final testing on the test set
- several algorithms (logistic regression, Random Forest, XGBoost, feedforward neural networks, convolutional neural networks)
- hyperparameter optimization for logistic regression (regularization only), Random Forest and XGBoost
- network architecture building (experiments guided by results and learning curves)
- optimized Inception modules for CNNs
- automatic feature generation method for enlarging CNN inputs, based on repeatedly composing sampled features and arithmetic operations drawn from constructed discrete probability distributions
- comparison of CV results (Wilcoxon test)
- model training on the entire training set (specifications chosen in CV)
- model testing and comparison
- study on 5 datasets: corporate bankruptcy prediction for forecast horizons of 1-5 years (in progress)
- Python notebooks in English (a ton of Python code)
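
As a rough illustration of the transform steps listed above (quantile clipping, Yeo-Johnson power transform, normalization), here is a minimal sketch using NumPy and scikit-learn. The function names, the 1%/99% clipping quantiles and the synthetic log-normal data are assumptions made for this example; they are not taken from the notebooks. All statistics are fitted on the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def fit_preprocessor(X_train, lower_q=0.01, upper_q=0.99):
    """Fit clipping bounds and a Yeo-Johnson transform on training data only."""
    # Per-feature clipping bounds (the quantile levels are an assumption)
    lo = np.quantile(X_train, lower_q, axis=0)
    hi = np.quantile(X_train, upper_q, axis=0)
    # standardize=True rescales each transformed feature to zero mean, unit variance
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    pt.fit(np.clip(X_train, lo, hi))
    return lo, hi, pt


def apply_preprocessor(X, lo, hi, pt):
    """Clip with the training bounds, then apply the fitted transform."""
    return pt.transform(np.clip(X, lo, hi))


# Synthetic, heavily skewed data standing in for financial ratios
rng = np.random.default_rng(0)
X_train = rng.lognormal(size=(500, 4))
X_test = rng.lognormal(size=(100, 4))

lo, hi, pt = fit_preprocessor(X_train)
Z_train = apply_preprocessor(X_train, lo, hi, pt)
Z_test = apply_preprocessor(X_test, lo, hi, pt)
```

Reusing the training-set bounds and transform parameters on the test set keeps test information out of the preprocessing step.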
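
The comparison of CV results with the Wilcoxon test can be sketched as below, assuming a paired signed-rank test on per-fold scores of two models evaluated on the same folds. The helper name `compare_cv_scores`, the significance level and the score values are all made up for illustration and are not results from this project.

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_cv_scores(scores_a, scores_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on per-fold scores of two models."""
    stat, p_value = wilcoxon(scores_a, scores_b)
    return stat, p_value, p_value < alpha


# Hypothetical per-fold scores (illustrative numbers only)
scores_a = np.array([0.810, 0.790, 0.840, 0.800, 0.830,
                     0.780, 0.820, 0.850, 0.800, 0.810])
scores_b = np.array([0.820, 0.810, 0.870, 0.840, 0.880,
                     0.795, 0.845, 0.885, 0.845, 0.815])

stat, p_value, significant = compare_cv_scores(scores_a, scores_b)
print(f"W = {stat}, p = {p_value:.4f}, significant = {significant}")
```

The test is paired, so both score arrays must come from the same folds in the same order.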
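
The feature generation idea mentioned above is only summarized in one line, so the following is a loose sketch of the general concept rather than the method used in the thesis. It assumes that pairs of features are drawn from one discrete probability distribution, that an arithmetic operation is drawn from another, and that this is repeated until enough new columns exist. The function `generate_features`, the operation set and the uniform probabilities are hypothetical.

```python
import numpy as np

# Hypothetical operation set; the operations actually used may differ
OPS = {
    "add": np.add,
    "sub": np.subtract,
    "mul": np.multiply,
}


def generate_features(X, n_new, feature_probs, op_probs, seed=0):
    """Append n_new columns built from sampled feature pairs and operations."""
    rng = np.random.default_rng(seed)
    op_names = list(OPS)
    new_cols, recipes = [], []
    for _ in range(n_new):
        # Draw two distinct feature indices from the feature distribution
        i, j = rng.choice(X.shape[1], size=2, replace=False, p=feature_probs)
        # Draw one operation from the operation distribution
        op = str(rng.choice(op_names, p=op_probs))
        new_cols.append(OPS[op](X[:, i], X[:, j]))
        recipes.append((int(i), op, int(j)))
    return np.column_stack([X] + new_cols), recipes


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
feature_probs = np.full(4, 0.25)   # uniform here; could be importance-weighted
op_probs = np.array([1 / 3, 1 / 3, 1 / 3])

X_big, recipes = generate_features(X, n_new=12, feature_probs=feature_probs,
                                   op_probs=op_probs)
```

Returning the recipes alongside the enlarged matrix makes it possible to rebuild the same generated columns on the test set.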