I’m glad you are here.
Recent Projects
Endogeneity in statistical inference and forecasting
This project is about statistical inference and prediction in the face of identification problems (endogeneity). You can expect:
- data generation according to a self-defined Data Generating Process
- analysis of endogeneity problems (positive and negative mediators, confounders)
- OLS models
- instrumental variables
- simulation
- notebook with Python code
- project report
- GitHub repository
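To give a flavour of the endogeneity problem, here is a minimal sketch and not the project's actual code. It assumes a toy Data Generating Process with one unobserved confounder `u` and one instrument `z`; the coefficient values, the sample size and all function names are illustrative assumptions. The OLS slope absorbs the confounder's effect, while the simple instrumental-variable (Wald) estimator recovers the true slope.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=1.5, seed=0):
    """Toy DGP (an assumption for illustration):
    u is an unobserved confounder, z is an instrument that affects y only through x.
    """
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [beta * xi + 2.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    return z, x, y


z, x, y = simulate()

# OLS slope of y on x: biased upwards because u drives both x and y.
ols_slope = cov(x, y) / cov(x, x)

# IV (Wald) estimator: uses only the variation in x that comes from z.
iv_slope = cov(z, y) / cov(z, x)

print(f"true slope: 1.5, OLS: {ols_slope:.3f}, IV: {iv_slope:.3f}")
```

In this toy setup the OLS estimate converges to roughly 1.5 + 2/3 rather than 1.5, because Cov(x, u) / Var(x) = 1/3 and the confounder enters `y` with coefficient 2.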
PS. Because I couldn’t find a suitable free photo for the background, I created it myself :)
News headers classification - NLP
This project is about news headers classification (text mining field), but its core was preprocessing and feature engineering. You can expect:
- target variable analysis
- a lot of data preprocessing (cleaning with regex, stoplist building, lemmatization, all verified with word clouds)
- a lot of feature engineering
  - title length measures (before and after preprocessing, mean number of words, normalized)
  - sentiment estimated with VADER (normalized)
  - named entity recognition with spaCy (groups of tags, most frequent tags)
  - n-grams (most frequent unigrams and bigrams)
  - GloVe word embeddings (100-dimensional vectors)
- model validation and architecture selection
- final two-input neural network architecture:
  - first branch based on pretrained GloVe embeddings, 1D convolutions and 1D max pooling
  - second branch based on engineered features (65 in total, sparse) and a dense layer
  - concatenation to the main branch with two more dense layers with regularization and dropout
  - softmax classifier with 4 outputs
- model evaluation and amazing test set results (at least 90% accuracy for each class, test set confusion matrix below)
- Jupyter Notebook in English with a ton of Python code :)
- GitHub repository
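The preprocessing and feature engineering steps above can be sketched roughly as follows. This is a simplified illustration using only the standard library, not the notebook's code: the tiny `STOPLIST`, the example titles and the function names are assumptions, and lemmatization, VADER, spaCy and GloVe are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stoplist; a real one would be much longer.
STOPLIST = {"a", "an", "the", "as", "of", "in", "to", "for", "on"}


def clean_title(title):
    """Lowercase, replace everything except letters and whitespace, drop stoplist words."""
    text = re.sub(r"[^a-z\s]", " ", title.lower())
    return [tok for tok in text.split() if tok not in STOPLIST]


def ngrams(tokens, n):
    """Return consecutive n-grams as tuples, e.g. bigrams for n=2."""
    return list(zip(*(tokens[i:] for i in range(n))))


def length_features(title):
    """Title length before and after preprocessing, plus the share of words kept."""
    raw = title.split()
    clean = clean_title(title)
    return {
        "words_raw": len(raw),
        "words_clean": len(clean),
        "kept_ratio": len(clean) / len(raw) if raw else 0.0,
    }


titles = [
    "Stocks rally as the Fed holds rates!",
    "Fed holds rates for a third time",
]

# Count bigrams across all cleaned titles to find the most frequent ones.
bigram_counts = Counter(bg for t in titles for bg in ngrams(clean_title(t), 2))

print(length_features(titles[0]))
print(bigram_counts.most_common(2))
```

Counts like these can be turned into sparse indicator features (one column per frequent n-gram), which is one plausible way to arrive at a sparse engineered-feature vector alongside the embedding input.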
Cross-sectional tabular data modelling with CNNs
This project is a major part of my Master's thesis at WNE UW. You can expect:
- data preprocessing (missing value detection and imputation with an earlier developed RF-based algorithm, data manipulation, image to vector conversion using Monte Carlo simulation)
- data analysis and transforms (balance check, distribution analysis (box plots and histograms), outlier reduction using quantile clipping, Yeo-Johnson power transform, normalization, special normalization variant for CNNs)
- n-fold cross-validation study on the training set and testing on the test set
- several algorithms (Logits, Random Forests, XGBoosts, Feedforward Neural Networks, Convolutional Neural Networks)
- hyperparameter optimization for Logits (regularization only), RFs and XGBs
- network building (experiments based on results and learning curves)
- optimized Inception modules for CNNs
- automatic feature generation method for enlarging CNN inputs, based on repeatedly composing sampled features and arithmetic operations drawn from created discrete probability distributions
- CV results comparison (Wilcoxon testing)
- model training on the entire training set (specifications chosen in CV)
- model testing and comparison
- study on 5 datasets - company bankruptcy prediction for forecast horizons of 1-5 years (in progress)
- Python notebooks in English (a ton of Python code)
- GitHub repository
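Two of the transforms listed above, quantile clipping and the Yeo-Johnson power transform, can be sketched in a few lines. This is an illustrative stand-alone version, not the thesis code: the 5%/95% clipping bounds, the fixed `lam = 0.5` and the function names are assumptions. In practice the Yeo-Johnson parameter is usually estimated from the data, for example by maximum likelihood as scikit-learn's `PowerTransformer` does.

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def clip_to_quantiles(values, lower=0.05, upper=0.95):
    """Pull values outside the [lower, upper] quantile range back to its bounds."""
    lo, hi = quantile(values, lower), quantile(values, upper)
    return [min(max(v, lo), hi) for v in values]


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -((1 - x) ** (2 - lam) - 1) / (2 - lam)


# Toy feature: values 1..100 plus one extreme outlier.
data = list(range(1, 101)) + [1000]

clipped = clip_to_quantiles(data)                   # outlier reduced to the 95% quantile
transformed = [yeo_johnson(v, 0.5) for v in clipped]  # lam=0.5 is an arbitrary choice

print(min(clipped), max(clipped))
print(round(transformed[0], 3), round(transformed[-1], 3))
```

Unlike the Box-Cox transform, Yeo-Johnson is defined for zero and negative inputs, which matters for features such as profit or loss figures that can take either sign.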