Endogeneity in statistical inference and forecasting

This project is about statistical inference and prediction in the face of identification problems (endogeneity). You can expect:
- data generation according to a self-defined Data Generation Process
- analysis of endogeneity problems (positive and negative mediators, confounders)
- OLS models
- instrumental variables
- simulation
- notebook with Python code
- project report
- GitHub repository

PS. Because I couldn’t find a suitable free photo for the background, I created it myself :)
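To make the endogeneity problem concrete, here is a minimal sketch (not the project's actual code). It assumes a hypothetical Data Generation Process in which an unobserved confounder `u` drives both `x` and `y`, and `z` is a valid instrument. All names and coefficients are made up for illustration.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def simulate(n=20000, beta=1.5, seed=0):
    """Hypothetical DGP: u confounds x and y, z is an instrument for x."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: affects y only through x
    u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [beta * xi + 2.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    return z, x, y

z, x, y = simulate()
beta_ols = cov(x, y) / cov(x, x)  # biased upward: x is correlated with the omitted u
beta_iv = cov(z, y) / cov(z, x)   # simple IV estimator with one instrument
print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  true beta: 1.5")
```

With this setup the OLS slope should drift well above the true coefficient, while the IV estimate should stay close to it.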

News headers classification - NLP

This project is about news headers classification (text mining), but its core was preprocessing and feature engineering. You can expect:
- target variable analysis
- a lot of data preprocessing (cleaning with regex, stoplist building, lemmatization, all verified with word clouds)
- a lot of feature engineering:
  - title length measures (before and after preprocessing, mean number of words, normalized)
  - sentiment estimated with VADER (normalized)
  - named entity recognition with spaCy (groups of tags, most frequent tags)
  - n-grams (most frequent unigrams and bigrams)
  - GloVe word embeddings (100-dimensional vectors)
- model validation and architecture selection
- final two-input neural network architecture:
  - first branch based on pretrained GloVe embeddings, 1D convolutions and 1D max pooling
  - second branch based on engineered features (65 in total, sparse) and a dense layer
  - concatenation to the main branch with two more dense layers with regularization and dropout
  - softmax classifier with 4 outputs
- model evaluation and amazing test set results (at least 90% accuracy for each class, test set confusion matrix below)
- Jupyter Notebook in English with a ton of Python code :)
- GitHub repository
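The n-gram step from the feature engineering list can be sketched with the standard library alone. The headlines and the stoplist below are hypothetical and stand in for the cleaned titles. This is an illustration, not the notebook's code.

```python
from collections import Counter

# Hypothetical, already-cleaned headlines and stoplist (illustrative only).
headlines = [
    "stock market hits record high",
    "stock market falls on rate fears",
    "new phone hits the market",
]
stoplist = {"on", "the"}

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

unigrams, bigrams = Counter(), Counter()
for title in headlines:
    # Stop words are dropped first, so bigrams may join words that were not adjacent.
    tokens = [t for t in title.split() if t not in stoplist]
    unigrams.update(tokens)
    bigrams.update(ngrams(tokens, 2))

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The counts of the most frequent n-grams can then be turned into per-title indicator features.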

Cross-sectional tabular data modelling with CNNs

This project is a very big part of my Master's thesis at WNE UW. You can expect:
- data preprocessing (missing detection and imputation with an earlier developed RF based algorithm, data manipulation, image to vector conversion using Monte Carlo simulation)
- data analysis and transforms (balance check, distribution analysis (box plots and histograms), outlier reduction using quantile clipping, Yeo-Johnson power transform, normalization, special normalization variant for CNNs)
- n-fold Cross Validation study on the training set and testing on the test set
- several algorithms (Logits, Random Forests, XGBoosts, Feedforward Neural Networks, Convolutional Neural Networks)
- hyperparameter optimization for Logits (regularization only), RFs and XGBs
- network building (experiments based on results and learning curves)
- optimized Inception modules for CNNs
- automatic feature generation method for enlarging CNN inputs, based on repeatedly composing sampled features and arithmetic operations from created discrete probability distributions
- CV results comparison (Wilcoxon testing)
- model training on the entire training set (specifications chosen in CV)
- model testing and comparison
- study on 5 datasets: company bankruptcy prediction for 1-5 year forecast horizons (in progress)
- Python notebooks in English (a ton of Python code)
- GitHub repository
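Two of the transforms listed above, quantile clipping and the Yeo-Johnson power transform, can be sketched as follows. The clipping quantiles, the fixed `lam` value and the sample data are assumptions for illustration. In practice lambda is usually estimated from the data, which this sketch does not do.

```python
import math

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def clip_quantiles(values, lower=0.01, upper=0.99):
    """Pull values outside the [lower, upper] quantile range back to the bounds."""
    s = sorted(values)
    lo, hi = quantile(s, lower), quantile(s, upper)
    return [min(max(v, lo), hi) for v in values]

def yeo_johnson(v, lam):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if v >= 0:
        return math.log1p(v) if lam == 0 else ((v + 1) ** lam - 1) / lam
    return -math.log1p(-v) if lam == 2 else -((1 - v) ** (2 - lam) - 1) / (2 - lam)

data = [-50.0, -1.2, -0.3, 0.0, 0.4, 1.1, 2.5, 80.0]  # hypothetical feature with outliers
clipped = clip_quantiles(data, 0.1, 0.9)
transformed = [yeo_johnson(v, lam=0.5) for v in clipped]
print(clipped)
print(transformed)
```

Unlike Box-Cox, the Yeo-Johnson transform accepts zero and negative values, which is convenient for financial ratios.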

IMDB movie reviews - sentiment analysis

This project is about sentiment analysis (text mining). You can expect:
- data preparation (nulls and balance check, regex cleaning, label decoding, review length constraint, tokenization, vocabulary building)
- working with pretrained word embeddings (GloVe)
- data analysis (embedding coverage, detailed cleaning with regex, stop words, word clouds and count plots)
- modelling (CNNs, RNNs, VADER) with Cross Validation and error analysis (confusion matrix, analysis of misclassified examples)
- experiments (review length and GloVe dimension sensitivity)
- 86% test set accuracy
- Jupyter Notebook report in English (a lot of Python code)
- GitHub repository
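Embedding coverage, mentioned in the data analysis step, can be measured in two ways: the share of distinct words that have a pretrained vector, and the share of all token occurrences that do. A minimal sketch, with hypothetical reviews and a tiny stand-in for the GloVe vocabulary:

```python
from collections import Counter

def embedding_coverage(token_counts, embedding_vocab):
    """Return (share of distinct words covered, share of token occurrences covered)."""
    known = {w: c for w, c in token_counts.items() if w in embedding_vocab}
    vocab_cov = len(known) / len(token_counts)
    text_cov = sum(known.values()) / sum(token_counts.values())
    return vocab_cov, text_cov

reviews = ["great movie great cast", "terrible plot zzzz"]      # hypothetical reviews
counts = Counter(tok for r in reviews for tok in r.split())
glove_vocab = {"great", "movie", "cast", "terrible", "plot"}   # stand-in for GloVe keys
vocab_cov, text_cov = embedding_coverage(counts, glove_vocab)
print(f"vocabulary coverage: {vocab_cov:.2%}, text coverage: {text_cov:.2%}")
```

Low text coverage usually points to cleaning problems (leftover markup, glued punctuation) rather than truly rare words.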

Currency exchange - Markov switching model

This project is about the differences between floating, fixed and mixed exchange rates. You can expect:
- working on real data (from fxtop.com)
- analysis period selection
- data analysis (realization plots)
- cointegration testing (two-step Engle-Granger procedure (1987))
- statistical tests (Augmented Dickey-Fuller, Breusch-Godfrey, F, Ljung-Box)
- model selection (information criteria)
- Markov switching modelling
- hypothesis verification (rate swings)
- inference and interpretations
- quality paper-style report in Polish
- GitHub repository
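To give a feel for what a Markov switching model assumes, the sketch below simulates a two-regime series in which a hidden state follows a Markov chain and sets the mean and volatility of each observation. The regime parameters are hypothetical. Estimating such a model from real exchange rates, as done in the report, is not shown here.

```python
import random

def simulate_markov_switching(n, p_stay, means, sigmas, seed=0):
    """Simulate n observations from a two-state Markov switching process.

    p_stay[s] is the probability of remaining in state s from one period to the next.
    """
    rng = random.Random(seed)
    state, states, series = 0, [], []
    for _ in range(n):
        states.append(state)
        series.append(rng.gauss(means[state], sigmas[state]))
        if rng.random() > p_stay[state]:  # leave the current regime
            state = 1 - state
    return states, series

# Hypothetical calm (state 0) and turbulent (state 1) regimes.
states, series = simulate_markov_switching(
    n=5000, p_stay=(0.95, 0.90), means=(0.0, 0.0), sigmas=(0.2, 1.0)
)
stay_share = sum(a == b for a, b in zip(states, states[1:])) / (len(states) - 1)
print(f"share of periods without a regime change: {stay_share:.3f}")
```

High staying probabilities produce long calm stretches interrupted by volatile episodes, which is the pattern the rate swings hypotheses are about.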

Can PCA extract important information from non-significant features? Neural Network case

This project is about boosting Neural Networks with PCA (other ML algorithms as benchmarks). You can expect:
- data preparation (renaming labels, balance check, standardization)
- development of a Random Forest based data imputation algorithm
- n-fold Cross Validation study
- Machine Learning algorithms (Random Forest, XGBoost with hyperparameter optimization)
- 6 feature selection methods to spot non-significant features (RF importance, Mutual Information, Spearman correlation between features and with target, General to Specific econometrics procedure, Lasso logistic regression)
- Neural Networks development (architecture, optimizers, activations, regularization, dropout, batch norm, hyperparameters)
- Principal Component Analysis of the dataset
- PCA integration with Nets in CV
- hypothesis verification using the Wilcoxon test for equality of medians
- model comparison
- Python notebook in English (a ton of Python code)
- quality paper-style report in English
- project presentation in English
- GitHub repository
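The hypothesis verification step can be illustrated with a hand-rolled Wilcoxon test on cross-validation scores. The sketch assumes the paired (signed-rank) variant with a normal approximation and no continuity correction. The fold accuracies are hypothetical, and the project itself may have used a library implementation instead.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Paired Wilcoxon signed-rank test; returns (W, two-sided p) via normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p_value = math.erfc(abs((w - mean) / sd) / math.sqrt(2))
    return w, p_value

# Hypothetical fold accuracies in percent: plain network vs network with PCA inputs.
nn_plain = [80, 82, 79, 81, 83, 80, 78, 82, 81, 79]
nn_pca = [83, 84, 80, 85, 83, 86, 77, 89, 89, 84]
w, p_value = wilcoxon_signed_rank(nn_plain, nn_pca)
print(f"W = {w}, p = {p_value:.4f}")
```

With only a handful of folds the normal approximation is rough, so an exact test is the safer choice for real comparisons.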

Rainfall modelling with OLS and Kernel Regression

This project is about rainfall modelling and assumptions testing. You can expect:
- working on real data from the Polish Institute of Meteorology and Water Management
- data cleaning and analysis
- econometric modelling (OLS, Kernel Regression)
- OLS assumptions testing (RESET (several alternatives), Breusch-Pagan, Breusch-Godfrey, Jarque-Bera, Rescaled Moments, VIFs)
- model selection (information criteria)
- performance measurement (MSE, RMSE, MAE, MAPE, R squared and adjusted R squared, F test, scatterplots)
- data transformations (Principal Component Analysis, Box-Cox power transform)
- “forecasting” (see the report for why the quotation marks are used ;))
- discussion about endogeneity
- Python notebook (and many HTML files with models for many weather stations)
- quality paper-style report in Polish
- GitHub repository
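The performance measures listed above follow directly from their definitions. The helper below is a minimal sketch with made-up numbers. The function name and the `n_features` argument are assumptions, not taken from the notebook.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, MAPE (in %), R squared and adjusted R squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when any true value is zero (e.g. a dry period).
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "MAPE": mape, "R2": r2, "adj_R2": adj_r2}

y_true = [2.0, 4.0, 6.0, 8.0]  # hypothetical observed values
y_pred = [2.5, 3.5, 6.5, 7.5]  # hypothetical model predictions
metrics = regression_metrics(y_true, y_pred, n_features=1)
print(metrics)
```

Adjusted R squared penalizes additional regressors, which matters when comparing specifications of different size.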

Cryptocurrencies portfolio

This project is about estimating the conditional variance function and Value at Risk for a cryptocurrency portfolio. You can expect:
- building a cryptocurrency portfolio weighted by market capitalization
- data scraping from coinmarketcap.com
- time series EDA (realization plot, returns plot, ACFs, PACFs, returns distribution, descriptive statistics)
- hypothesis testing (Durbin-Watson, ARCH LM, Jarque-Bera, Ljung-Box tests)
- time series modelling (ARMA-GARCH family models, residuals analysis, model selection, rolling window estimation)
- Value at Risk estimation, sensitivity analysis
- forecasting
- R Markdown report in Polish
- GitHub repository
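Market capitalization weighting and a Value at Risk figure can be sketched in a few lines. The project itself is written in R and estimates VaR from ARMA-GARCH models. The Python sketch below swaps in the much simpler historical (empirical quantile) VaR, and all capitalizations and returns are hypothetical.

```python
import math

def cap_weights(market_caps):
    """Portfolio weights proportional to market capitalization."""
    total = sum(market_caps.values())
    return {coin: cap / total for coin, cap in market_caps.items()}

def portfolio_returns(weights, returns):
    """Weighted sum of simple returns, assuming rebalancing to fixed weights each period."""
    n = len(next(iter(returns.values())))
    return [sum(weights[c] * returns[c][t] for c in weights) for t in range(n)]

def historical_var(returns, alpha=0.05):
    """Historical VaR: minus the ceil(alpha * n)-th smallest return (positive = loss)."""
    ordered = sorted(returns)
    k = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return -ordered[k]

caps = {"BTC": 600.0, "ETH": 300.0, "XRP": 100.0}  # hypothetical capitalizations
rets = {                                            # hypothetical daily returns
    "BTC": [0.02, -0.04, 0.01, 0.03],
    "ETH": [0.05, -0.08, 0.00, 0.02],
    "XRP": [-0.01, -0.10, 0.04, 0.06],
}
weights = cap_weights(caps)
port = portfolio_returns(weights, rets)
var_50 = historical_var(port, alpha=0.5)  # coarse alpha only because the sample is tiny
print(weights, port, var_50)
```

A model-based VaR replaces the empirical quantile with a quantile of the forecast conditional distribution, which lets the risk estimate react to volatility clustering.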

Churn Modelling

This project is about churn modelling (binary classification task) with Machine Learning and Neural Networks in Python. You can expect:
- statistical analysis (Student's t and Jarque-Bera tests, correlations, Mutual Information)
- feature selection (feature importances, sensitivity, statistical methods)
- feature engineering (monotonic transforms, binarization, label encoding, interactions)
- lots of ML algorithms (Logits, Random Forests, XGBoosts, SVMs, kNNs, Neural Nets, ensembling)
- n-fold Cross Validation study
- bootstrap simulation
- several notebooks in Polish (lots of Python code)
- GitHub repository
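One common use of bootstrap simulation in a classification study is a confidence interval for test accuracy. The source does not say what the bootstrap was applied to, so this is an assumed use case with hypothetical labels and predictions. The function name is made up for illustration.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    hits = [int(t == p) for t, p in zip(y_true, y_pred)]
    # Resample per-observation hits with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Hypothetical test set: 200 customers, 160 classified correctly.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 85 + [0] * 15 + [0] * 75 + [1] * 25
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Overlapping intervals for two models are a hint that the difference in accuracy may be down to sampling noise.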

Companies bankruptcy modelling with econometric methods

This project is about modelling the probability of company bankruptcy with econometric methods. You can expect:
- paper-style report (in Polish) with quality tables, literature review, methodology description etc.
- statistical analysis (distributions, correlations, VIFs)
- econometric modelling (linear probability model, logit regression, probit regression)
- hypothesis testing (Student's t, z tests, linktest)
- marginal effects (computation, interpretation)
- ROC curves
- cutoff optimization
- bootstrap simulation (Altman Z-Score follow up)
- huge Jupyter Notebook in Polish (lots of Python code)
- GitHub repository
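Cutoff optimization on a ROC curve can be done with several criteria. The sketch below assumes Youden's J statistic (true positive rate minus false positive rate), which may differ from the criterion used in the report. The labels and predicted probabilities are hypothetical.

```python
def best_cutoff(y_true, scores):
    """Return (cutoff, J) maximizing Youden's J = TPR - FPR over the observed scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_j, best_c = float("-inf"), None
    for c in sorted(set(scores)):
        # Classify as bankrupt (1) when the predicted probability is at least c.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= c)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= c)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_c = j, c
    return best_c, best_j

# Hypothetical labels (1 = bankrupt) and predicted probabilities from a logit model.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.35, 0.5, 0.7, 0.8, 0.9]
cutoff, j_stat = best_cutoff(y_true, scores)
print(f"best cutoff = {cutoff}, Youden's J = {j_stat:.2f}")
```

When missing a bankruptcy is costlier than a false alarm, a cost-weighted criterion would push the cutoff lower than Youden's J does.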