News headers classification - NLP

This project is about news headers classification (text mining), but its core is preprocessing and feature engineering. You can expect:

- target variable analysis
- a lot of data preprocessing (cleaning with regex, stoplist building, lemmatization, all verified with word clouds)
- a lot of feature engineering:
  - title length measures (before and after preprocessing, mean number of words, normalized)
  - sentiment estimated with VADER (normalized)
  - named entity recognition with spaCy (groups of tags, most frequent tags)
  - n-grams (most frequent unigrams and bigrams)
  - GloVe word embeddings (100-dimensional vectors)
- model validation and architecture selection
- final two-input neural network architecture:
  - first branch based on pretrained GloVe embeddings, 1D convolutions and 1D max pooling
  - second branch based on engineered features (65 in total, sparse) and a dense layer
  - concatenation to the main branch, followed by two more dense layers with regularization and dropout
  - softmax classifier with 4 outputs
- model evaluation and strong test set results (at least 90% accuracy for each class, test set confusion matrix below)
- Jupyter Notebook in English with a ton of Python code :)

GitHub repository
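To make the preprocessing and n-gram steps listed above more concrete, here is a minimal sketch using only the Python standard library. It is not the notebook's actual code: the stoplist, the example titles, and the function names (`clean_title`, `ngrams`, `length_features`) are all assumptions made for illustration, and lemmatization is left out.

```python
import re
from collections import Counter

# Hypothetical stoplist; the project built its own, which is not shown here.
STOPLIST = {"the", "a", "an", "of", "in", "on", "to", "for", "and", "as"}


def clean_title(title):
    """Lowercase, replace non-letter characters with spaces, drop stoplist words."""
    text = re.sub(r"[^a-z\s]", " ", title.lower())
    return [tok for tok in text.split() if tok not in STOPLIST]


def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def length_features(title):
    """Simple title length measures before and after preprocessing."""
    return {
        "chars_raw": len(title),
        "words_raw": len(title.split()),
        "words_clean": len(clean_title(title)),
    }


# Made-up example titles, not taken from the dataset.
titles = [
    "Stocks rally as the Fed holds rates",
    "Fed holds rates, stocks rally in Asia",
]

# Count bigrams over all cleaned titles and show the most frequent ones.
bigram_counts = Counter(
    bigram for title in titles for bigram in ngrams(clean_title(title), 2)
)
print(bigram_counts.most_common(3))
print(length_features(titles[1]))
```

In a real pipeline, counts like these would typically be turned into indicator features for the most frequent unigrams and bigrams, which is one plausible way to end up with a sparse engineered feature vector.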

Post

IMDB movie reviews - sentiment analysis

This project is about sentiment analysis (text mining). You can expect:

- data preparation (null and class balance checks, regex cleaning, label decoding, review length constraint, tokenization, vocabulary building)
- working with pretrained word embeddings (GloVe)
- data analysis (embedding coverage, detailed cleaning with regex, stop words, word clouds and count plots)
- modelling (CNNs, RNNs, VADER) with cross-validation and error analysis (confusion matrix, analysis of misclassified examples)
- experiments (sensitivity to review length and GloVe dimension)
- 86% test set accuracy
- Jupyter Notebook report in English (a lot of Python code)

GitHub repository
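As an illustration of the vocabulary building and embedding coverage checks mentioned above, the sketch below computes two coverage figures: the share of unique vocabulary tokens that have a pretrained vector, and the share of all token occurrences that do. It is a simplified stand-in, not the notebook's code: `toy_embeddings` is a tiny hypothetical dictionary used in place of real GloVe vectors, and the function names are assumptions.

```python
from collections import Counter


def build_vocab(tokenized_reviews, min_count=1):
    """Map each token to its corpus frequency, keeping tokens seen >= min_count times."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    return {tok: c for tok, c in counts.items() if c >= min_count}


def embedding_coverage(vocab, embeddings):
    """Return (share of unique tokens covered, share of token occurrences covered)."""
    known = [tok for tok in vocab if tok in embeddings]
    vocab_cov = len(known) / len(vocab)
    text_cov = sum(vocab[tok] for tok in known) / sum(vocab.values())
    return vocab_cov, text_cov


# Toy stand-in for pretrained GloVe vectors (hypothetical 2-dimensional values).
toy_embeddings = {"great": [0.1, 0.2], "movie": [0.3, 0.1], "bad": [-0.2, 0.4]}

# Made-up tokenized reviews; "zzzz" plays the role of an out-of-vocabulary token.
reviews = [["great", "movie"], ["bad", "movie", "zzzz"]]

vocab = build_vocab(reviews)
vocab_cov, text_cov = embedding_coverage(vocab, toy_embeddings)
print(f"vocabulary coverage: {vocab_cov:.2%}, text coverage: {text_cov:.2%}")
```

Tokens that fall outside the embedding vocabulary are natural candidates for further regex cleaning, since they are often typos, leftover markup, or words fused with punctuation.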