News headers classification - NLP
This project classifies news headers (a text mining task), but its core is preprocessing and feature engineering. You can expect:
- target variable analysis
- a lot of data preprocessing (cleaning with regex, stoplist building, lemmatization, all verified with word clouds)
- a lot of feature engineering:
  - title length measures (before and after preprocessing, mean number of words, normalized)
  - sentiment estimated with VADER (normalized)
  - named entity recognition with spaCy (groups of tags, most frequent tags)
  - n-grams (most frequent unigrams and bigrams)
  - GloVe word embeddings (100-dimensional vectors)
- model validation and architecture selection
- final two-input neural network architecture:
  - first branch based on pretrained GloVe embeddings, 1D convolutions and 1D max pooling
  - second branch based on engineered features (65 in total, sparse) and a dense layer
  - concatenation into a main branch with two more dense layers, with regularization and dropout
  - softmax classifier with 4 outputs
- model evaluation with strong test set results (at least 90% accuracy for each class; test set confusion matrix below)
- Jupyter Notebook in English with a ton of Python code :)
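
To give a feel for the preprocessing and feature engineering steps listed above, here is a minimal sketch using only the Python standard library. It is not the notebook's code: the sample titles, the tiny stoplist, and the helper names (`clean`, `min_max`, `top_ngrams`) are all hypothetical, and lemmatization, VADER, spaCy and GloVe are left out to keep it short. It shows regex cleaning, stoplist filtering, normalized title length before and after preprocessing, and the most frequent unigrams and bigrams.

```python
import re
from collections import Counter

# Hypothetical sample titles and a tiny illustrative stoplist
# (assumptions for this sketch, not the project's data).
TITLES = [
    "Stocks rally as the Fed holds rates!",
    "New phone launch: what to expect in 2024",
    "Fed holds rates, stocks rally again",
]
STOPLIST = {"the", "as", "to", "in", "what", "a", "of"}


def clean(title):
    """Lowercase, replace non-letter characters with spaces, drop stoplist words."""
    tokens = re.sub(r"[^a-z\s]", " ", title.lower()).split()
    return [t for t in tokens if t not in STOPLIST]


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def top_ngrams(token_lists, n, k):
    """Return the k most frequent n-grams (as tuples) with their counts."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for tokens in token_lists
        for i in range(len(tokens) - n + 1)
    )
    return counts.most_common(k)


cleaned = [clean(t) for t in TITLES]

# Title length in words, before and after preprocessing.
words_before = [len(t.split()) for t in TITLES]
words_after = [len(tokens) for tokens in cleaned]

features = {
    "words_before_norm": min_max(words_before),
    "words_after_norm": min_max(words_after),
}

top_unigrams = top_ngrams(cleaned, 1, 5)
top_bigrams = top_ngrams(cleaned, 2, 3)
```

In a fuller pipeline, indicator columns for the most frequent n-grams and entity tags could be appended to `features`, which is one plausible way to end up with a sparse feature vector like the one described above.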
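
The two-input architecture described above can be sketched with the Keras functional API. This is an illustration under stated assumptions, not the notebook's actual model: only the embedding dimension (100), the number of engineered features (65) and the number of classes (4) come from this README. The vocabulary size, title length, filter counts, layer widths, L2 strength and dropout rate are hypothetical placeholders, and the embedding layer is left randomly initialized where the pretrained GloVe matrix would normally be loaded.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Values taken from the README.
EMBED_DIM = 100     # GloVe vector size
N_FEATURES = 65     # engineered features
N_CLASSES = 4       # softmax outputs

# Hypothetical values, assumed only for this sketch.
VOCAB_SIZE = 5000
MAX_LEN = 20


def build_model():
    # Branch 1: token ids -> frozen embeddings -> 1D convolution -> 1D max pooling.
    # In practice the GloVe matrix would be loaded into this embedding layer.
    text_in = keras.Input(shape=(MAX_LEN,), name="tokens")
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, trainable=False)(text_in)
    x = layers.Conv1D(64, 3, activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)

    # Branch 2: engineered features -> dense layer.
    feat_in = keras.Input(shape=(N_FEATURES,), name="features")
    f = layers.Dense(32, activation="relu")(feat_in)

    # Main branch: concatenate, then two regularized dense layers with dropout.
    merged = layers.concatenate([x, f])
    for units in (64, 32):
        merged = layers.Dense(
            units,
            activation="relu",
            kernel_regularizer=regularizers.l2(1e-4),
        )(merged)
        merged = layers.Dropout(0.3)(merged)

    out = layers.Dense(N_CLASSES, activation="softmax")(merged)
    return keras.Model(inputs=[text_in, feat_in], outputs=out)


model = build_model()
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Forward pass on a dummy batch of two titles to check the output shape.
dummy_tokens = np.zeros((2, MAX_LEN), dtype="int32")
dummy_features = np.zeros((2, N_FEATURES), dtype="float32")
probs = model.predict([dummy_tokens, dummy_features], verbose=0)
```

Keeping the engineered features in their own branch lets the dense layer learn a compact representation of the sparse inputs before they are mixed with the convolutional text features.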