News headers classification - NLP

May 26, 2021

This project is about classifying news headers (a text mining task), but its core is preprocessing and feature engineering. You can expect:

target variable analysis
a lot of data preprocessing (cleaning with regex, stoplist building, lemmatization, all verified with word clouds)
a lot of feature engineering
- title length measures (before and after preprocessing, mean number of words, normalized)
- sentiment estimated with VADER (normalized)
- named entity recognition with SpaCy (groups of tags, most frequent tags)
- n-grams (most frequent unigrams and bigrams)
GloVe word embeddings (100-dimensional vectors)
model validation and architecture selection
final two-input neural network architecture:
- first branch based on pretrained GloVe embeddings, 1D Convolutions and 1D Max pooling
- second branch based on engineered features (65 in total, sparse) and dense layer
- concatenation to main branch with two more dense layers with regularization and dropout
- softmax classifier with 4 outputs
model evaluation and strong test set results (at least 90% accuracy for each class; test set confusion matrix below)
Jupyter Notebook in English with a ton of Python code :)
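
To give a feel for the preprocessing and feature engineering steps listed above, here is a minimal sketch using only the Python standard library. It covers regex cleaning, a stoplist, title length measures, and the most frequent unigrams and bigrams. The sample titles, the stoplist, and the function names are made up for illustration and are not taken from the notebook; lemmatization, VADER sentiment, and SpaCy NER are left out because they need external packages.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stoplist, for illustration only.
TITLES = [
    "Stocks rally as markets recover!",
    "Markets recover after tech stocks rally",
    "New phone released: is it worth it?",
]
STOPLIST = {"as", "after", "is", "it", "the", "a", "new"}


def clean(title):
    """Lowercase, replace non-letters with spaces, drop stoplist words."""
    text = re.sub(r"[^a-z\s]", " ", title.lower())
    return [tok for tok in text.split() if tok not in STOPLIST]


def ngrams(tokens, n):
    """Return consecutive n-grams as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def length_features(title, tokens):
    """Word counts before and after preprocessing."""
    return {"words_raw": len(title.split()), "words_clean": len(tokens)}


cleaned = [clean(t) for t in TITLES]
unigram_counts = Counter(tok for toks in cleaned for tok in toks)
bigram_counts = Counter(bg for toks in cleaned for bg in ngrams(toks, 2))

print(cleaned[0])
print(length_features(TITLES[0], cleaned[0]))
print(unigram_counts.most_common(4))
print(bigram_counts.most_common(2))
```

One common way to turn such counts into model inputs is to keep the top unigrams and bigrams and encode each as a binary "title contains this n-gram" column, which would give a sparse feature block like the one described above.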

GitHub repository
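
The data flow of the two-input architecture can be sketched as a single forward pass in plain Python. This is not the notebook's model, only a toy illustration with random weights. The 100-dimensional embeddings, 65 engineered features, and 4 outputs come from the description above; the sequence length, filter count, kernel width, and hidden size are assumptions. Global max pooling stands in for the 1D max pooling step, biases are omitted, and regularization and dropout are skipped because they only matter during training.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(x, weights, relu=True):
    """y = W x, optionally followed by ReLU (bias omitted for brevity)."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


def conv1d_global_max(seq, filters, width):
    """Slide each filter over windows of `width` tokens, apply ReLU, keep the max."""
    pooled = []
    for f in filters:  # f is a flat weight vector of length width * emb_dim
        acts = []
        for start in range(len(seq) - width + 1):
            window = [v for tok in seq[start:start + width] for v in tok]
            acts.append(max(0.0, sum(w * v for w, v in zip(f, window))))
        pooled.append(max(acts))
    return pooled


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


EMB_DIM, N_FEATURES, N_CLASSES = 100, 65, 4          # from the description
SEQ_LEN, N_FILTERS, WIDTH, HIDDEN = 12, 8, 3, 16     # assumed toy sizes

title_embeddings = rand_matrix(SEQ_LEN, EMB_DIM)     # stand-in for GloVe lookups
engineered = [0.0] * N_FEATURES                      # sparse engineered features
engineered[3] = 1.0

conv_filters = rand_matrix(N_FILTERS, WIDTH * EMB_DIM)
w_feat = rand_matrix(HIDDEN, N_FEATURES)
w_main1 = rand_matrix(HIDDEN, N_FILTERS + HIDDEN)
w_main2 = rand_matrix(HIDDEN, HIDDEN)
w_out = rand_matrix(N_CLASSES, HIDDEN)

text_branch = conv1d_global_max(title_embeddings, conv_filters, WIDTH)  # branch 1
feature_branch = dense(engineered, w_feat)                              # branch 2
merged = text_branch + feature_branch                                   # concatenation
hidden = dense(dense(merged, w_main1), w_main2)                         # two dense layers
probs = softmax(dense(hidden, w_out, relu=False))                       # 4-class softmax

print(len(merged), probs)
```

In a deep learning framework the same shape would typically be expressed as a model with two inputs whose branches are concatenated before the final dense layers.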