TG CHEN
Create value from data, maximise the value of data

Word Embedding in FinTech Literature
Project type
Personal Project
Date
Jul-Sep 2023
Location
Nottingham, UK
Skills
- Python (sklearn, NLTK, gensim, matplotlib): main coding of PDF preprocessing, NLP modelling, data analysis
- Microsoft Excel: generating plots
- Microsoft PowerPoint: presentation
- draw.io: generating plots
This is a summer project for the MSc dissertation,
"Comparative Analysis of Word Embedding Methods for Citation Sentence Matching in FinTech Literature"
- Aim: finding the best text embedding method for fintech literature
- Data collection: collecting 3,500 fintech-related scientific articles
- Data structuring: converting PDFs to plain texts and matching citation sentences to the reference article
- Data preprocessing: handling special characters (e.g. ligatures), applying tokenisation, removing stop words, and applying lemmatisation to construct our own fintech literature dataset
- Evaluation: 8 text embedding methods are compared, using the similarity each one assigns between a citation sentence and its reference article as the measure of performance
- Methods: TF-IDF, LSA, word2vec, GloVe, FastText, ELMo, USE, and BERT
- Output: a 10,000-word dissertation and a 15-minute presentation
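
The preprocessing and similarity steps listed above can be illustrated with a minimal sketch. It is not the project's actual code: it uses only the Python standard library so that it runs on its own, whereas the project relied on sklearn, NLTK and gensim. The stop-word list, the sample texts and the function names (preprocess, tfidf_vectors, cosine, best_match) are assumptions made for illustration, lemmatisation is left out, and only TF-IDF (the simplest of the 8 methods) is shown.

```python
import math
import re
import unicodedata
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on", "that", "by"}


def preprocess(text):
    """Expand ligatures, lowercase, tokenise, and drop stop words."""
    # NFKC normalisation maps compatibility characters such as the "fi" ligature
    # (U+FB01) to their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per tokenised document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency, so no term gets a zero or negative weight.
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(citation, references):
    """Return (index of the most similar reference, list of similarity scores)."""
    docs = [preprocess(citation)] + [preprocess(r) for r in references]
    vectors = tfidf_vectors(docs)
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up sample texts; the second reference contains a ligature in "financial".
references = [
    "Mobile payment adoption in emerging markets",
    "Blockchain consensus protocols for \ufb01nancial settlement",
]
citation = "Prior work studies blockchain protocols for financial settlement"

best_index, scores = best_match(citation, references)
print(best_index, scores)
```

In this toy setup a method does well if the true reference article receives the highest score among the candidates. Swapping tfidf_vectors for another embedding while keeping cosine and best_match unchanged is one way such a comparison could be organised.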





