A Summary of Text Feature Engineering


Feature Engineering Methods for Text Data

Text data comes in the form of words, phrases, sentences and documents. A collection of documents is called a corpus.

  1. Pre-processing: cleaning up the text data. Typical steps:
    1.1. Removing tags, such as HTML tags
    1.2. Removing accented characters, e.g. converting é to e
    1.3. Removing special characters, such as punctuation
    1.4. Stemming and lemmatization
    1.5. Removing stopwords
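
The clean-up steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the `STOPWORDS` set and the `naive_stem` helper are hypothetical stand-ins for a full stopword list and a real stemmer or lemmatizer (for example from NLTK or spaCy), and the function names are chosen here only for the example.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list; real projects use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "and", "to"}

def strip_tags(text):
    # Drop anything that looks like an HTML tag, then decode entities such as &amp;.
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def strip_accents(text):
    # Decompose characters (é becomes e plus a combining accent), then drop non-ASCII marks.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def strip_special(text):
    # Keep only ASCII letters, digits and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)

def naive_stem(token):
    # Crude suffix stripping: an assumption-laden stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = strip_special(strip_accents(strip_tags(text)))
    tokens = text.lower().split()
    # Stopwords are filtered on the original token, then the survivors are stemmed.
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The café is serving crêpes!</p>"))
# ['cafe', 'serv', 'crepe']
```

Stopwords are filtered before stemming here so that a stemmed form never has to match the stopword list; the opposite order is also common, provided the list is adapted to the stemmer's output.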

  2. Processing: converting text to numbers. Common approaches:
    2.1. Count-based: Bag-of-Words (N-grams)
    2.2. Term-frequency-based: TF-IDF
    2.3. Word embeddings: word2vec
    2.4. Topic modelling: LDA (Latent Dirichlet Allocation)
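
The first two approaches are simple enough to sketch in plain Python. The example below assumes documents are already tokenized into lists of words, and it uses the textbook unsmoothed weighting tf × log(N / df); libraries often apply smoothed or normalized variants, so their numbers will differ. The function names are illustrative only. Word embeddings and topic models are learned from data and are usually taken from a library such as gensim or scikit-learn, so they are not sketched here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Contiguous windows of n tokens, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(docs, n=1):
    # One Counter of n-gram counts per tokenized document.
    return [Counter(ngrams(doc, n)) for doc in docs]

def tf_idf(docs):
    counts = bag_of_ngrams(docs, 1)
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        # Term frequency (share of the document) times unsmoothed inverse document frequency.
        weights.append({t: (k / total) * math.log(n_docs / df[t]) for t, k in c.items()})
    return weights

docs = [["the", "sky", "is", "blue"], ["the", "sun", "is", "bright"]]
print(bag_of_ngrams(docs, 2)[0])  # bigram counts for the first document
print(tf_idf(docs)[0])            # per-term TF-IDF weights for the first document
```

In this toy corpus, terms that occur in every document, such as "the", get a weight of zero, while terms unique to one document, such as "sky", get a positive weight. This is the property that distinguishes TF-IDF from raw counts.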