TF-IDF: Term Frequency-Inverse Document Frequency Analysis
├── Introduction
│   └── What is TF-IDF?
├── Setting Up the Environment
│   ├── Importing Libraries
│   └── Generating Sample Text Data
├── Implementing TF-IDF
│   ├── Data Preparation
│   ├── Applying TF-IDF
│   └── Understanding the TF-IDF Matrix
├── Visualization
│   └── Visualizing TF-IDF Scores
└── Conclusion
    └── Advantages and Applications

1. Introduction

What is TF-IDF?

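TF-IDF is a statistical measure of how important a word is to one document relative to the rest of a corpus. A word's score rises with how often it appears in the document (term frequency) and falls with the number of documents in the corpus that contain it (inverse document frequency). It is widely used in information retrieval and text mining.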
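
To make the definition concrete, here is a minimal sketch of the textbook formula, tf(t, d) × log(N / df(t)), computed by hand. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration. scikit-learn's `TfidfVectorizer`, used later on this page, applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Toy corpus, already tokenized (hypothetical data for illustration).
corpus = [
    ["the", "dog", "barked"],
    ["the", "cat", "slept"],
    ["the", "dog", "slept"],
]

def tf(term, doc):
    """Fraction of the document's tokens that equal `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df): N documents in total, df of them contain `term`.

    Assumes `term` occurs in at least one document (otherwise df is 0).
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Textbook TF-IDF score of `term` in `doc` relative to `docs`."""
    return tf(term, doc) * idf(term, docs)

# Score three terms of the first document.
for term in ["the", "dog", "barked"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In this sketch "the" scores zero because it occurs in every document, while "barked", which occurs in only one, scores highest.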

2. Setting Up the Environment

Importing Libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Generating Sample Text Data

# Sample text data
documents = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog slept under the veranda.",
    "John and Mary went to the market to buy bread and jam.",
    "The lazy dog woke up and chased the quick brown fox."
]

3. Implementing TF-IDF

Data Preparation

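Text is usually cleaned and normalized before the TF-IDF transformation. With its default settings, TfidfVectorizer already lowercases the text and splits it into tokens of two or more alphanumeric characters, so punctuation and single-character words are dropped without extra preprocessing.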
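
If explicit cleaning is wanted before vectorizing, a minimal sketch could look like the following. The helper `clean_text` is hypothetical and assumes plain English text: its character class would also strip accented letters, so it is not suitable for other languages as written.

```python
import re

# Two of the sample documents from above.
documents = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog slept under the veranda.",
]

def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned_documents = [clean_text(doc) for doc in documents]
print(cleaned_documents)
```

The cleaned list can be passed to `fit_transform` in place of the raw documents.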

Applying TF-IDF

# Initializing TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fitting and transforming the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Creating a DataFrame for the TF-IDF matrix
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

Understanding the TF-IDF Matrix

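Each row of the TF-IDF matrix corresponds to a document and each column to a term in the vocabulary. A higher value means the term carries more weight in that document, and a zero means the term does not occur there. Terms that occur in many documents receive a lower inverse document frequency than terms confined to a single document.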

4. Visualization

Visualizing TF-IDF Scores