Univariate Text Summary

Example of a univariate EDA summary for a text variable.

The text summary computes the following:

Histogram of # of tokens / document
Histogram of # of characters / document
Boxplot of # of unique observations of each document
Countplots for the most common unigrams, bigrams, and trigrams
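
To make the quantities above concrete, here is a minimal sketch of how per-document token counts, character counts, and n-gram frequencies could be computed by hand. It is not intedact's implementation: the toy documents, the `ngrams` helper, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter

# Toy documents standing in for a text column (hypothetical data).
docs = [
    "the company failed to protect personal data",
    "the company failed to report the breach",
    "no legal basis for data processing",
]


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Whitespace splitting is a simplification; a real tokenizer behaves differently.
tokenized = [doc.split() for doc in docs]

# The values that would feed the two histograms.
tokens_per_doc = [len(tokens) for tokens in tokenized]
chars_per_doc = [len(doc) for doc in docs]

# The values that would feed a bigram countplot.
bigram_counts = Counter(bg for tokens in tokenized for bg in ngrams(tokens, 2))

print(tokens_per_doc)  # [7, 7, 6]
print(chars_per_doc)
print(bigram_counts.most_common(3))
```

Passing 1 or 3 as the second argument to `ngrams` gives the unigram and trigram counts in the same way.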

import nltk
import pandas as pd
import plotly

import intedact

nltk.download("punkt")
nltk.download("stopwords")

Out:

[nltk_data] Downloading package punkt to /Users/mboggess/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mboggess/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

Here we look at the text summary for the summary column of a GDPR violations dataset.

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv",
    sep="\t",
)

fig = intedact.text_summary(data, "summary", fig_width=700)
plotly.io.show(fig)

By default, the summary applies several text cleaning steps: lowercasing, removing punctuation, and removing stop words. Each of these can be turned off.

fig = intedact.text_summary(
    data,
    "summary",
    remove_stop=False,
    remove_punct=False,
    lower_case=False,
    fig_width=700,
)
plotly.io.show(fig)
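
For intuition about what the three flags control, the sketch below applies the same kinds of cleaning steps to a single string. It is only an illustration under stated assumptions: the `clean` function and the tiny `STOP_WORDS` set are hypothetical, and intedact's actual cleaning may differ in order and detail.

```python
import string

# A tiny hand-picked stop word set, for illustration only (an assumption);
# the nltk stopwords corpus downloaded earlier is far larger.
STOP_WORDS = {"the", "to", "of", "a", "and", "for"}


def clean(text, lower_case=True, remove_punct=True, remove_stop=True):
    """Tokenize text, applying each cleaning step unless it is switched off."""
    if lower_case:
        text = text.lower()
    if remove_punct:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stop:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


sample = "The company failed to protect personal data, the regulator said."

# All cleaning steps on: stop words and punctuation disappear.
print(clean(sample))

# All cleaning steps off: the raw whitespace tokens are kept as they are.
print(clean(sample, lower_case=False, remove_punct=False, remove_stop=False))
```

With cleaning on, the n-gram counts are dominated by content words; with it off, frequent function words and punctuation-attached tokens such as "data," appear as distinct entries.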

