text_univariate_summary

text_univariate_summary(data: DataFrame, column: str, fig_height: int = 6, fig_width: int = 18, fontsize: int = 15, color_palette: Optional[str] = None, top_ngrams: int = 10, compute_ngrams: bool = True, remove_punct: bool = True, remove_stop: bool = True, lower_case: bool = True, interactive: bool = False) → Tuple[DataFrame, Figure]

Creates a univariate EDA summary for a text column in a pandas DataFrame. Only English text is currently supported.

For the provided column, the function produces:

histograms of token and character counts across entries
boxplot of document frequencies
countplots with top unigrams, bigrams, and trigrams
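
The quantities behind these plots can be approximated with the standard library alone. The sketch below is not the function's actual implementation: whitespace tokenization, the `ngrams` helper, the sample `docs`, and the reading of "document frequency" as the number of entries a token appears in are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample entries standing in for one text column.
docs = ["the cat sat", "the cat ran away", "a dog sat"]

# Assumption: plain whitespace tokenization, no cleaning.
tokenized = [doc.split() for doc in docs]

# Per-entry counts, the raw values a histogram would bin.
token_counts = [len(tokens) for tokens in tokenized]
char_counts = [len(doc) for doc in docs]

# Assumed meaning of document frequency: how many entries contain each token.
doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))

# Most frequent bigram across all entries.
bigram_counts = Counter(bg for tokens in tokenized for bg in ngrams(tokens, 2))
top_bigram = bigram_counts.most_common(1)
```

Swapping the `2` passed to `ngrams` for `1` or `3` gives the unigram and trigram counts in the same way.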

Parameters

data – Dataset to perform EDA on
column – A string matching a column in the data
fig_height – Height of the plot in inches
fig_width – Width of the plot in inches
fontsize – Font size of axis and tick labels
color_palette – Seaborn color palette to use
top_ngrams – Maximum number of most frequent ngrams to plot for each of unigrams, bigrams, and trigrams
compute_ngrams – Whether to compute and display most common ngrams
remove_punct – Whether to remove punctuation during tokenization
remove_stop – Whether to remove stop words during tokenization
lower_case – Whether to lowercase text for tokenization
interactive – Whether to display figures and tables in a Jupyter notebook for interactive use
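
To make the three tokenization flags concrete, here is a minimal sketch of how such switches could change the resulting tokens. It only mirrors the parameter names; `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical, and the library's real tokenizer and stop word list may well behave differently.

```python
import string

# Tiny illustrative stop word list; an assumption, not the library's list.
STOP_WORDS = {"the", "a", "an", "is", "on", "and", "of"}


def simple_tokenize(text, remove_punct=True, remove_stop=True, lower_case=True):
    """Split text on whitespace, applying each cleaning step only if enabled."""
    if lower_case:
        text = text.lower()
    if remove_punct:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stop:
        # Compare case-insensitively so "The" is dropped even without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


cleaned = simple_tokenize("The cat is on the mat.")
raw = simple_tokenize(
    "The cat is on the mat.",
    remove_punct=False,
    remove_stop=False,
    lower_case=False,
)
```

With every flag enabled only the content words survive, while disabling all three leaves the original casing, stop words, and trailing punctuation in place, which in turn changes which ngrams rank highest.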

Returns

Tuple containing the summary stats DataFrame and the matplotlib Figure drawn, in the order given by the signature