text_univariate_summary

text_univariate_summary(data: DataFrame, column: str, fig_height: int = 6, fig_width: int = 18, fontsize: int = 15, color_palette: Optional[str] = None, top_ngrams: int = 10, compute_ngrams: bool = True, remove_punct: bool = True, remove_stop: bool = True, lower_case: bool = True, interactive: bool = False) → Tuple[DataFrame, Figure]

Creates a univariate EDA summary for a text column in a pandas DataFrame. Only English text is currently supported.

For the provided column, it produces:
  • histograms of token and character counts across entries

  • boxplot of document frequencies

  • countplots with top unigrams, bigrams, and trigrams
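
As a rough illustration of the quantities behind the histograms and the boxplot, the sketch below computes per-entry token counts, per-entry character counts, and document frequencies using only the standard library. It is not the function's implementation: the example documents and the whitespace tokenization are assumptions made for the sketch.

```python
from collections import Counter

# Hypothetical example documents; not taken from the library.
docs = [
    "the cat sat on the mat",
    "the dog barked",
    "a cat and a dog",
]

# Whitespace tokenization is an assumption made for this sketch.
tokenized = [doc.split() for doc in docs]

# Per-entry counts: the values a histogram would be drawn over.
token_counts = [len(tokens) for tokens in tokenized]
char_counts = [len(doc) for doc in docs]

# Document frequency: the number of entries in which a token
# appears at least once (hence the set() around each entry).
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

print(token_counts)
print(char_counts)
print(doc_freq["cat"])
```

Deduplicating tokens within each entry is what separates a document frequency from a raw term count: "the" occurs twice in the first entry but contributes only one to its document frequency.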

Parameters
  • data – Dataset to perform EDA on

  • column – A string matching a column in the data

  • fig_height – Height of the plot in inches

  • fig_width – Width of the plot in inches

  • fontsize – Font size of axis and tick labels

  • color_palette – Seaborn color palette to use

  • top_ngrams – Maximum number of most frequent ngrams to plot for each of unigrams, bigrams, and trigrams

  • compute_ngrams – Whether to compute and display most common ngrams

  • remove_punct – Whether to remove punctuation during tokenization

  • remove_stop – Whether to remove stop words during tokenization

  • lower_case – Whether to lower case text for tokenization

  • interactive – Whether to display figures and tables in a Jupyter notebook for interactive use
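
To make the tokenization flags more concrete, here is a minimal standard-library sketch of how options like remove_punct, remove_stop, lower_case, and top_ngrams could interact when counting ngrams. The helper names tokenize and count_top_ngrams, the tiny stop word set, and the whitespace splitting are all hypothetical and chosen for illustration. The function's actual tokenizer and stop word list may differ.

```python
import string
from collections import Counter

# Tiny illustrative stop word set; an assumption, not the library's list.
STOP_WORDS = {"the", "a", "an", "and", "on", "of", "is"}


def tokenize(text, remove_punct=True, remove_stop=True, lower_case=True):
    """Split text into tokens, applying each optional cleaning step."""
    if lower_case:
        text = text.lower()
    if remove_punct:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stop:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


def count_top_ngrams(docs, n, top=10, **tokenize_kwargs):
    """Return up to `top` most frequent n-grams as (tuple, count) pairs."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc, **tokenize_kwargs)
        # Zipping n shifted copies of the token list yields sliding windows.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top)


docs = ["The cat sat on the mat.", "The cat sat, and the dog barked!"]
print(count_top_ngrams(docs, n=1, top=2))
print(count_top_ngrams(docs, n=2, top=1))
print(count_top_ngrams(docs, n=2, top=1, remove_stop=False))
```

In this sketch, turning off stop word removal changes which bigrams rank highest, because pairs such as ("the", "cat") are counted again. This is why the cleaning flags matter when reading the ngram plots.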

Returns

Tuple containing the summary stats DataFrame and the matplotlib Figure drawn