{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Univariate Text Summary\n\nExample of univariate eda summary for a text variable\n\nThe text summary computes the following:\n\n- Histogram of # of tokens / document\n- Histogram of # of characters / document\n- Boxplot of # of unique observations of each document\n- Countplots for the most common unigrams, bigrams, and trigams\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import warnings\n\nwarnings.filterwarnings(\"ignore\")\n\nimport pandas as pd\nimport intedact\nimport nltk\nimport plotly\n\nnltk.download(\"punkt\")\nnltk.download(\"stopwords\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here we take a look at the summaries for GDPR violations.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "data = pd.read_csv(\n    \"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv\",\n    sep=\"\\t\",\n)\n\nfig = intedact.text_summary(data, \"summary\", fig_width=700)\nplotly.io.show(fig)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "By default, the summary does a lot of text cleaning: removing punctuation and stop words, lower casing. We can\nturn all of these off.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig = intedact.text_summary(\n    data,\n    \"summary\",\n    remove_stop=False,\n    remove_punct=False,\n    lower_case=False,\n    fig_width=700\n)\nplotly.io.show(fig)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}