Welcome to intedact’s documentation!

intedact: Interactive EDA

License: MIT

Interactive EDA for pandas DataFrames directly in your Jupyter notebook. intedact makes common, standardized EDA visual summaries available in an interactive manner with one function call. Using ipywidgets, you can quickly cycle through different variables or combinations of variables and produce useful visual summaries when exploring the dataset. Each summary will have additional plot parameters you can tweak to adjust the visualizations to work for your dataset.

Full documentation at intedact.readthedocs.io

Getting Started

Installation

Install via pip:

pip install intedact

Download the following nltk resources for the ngram text summaries.

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Univariate EDA

Univariate EDA refers to the process of visualizing and summarizing a single variable.

For interactive univariate EDA, import the univariate_eda_interact function in a Jupyter notebook and pass in a pandas DataFrame:

from intedact import univariate_eda_interact
univariate_eda_interact(
    data, notes_file="optional_file_to_save_notes_to.json"
)

At the top level, you select the column and the summary type to display for that column. To explore the full dataset, toggle through each of the column names. Currently supported summary types:

  • categorical: Summarize a categorical or low cardinality numerical column

  • numeric: Summarize a high cardinality numerical column

  • datetime: Summarize a datetime column

  • text: Summarize a free form text column

  • collection: Summarize a column with collections of values (e.g. lists, tuples, or sets)

  • url: Summarize a column containing URLs

For each column, you can then adjust the parameters of the chosen summary type to fit your particular dataset. The summaries try to set good default parameters automatically, but sometimes you need to make adjustments to get the full picture.

See the documentation for examples of how to statically call the individual univariate summary functions.

Bivariate EDA

Bivariate EDA refers to the process of visualizing and summarizing a pair of variables.

As with univariate EDA, import the bivariate_eda_interact function in a Jupyter notebook and pass in a pandas DataFrame:

from intedact import bivariate_eda_interact
bivariate_eda_interact(
    data, notes_file="optional_file_to_save_notes_to.json"
)

At the top level, you select a pair of columns to display (the first as the independent variable and the second as the dependent variable). Currently supported summary types:

  • categorical-categorical: Summarize a pair of categorical columns

  • numeric-categorical: Summarize an independent numeric variable against a dependent categorical variable

  • categorical-numeric: Summarize an independent categorical variable against a dependent numeric variable

  • numeric-numeric: Summarize a pair of numeric columns

Design Philosophy

The motivation for intedact comes from the following observations:

  1. There is a standard set of visualizations that should always be applied to individual variables, and to combinations of variables, depending on their type when performing EDA. For example, it is always good to visualize the distribution of a numerical variable using a histogram. intedact’s goal is to save you from having to constantly copy-paste this code across columns, projects, etc.

  2. These visualizations often need some degree of adjustment to get the information you need. For example, really skewed variables with outliers might need some outlier filtering and/or a log transform to actually be able to visualize the histogram properly. intedact’s goal is to give you additional control over the visualization with interactive widgets that you can repeatedly adjust until you get the visualization you need.

Given the above, intedact tries to produce visualizations that give you the visual understanding you are seeking for 95% of cases when you pass in the defaults. For the other 5%, we give you additional parameters you can tweak via the widgets so you can still get the insights you need without having to leave the interface.

intedact is not a single-click EDA summary generation tool. Many of those exist and we recommend pairing them with intedact (pandas-profiling is a great one, for example). Where these tools fall short is that they don’t focus on the visualizations or give you the power to adjust them to your dataset when the defaults don’t suffice. Use intedact when you want to dig deeper and really visually understand a variable or the relationship between variables.

Univariate Summary Examples

Univariate URL Summary

Example of univariate EDA summary for a URL variable.

The URL summary computes the following:

  • Countplot for the individual unique urls

  • Countplot for the domains of the urls

  • Countplot for the domain suffixes of the urls

  • Countplot for the file types of the urls

import pandas as pd
import plotly

import intedact

Here we take a look at the source URLs for records of countries’ GDPR violations.

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv",
    sep="\t",
)

fig = intedact.url_summary(data, "source", fig_width=700)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 3.239 seconds)
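
To make the four countplots listed at the top of this example concrete, here is a minimal standard-library sketch of the kind of counting involved. It is not intedact's implementation: the helper url_parts, the naive "text after the last dot" suffix rule, and the sample URLs are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_parts(url):
    """Split a URL into (domain, domain suffix, file type) with naive rules."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    # Naive suffix: text after the last dot. Multi-part suffixes such as
    # "co.uk" would need a public-suffix list to be handled properly.
    suffix = domain.rsplit(".", 1)[-1] if "." in domain else ""
    # File type: extension of the last path segment, or "" if there is none.
    file_type = PurePosixPath(parsed.path).suffix.lstrip(".").lower()
    return domain, suffix, file_type


# Hypothetical sample URLs, not taken from the GDPR dataset.
urls = [
    "https://example.com/reports/fine.pdf",
    "https://example.com/reports/fine.pdf",
    "https://news.example.org/article.html",
    "https://example.com/press",
]

url_counts = Counter(urls)                                  # unique urls
domain_counts = Counter(url_parts(u)[0] for u in urls)      # domains
suffix_counts = Counter(url_parts(u)[1] for u in urls)      # domain suffixes
file_type_counts = Counter(url_parts(u)[2] for u in urls)   # file types

print(domain_counts.most_common())
```

Each Counter corresponds to one countplot; URLs without a file extension are grouped under the empty string in this sketch.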

Gallery generated by Sphinx-Gallery

Univariate Numeric Summary

Example of univariate eda summary for a numeric variable.

The numeric summary computes the following:

  • A histogram

  • A boxplot

import pandas as pd
import plotly

import intedact

Here we take a look at some GDPR violation prices and showcase some parameters:

  • log transformation

  • outlier filtering

  • custom bin count

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv",
    sep="\t",
)
fig = intedact.numeric_summary(
    data, "price", bins=20, transform="log", upper_quantile=0.95, fig_width=700
)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 0.196 seconds)
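
The three adjustments showcased above (outlier filtering, log transformation, and a custom bin count) can be sketched with the standard library alone. This is an illustrative approximation, not the code behind numeric_summary: the nearest-rank quantile rule, the helper names, and the sample values are assumptions.

```python
import math
from collections import Counter


def filter_upper_quantile(values, upper_quantile):
    """Keep values at or below the given quantile (simple nearest-rank rule)."""
    ordered = sorted(values)
    # Index of the cutoff value under a nearest-rank definition.
    cutoff_index = max(0, math.ceil(upper_quantile * len(ordered)) - 1)
    cutoff = ordered[cutoff_index]
    return [v for v in values if v <= cutoff]


def histogram_counts(values, bins):
    """Count values in equal-width bins spanning [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width for constant data
    counts = Counter()
    for v in values:
        # Clamp so the maximum falls in the last bin instead of a new one.
        counts[min(int((v - low) / width), bins - 1)] += 1
    return [counts[i] for i in range(bins)]


# Hypothetical skewed sample standing in for a column of fines.
prices = [100, 200, 500, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 5_000_000]

kept = filter_upper_quantile(prices, 0.9)   # drops the single extreme value
logged = [math.log10(v) for v in kept]      # log transform requires v > 0
print(histogram_counts(prices, bins=3))     # raw data: nearly all in one bin
print(histogram_counts(logged, bins=3))     # filtered and logged: spread out
```

Comparing the two printed histograms shows why skewed data often needs both steps before a histogram becomes readable.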


Univariate Collection Summary

Example of univariate eda summary for a collection variable (lists, tuples, sets, etc.).

The collection summary computes the following:

  • Three separate countplots:

    - Counts for all the unique collections

    - Counts for all the unique entries

    - Counts for the number of entries in each collection

import pandas as pd
import plotly

import intedact

Here we take a look at which GDPR articles countries violated. We first have to process the column so that it holds lists rather than strings. One can also choose whether to sort the values (ignoring the order in which they are listed) and whether to remove duplicates (considering only unique entries).

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv",
    sep="\t",
)
data["article_violated"] = data["article_violated"].apply(lambda x: x.split("|"))

fig = intedact.collection_summary(data, "article_violated", fig_width=700)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 0.208 seconds)
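
As a rough illustration of the three counts described above, and of the sort and de-duplicate options, the following standard-library sketch counts collections, entries, and collection sizes. The normalize helper, its sort and dedupe arguments, and the sample strings are hypothetical; they are not intedact parameters.

```python
from collections import Counter

# Hypothetical "article_violated"-style strings, split on "|" as above.
raw = ["Art. 5|Art. 6", "Art. 6|Art. 5", "Art. 32", "Art. 5|Art. 5|Art. 6"]
collections_ = [value.split("|") for value in raw]


def normalize(items, sort=True, dedupe=True):
    """Optionally drop duplicates and ignore listing order."""
    if dedupe:
        items = list(dict.fromkeys(items))  # keeps first-seen order
    if sort:
        items = sorted(items)
    return tuple(items)  # tuples are hashable, so Counter can count them


normalized = [normalize(items) for items in collections_]

# Counts for all the unique collections.
collection_counts = Counter(normalized)
# Counts for all the unique entries.
entry_counts = Counter(entry for items in normalized for entry in items)
# Counts for the number of entries in each collection.
length_counts = Counter(len(items) for items in normalized)

print(collection_counts.most_common(1))
```

With sorting and de-duplication enabled, the first, second, and fourth rows collapse into the same collection; with both disabled they would be counted separately.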


Univariate Categorical Summary

Example of univariate eda summary for a categorical variable.

The categorical summary computes the following:

  • A countplot with counts and percentages by level of the categorical

  • A table with summary statistics

import pandas as pd
import plotly

import intedact

For our first example, we plot the names of countries that have had GDPR violations. By default, the plot will try to order and orient the bars appropriately. Here the levels are ordered by descending count, and the plot is flipped horizontally because of the number of levels in the variable.

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv",
    sep="\t",
)
fig = intedact.categorical_summary(data, "name", fig_width=700)
plotly.io.show(fig)

We can do additional things such as condensing the extra levels into an “Other” level, adding a bar for missing values, and changing the sort order to alphabetical.

fig = intedact.categorical_summary(
    data,
    "name",
    include_missing=True,
    order="sorted",
    max_levels=5,
    fig_width=700,
)
plotly.io.show(fig)

Out:

No missing values for column: name

Total running time of the script: ( 0 minutes 0.222 seconds)
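
The idea of counting levels, reporting percentages, and folding the less frequent levels into “Other” can be sketched as follows. This is a standard-library approximation under assumed semantics (keep the most frequent max_levels levels, merge the rest); the function level_counts and the sample data are hypothetical and may differ from what categorical_summary does internally.

```python
from collections import Counter


def level_counts(values, max_levels=None):
    """Return [(level, count, percent)] sorted by descending count.

    Levels beyond the `max_levels` most frequent are merged into "Other".
    """
    ranked = Counter(values).most_common()
    if max_levels is not None and len(ranked) > max_levels:
        other = sum(count for _, count in ranked[max_levels:])
        ranked = ranked[:max_levels] + [("Other", other)]
    total = len(values)
    return [(level, count, 100 * count / total) for level, count in ranked]


# Hypothetical country names, not the real GDPR data.
names = ["Spain"] * 5 + ["Germany"] * 3 + ["Italy"] + ["France"]
for row in level_counts(names, max_levels=2):
    print(row)
```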


Univariate Text Summary

Example of univariate EDA summary for a text variable.

The text summary computes the following:

  • Histogram of # of tokens / document

  • Histogram of # of characters / document

  • Boxplot of # of unique observations of each document

  • Countplots for the most common unigrams, bigrams, and trigrams

import nltk
import pandas as pd
import plotly

import intedact

nltk.download("punkt")
nltk.download("stopwords")

Out:

[nltk_data] Downloading package punkt to /Users/mboggess/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mboggess/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

Here we take a look at the summaries for GDPR violations.

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv",
    sep="\t",
)

fig = intedact.text_summary(data, "summary", fig_width=700)
plotly.io.show(fig)

By default, the summary does a lot of text cleaning: removing punctuation, removing stop words, and lowercasing. We can turn all of these off.

fig = intedact.text_summary(
    data,
    "summary",
    remove_stop=False,
    remove_punct=False,
    lower_case=False,
    fig_width=700,
)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 0.851 seconds)
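
To show how the cleaning options affect the n-gram counts, here is a small standard-library sketch. It only mirrors the option names used above (remove_stop, remove_punct, lower_case); the tokenizer, the tiny stop word set, and the sample documents are assumptions, and the library itself relies on nltk resources rather than this code.

```python
import re
from collections import Counter

# Tiny stand-in stop word list; a real list (e.g. from nltk) is much longer.
STOP_WORDS = {"the", "a", "of", "to", "and", "for"}


def tokenize(text, lower_case=True, remove_punct=True, remove_stop=True):
    """Whitespace tokenizer with optional cleaning steps."""
    if lower_case:
        text = text.lower()
    if remove_punct:
        text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()
    if remove_stop:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


def ngram_counts(documents, n, **options):
    """Count n-grams (as tuples of tokens) across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc, **options)
        # Zipping n shifted copies of the token list yields the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical violation summaries.
docs = [
    "Failure to secure the personal data.",
    "Failure to secure data of customers.",
]
print(ngram_counts(docs, 2).most_common(1))
print(ngram_counts(docs, 2, remove_stop=False).most_common(1))
```

With stop words removed, "failure secure" becomes the most common bigram; with them kept, the bigrams include pairs such as "failure to" instead.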


Univariate Datetime Summary

Example of univariate eda summary for a datetime variable. Here we look at posting times for TidyTuesday tweets.

The datetime summary computes the following:

  • A time series plot aggregated according to the ts_freq parameter

  • Barplots showing counts by day of week, month, hour of day, day of month

import pandas as pd
import plotly

import intedact

data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/tidytuesday_tweets/data.csv"
)
data["created_at"] = pd.to_datetime(data.created_at)
fig = intedact.datetime_summary(data, "created_at", fig_width=700)
plotly.io.show(fig)

By default, the summary tries to infer reasonable units for the time series. We can change these by using time unit strings for the ts_freq parameter.

fig = intedact.datetime_summary(data, "created_at", ts_freq="1 day", fig_width=700)
plotly.io.show(fig)

Example of changing plot type, removing trend line, and removing outliers.

fig = intedact.datetime_summary(
    data,
    "created_at",
    ts_type="markers",
    trend_line="none",
    upper_quantile=0.99,
    fig_width=700,
)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 4.192 seconds)
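
The aggregations behind a daily time series and the calendar barplots can be sketched with the standard library. This is an illustration only, not how datetime_summary is implemented; the timestamps are made up and the variable names are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical timestamps standing in for the "created_at" column.
stamps = [
    datetime(2021, 3, 1, 9, 15),   # a Monday
    datetime(2021, 3, 1, 17, 40),  # the same Monday
    datetime(2021, 3, 2, 9, 5),    # a Tuesday
    datetime(2021, 4, 6, 9, 30),   # a Tuesday in the next month
]

# Time series aggregated to one point per calendar day.
per_day = Counter(ts.date() for ts in stamps)

# Counts by day of week, month, hour of day, and day of month.
by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in stamps)
by_month = Counter(ts.month for ts in stamps)
by_hour = Counter(ts.hour for ts in stamps)
by_day_of_month = Counter(ts.day for ts in stamps)

for day, count in sorted(per_day.items()):
    print(day, count)
```

Changing the aggregation unit amounts to changing the key in per_day, for example grouping on (ts.year, ts.month) for a monthly series.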


Bivariate Summary Examples

Bivariate Numeric-Categorical Summary

Example of bivariate eda summary for a numeric independent variable and a categorical dependent variable.

The summary computes the following:

  • Lineplot with fractions for each level of the categorical variable against quantiles of the numeric variable

import pandas as pd
import plotly

import intedact

Here we look at how diamond cut quality changes with carats.

data = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
)
fig = intedact.numeric_categorical_summary(
    data, "carat", "cut", num_intervals=5, fig_width=700
)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 0.692 seconds)
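
The lineplot described above plots, for each interval of the numeric variable, the fraction of rows belonging to each categorical level. A minimal standard-library sketch of that computation follows. The rank-based binning is a simplification chosen for this example, and the function name and sample data are hypothetical rather than part of intedact.

```python
from collections import Counter, defaultdict


def level_fractions_by_interval(xs, levels, num_intervals):
    """Split rows into equal-sized rank intervals of x, then return the
    fraction of each categorical level inside every interval."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    per_interval = defaultdict(Counter)
    for rank, i in enumerate(order):
        # Rank-based binning approximates quantile intervals; ties are
        # split by position here, which a real quantile cut would not do.
        interval = rank * num_intervals // len(xs)
        per_interval[interval][levels[i]] += 1
    return {
        interval: {lvl: n / sum(c.values()) for lvl, n in c.items()}
        for interval, c in sorted(per_interval.items())
    }


# Hypothetical carat/cut pairs, not the real diamonds data.
carat = [0.2, 0.3, 0.4, 0.5, 0.9, 1.0, 1.5, 2.0]
cut = ["Ideal", "Ideal", "Ideal", "Good", "Good", "Ideal", "Fair", "Fair"]
fractions = level_fractions_by_interval(carat, cut, num_intervals=2)

for interval, shares in fractions.items():
    print(interval, shares)
```

Each level then contributes one line: its share in interval 0, interval 1, and so on.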


Bivariate Categorical-Numeric Summary

Example of bivariate eda summary for a categorical independent variable and a numeric dependent variable.

The summary computes the following:

  • Overlapping histogram/kde plots of distributions by level

  • Side by side boxplots per level

import pandas as pd
import plotly

import intedact

Here we look at how diamond price changes with cut quality.

data = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
)
data["cut"] = pd.Categorical(
    data["cut"],
    categories=["Fair", "Good", "Very Good", "Premium", "Ideal"],
    ordered=True,
)
fig = intedact.categorical_numeric_summary(data, "cut", "price", fig_width=700)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 1.243 seconds)
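
The side-by-side boxplots boil down to a five-number summary per level. The sketch below computes one with the standard library's statistics.quantiles (Python 3.8+). The function boxplot_stats_by_level and the sample prices are hypothetical, and the "inclusive" quartile method is an assumption; plotting libraries may use a different quartile convention.

```python
import statistics
from collections import defaultdict


def boxplot_stats_by_level(levels, values):
    """Group values by level and compute min, quartiles, and max for each."""
    grouped = defaultdict(list)
    for level, value in zip(levels, values):
        grouped[level].append(value)
    stats = {}
    for level, group in grouped.items():
        # quantiles(n=4) returns the three cut points: Q1, median, Q3.
        q1, median, q3 = statistics.quantiles(group, n=4, method="inclusive")
        stats[level] = {
            "min": min(group),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(group),
        }
    return stats


# Hypothetical cut/price pairs, not the real diamonds data.
cuts = ["Fair"] * 5 + ["Ideal"] * 5
prices = [300, 400, 500, 600, 700, 1000, 2000, 3000, 4000, 5000]
stats = boxplot_stats_by_level(cuts, prices)

for level, summary in stats.items():
    print(level, summary)
```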


Bivariate Numeric-Numeric Summary

Example of bivariate eda summary for a pair of numeric variables.

The summary computes the following:

  • A scatterplot with trend line

  • A 2d histogram

  • Boxplots of the dependent variable against quantiles of the independent variable

import pandas as pd
import plotly

import intedact

Here we take a look at the relationship between carat and price in the diamonds dataset.

data = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
).sample(n=10000)
fig = intedact.numeric_numeric_summary(data, "carat", "price", fig_width=700)
plotly.io.show(fig)

By default, it is hard to see much since the distributions are very skewed with outliers. We can tweak the plot to actually visualize the distributions in more detail.

fig = intedact.numeric_numeric_summary(
    data,
    "carat",
    "price",
    upper_quantile1=0.98,
    hist_bins=100,
    num_intervals=10,
    opacity=0.4,
    fig_width=700,
)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 8.355 seconds)
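
To illustrate why trimming the upper tail of the independent variable can change the picture, the following standard-library sketch fits an ordinary least-squares trend line before and after dropping the largest x values. The nearest-rank cutoff, the helper names, and the data points are assumptions for illustration; the trend line drawn by numeric_numeric_summary may be computed differently.

```python
import math


def trim_upper(pairs, upper_quantile):
    """Drop (x, y) pairs whose x exceeds the given quantile (nearest rank)."""
    xs = sorted(x for x, _ in pairs)
    cutoff = xs[max(0, math.ceil(upper_quantile * len(xs)) - 1)]
    return [(x, y) for x, y in pairs if x <= cutoff]


def fit_line(pairs):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical points: four on a clean line plus one outlier in x.
points = [(1, 10), (2, 20), (3, 30), (4, 40), (50, 60)]

slope_all, intercept_all = fit_line(points)
slope_trimmed, intercept_trimmed = fit_line(trim_upper(points, 0.8))
print(slope_all, slope_trimmed)
```

The single outlier flattens the fitted slope considerably; once it is filtered out, the line recovers the trend followed by the remaining points.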


Bivariate Categorical-Categorical Summary

Example of bivariate EDA summary for a pair of categorical variables.

The summary computes the following:

  • Categorical heatmap with counts and percentages for each level combo

  • Barplot showing distribution of column2’s levels within each level of column1

  • Lineplot showing distribution of column2’s levels across each level of column1

import pandas as pd
import plotly

import intedact

Here we look at how diamond cut quality and clarity quality are related.

data = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
)
data["cut"] = pd.Categorical(
    data["cut"],
    categories=["Fair", "Good", "Very Good", "Premium", "Ideal"],
    ordered=True,
)
data["clarity"] = pd.Categorical(
    data["clarity"],
    categories=["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
    ordered=True,
)
fig = intedact.categorical_categorical_summary(
    data, "clarity", "cut", barmode="group", fig_width=700
)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 0.585 seconds)
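
The heatmap and the per-level distributions both derive from a cross-tabulation of the two columns. A minimal standard-library sketch of that table follows. It is not intedact's implementation, and the sample values and variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical clarity/cut pairs, not the real diamonds data.
clarity = ["SI1", "SI1", "VS2", "VS2", "VS2", "IF"]
cut = ["Good", "Ideal", "Ideal", "Ideal", "Fair", "Ideal"]

# Heatmap cells: count and overall percentage for each level combination.
pair_counts = Counter(zip(clarity, cut))
total = sum(pair_counts.values())
pair_percent = {pair: 100 * n / total for pair, n in pair_counts.items()}

# Distribution of column2's levels within each level of column1.
row_totals = Counter(clarity)
within_level = {
    (c1, c2): n / row_totals[c1] for (c1, c2), n in pair_counts.items()
}

for pair, share in sorted(within_level.items()):
    print(pair, round(share, 2))
```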


Univariate Summary Functions

Functions for creating univariate EDA summaries.

categorical_summary

Creates a univariate EDA summary for a provided categorical data column in a pandas DataFrame.

numeric_summary

Creates a univariate EDA summary for a high cardinality numeric data column in a pandas DataFrame.

datetime_summary

Creates a univariate EDA summary for a datetime data column in a pandas DataFrame.

text_summary

Creates a univariate EDA summary for a text variable column in a pandas DataFrame.

collection_summary

Creates a univariate EDA summary for a collections column in a pandas DataFrame.

url_summary

Creates a univariate EDA summary for a url column in a pandas DataFrame.

Bivariate Summary Functions

Functions for creating bivariate EDA summaries.

categorical_categorical_summary

Generates an EDA summary of two categorical variables.

categorical_numeric_summary

Generates an EDA summary of the relationship between a categorical variable as the independent variable and a numeric variable as the dependent variable.

numeric_categorical_summary

Generates an EDA summary of the relationship between a numeric variable as the independent variable and a categorical variable as the dependent variable.

numeric_numeric_summary

Creates a bivariate EDA summary for two numeric data columns in a pandas DataFrame.
