Data analysis with one command line

An in-depth analysis can take days or even months. However, with a single Python command, we can get a global view of our data.

Stefan Daniel • 6 min read

Before manipulating data in the Big Data world or building predictive models to solve our problems, we have to check what data we are using and what its potential is, because bad data produces bad solutions.

GIGO (garbage in, garbage out) concept

There are many techniques for data cleansing, feature removal, statistical analysis, and descriptive analysis, among others. However, the pandas profiling package gives us a quick first exploratory data analysis that helps us decide where to focus the rest of the work.

Let's get started!

Pandas Profiling Execution

First, we have to install the package.

pip install pandas-profiling

For this example, we will use the New York Airbnb dataset which is available in the following link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

We create a pandas dataframe from the .csv dataset.

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('AB_NYC_2019.csv')

Creating the report is very simple: use the command below.

profile = ProfileReport(df, title="New York Report")

To view the report, you can display it in Jupyter or save it directly as an HTML document.

# Display the report as interactive widgets in a Jupyter notebook
profile.to_widgets()

# Save the report as a standalone HTML file
profile.to_file("your_report.html")

What statistical metrics does the report provide?

Overview

It gives an overview of the dataframe and flags common warnings about our data (for example, missing values, high cardinality, or highly correlated variables) that could damage our model or analysis.

Overview Section
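To get a feel for where the overview figures come from, here is a minimal sketch that computes a few similar dataset-level numbers with plain pandas. The `sample` dataframe and the `overview` helper are hypothetical and only for illustration; this is not how the package computes its report internally.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Airbnb data; the values are invented.
sample = pd.DataFrame({
    "name": ["Loft", "Studio", None, "Loft"],
    "price": [150, 80, 200, 150],
    "reviews_per_month": [0.5, None, 1.2, 0.5],
})


def overview(df: pd.DataFrame) -> dict:
    """Collect a few dataset-level figures similar to those in an overview."""
    n_cells = df.shape[0] * df.shape[1]
    missing = int(df.isna().sum().sum())  # total number of empty cells
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": missing,
        "missing_pct": round(100 * missing / n_cells, 1) if n_cells else 0.0,
        "duplicate_rows": int(df.duplicated().sum()),  # exact repeats of an earlier row
    }


summary = overview(sample)
print(summary)
```

Running the same helper on the full dataframe, as in overview(df), gives a quick sanity check before generating the whole report.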

Variables

Price variable in the variables section

Unlike the general view, the variables section gives a detailed analysis of each variable, including its distribution and its main statistical indicators.
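As a rough illustration of the kind of per-variable indicators involved, the sketch below computes a few of them with plain pandas. The `prices` series contains invented values standing in for a price column; it is an assumption for the example, not data from the report.

```python
import pandas as pd

# Hypothetical price values with one extreme listing to make the skew visible.
prices = pd.Series([50, 80, 100, 120, 150, 1000], name="price")

stats = prices.describe()          # count, mean, std, min, quartiles, max
skewness = prices.skew()           # positive when the right tail is long
distinct = prices.nunique()        # number of distinct values
zeros = int((prices == 0).sum())   # how many listings have a price of zero

print(stats)
print("skewness:", skewness, "distinct:", distinct, "zeros:", zeros)
```

A mean far above the median, as here, is the kind of signal that suggests looking at outliers before modeling.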

Interactions & Correlations

Correlations section

In these sections, we find the interactions and correlations between the variables, which indicate the dependence between each pair of variables and their density.
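The correlation matrices can be approximated directly in pandas, which is handy when only one or two pairs of variables matter. This is a minimal sketch on an invented `sample` dataframe; the column names mirror the Airbnb dataset, but the values are assumptions chosen to make the result easy to read.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
sample = pd.DataFrame({
    "price": [50, 80, 100, 120, 150],
    "minimum_nights": [1, 2, 3, 4, 5],
    "availability_365": [300, 250, 200, 100, 50],
    "room_type": ["Private", "Private", "Entire", "Entire", "Entire"],
})

# Correlation coefficients are only defined for numeric columns.
numeric = sample.select_dtypes(include="number")

pearson = numeric.corr(method="pearson")    # linear dependence
spearman = numeric.corr(method="spearman")  # monotonic (rank-based) dependence

print(pearson.round(2))
print(spearman.round(2))
```

Values close to 1 or -1 point to strongly dependent pairs, which are candidates for removal or closer inspection.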

Missing values & Sample

Missing values section

These sections are very useful for checking which features have the most missing values, and therefore which data must be transformed afterwards. They also show some sample rows of the data.
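The same two checks, missing values per feature and a look at some rows, can be sketched in plain pandas. The `sample` dataframe below is hypothetical, with invented values.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
sample = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Flat"],
    "last_review": ["2019-05-01", None, None, "2019-06-15"],
    "price": [150, 80, 200, 95],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

first_rows = sample.head(2)  # first rows of the data
last_rows = sample.tail(2)   # last rows of the data

print(missing)
print(first_rows)
print(last_rows)
```

Columns at the top of the `missing` table are the first ones to consider for imputation or removal.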

Conclusion

This package gives us a quick first contact with the data and helps us decide how to use it. We have used the default configuration with a single command, but pandas profiling offers a multitude of configuration options to adapt the report to our requirements.
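As one example of such a configuration, the sketch below uses the `minimal=True` argument of `ProfileReport`, which switches off the most expensive computations and can help with large datasets. The `build_minimal_report` helper and its default title are hypothetical names introduced here for illustration.

```python
import pandas as pd


def build_minimal_report(df: pd.DataFrame, title: str = "New York Report (minimal)"):
    """Build a lighter report by disabling the most expensive computations."""
    # Imported inside the function so the helper can be defined
    # even in an environment where the package is not installed yet.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title, minimal=True)
```

With the dataframe loaded earlier, build_minimal_report(df).to_file("your_report_minimal.html") would write the lighter report to disk; the output file name is only an example.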