I recently wanted to plot a facet grid of histograms for all the columns in a Pandas dataframe.
This kind of plot is very useful for initial data exploration and is readily available in other statistical packages (e.g. SAS and R). In Pandas it can be done with the hist() method if all the columns are numerical, but hist() ignores columns which contain categorical data.
There are probably already libraries that can achieve this but I wrote a short function using matplotlib as follows. For information, I was using the SAS buytest dataset.
1. Specify the list of columns you want to plot
# which columns from the data frame to plot
# (df.ix has been removed from pandas, so filter the column index directly)
columns_to_plot = df.columns[df.columns != 'Customer ID']
2. Depending on how many columns of data you have, work out the number of facet plots and their configuration. This assumes five plots per row, which fits quite well in a Jupyter notebook.
import math

# facet plot dimensions
plot_columns = 5
plot_rows = math.ceil(len(columns_to_plot) / plot_columns)
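As a quick check of the arithmetic, the sketch below computes the grid shape for a hypothetical 12 columns of data and shows where each plot would land. The helper name `grid_shape` and the example count are illustrative assumptions, not part of the original function.

```python
import math


def grid_shape(n_plots, plot_columns=5):
    """Return (rows, columns) needed to hold n_plots in a grid (hypothetical helper)."""
    plot_rows = math.ceil(n_plots / plot_columns)
    return plot_rows, plot_columns


# Example: 12 columns of data laid out five per row
rows, cols = grid_shape(12)
print(rows, cols)  # 3 5

# divmod maps a running index to its (row, column) position in the grid
positions = [divmod(i, cols) for i in range(12)]
print(positions[0], positions[5], positions[11])  # (0, 0) (1, 0) (2, 1)
```

The last row is only partly filled whenever the number of plots is not a multiple of `plot_columns`, so some axes in the grid may stay empty.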
3. Create the plot figure
import matplotlib.pyplot as plt

# create figure
fig, ax = plt.subplots(plot_rows, plot_columns, figsize=(15, 20))
fig.subplots_adjust(hspace=0.6)  # adjust vertical spacing between plots
4. Iterate through the list of columns. The first operation is to take the column title and split it into four-word lines so that it fits better above each chart.
# iterate through columns and create each chart
for i, column in enumerate(columns_to_plot):
    # split the title into four-word long lines
    # (k is used in the comprehension so the loop counter i is not shadowed)
    title = column.split(' ')
    title = [' '.join(title[k:k+4]) for k in range(0, len(title), 4)]
    title = '\n'.join(title)
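To see the title wrapping in isolation, here is a minimal sketch of the same split-and-join idea as a standalone function. The name `wrap_title`, the `words_per_line` parameter and the sample column name are all made up for illustration.

```python
def wrap_title(title, words_per_line=4):
    """Break a title into lines of at most words_per_line words (hypothetical helper)."""
    words = title.split(' ')
    lines = [' '.join(words[k:k + words_per_line])
             for k in range(0, len(words), words_per_line)]
    return '\n'.join(lines)


# A made-up column name, for illustration only
print(wrap_title('Number of purchases in the last twelve months'))
# Number of purchases in
# the last twelve months
```

Wrapping by word count keeps the logic simple. The standard library's `textwrap.fill` wraps by character width instead, which may suit columns with very long words better.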
5. Because some of the data is categorical, we need to calculate the value counts for each column in order to plot the histogram
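    # bin the data to create histograms for numeric and categorical data
    # (sort_index orders the counts by value and does not depend on the
    # column names that reset_index produces, which vary between pandas versions)
    df_ = df[column].value_counts().sort_index()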
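The sketch below shows what this binning step produces for one numeric and one categorical column. The data frame is made up, and `sort_index()` is used here as a compact way to order the counts by value, which is an assumption of this sketch rather than necessarily the exact chain used in the original function.

```python
import pandas as pd

# Small made-up frame with one numeric and one categorical column
df_example = pd.DataFrame({
    'Age': [25, 31, 25, 40, 31, 25],
    'Gender': ['F', 'M', 'F', 'F', 'M', 'F'],
})

binned = {}
for column in df_example.columns:
    # count occurrences of each distinct value, ordered by the value itself
    binned[column] = df_example[column].value_counts().sort_index()
    print(column, binned[column].index.tolist(), binned[column].tolist())
# Age [25, 31, 40] [3, 2, 1]
# Gender ['F', 'M'] [4, 2]
```

Both columns end up in the same shape (distinct values in the index, counts as the values), which is what lets a single plotting routine handle them.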
6. Because some of the index values will be strings (whenever the column isn't numeric), we create numerical x_ticks to plot against.
    # create integer list of x_ticks. Can't plot string values
    x_ticks = list(range(len(df_.index)))
7. But, some of the indexes contain a lot of individual values which will clutter the x axis, so we need to work out the interval of index values to plot.
    # we only want about 10 ticks on the x_axis so calculate the tick interval
    x_label_count = 1 if len(x_ticks) < 10 else math.ceil(len(x_ticks) / 10)
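To illustrate how the interval behaves, the following sketch applies the same rule to a few made-up tick counts. The function name `tick_interval` and the `max_labels` parameter are hypothetical.

```python
import math


def tick_interval(n_ticks, max_labels=10):
    """Return the step between labelled ticks so roughly max_labels are shown (hypothetical helper)."""
    return 1 if n_ticks < max_labels else math.ceil(n_ticks / max_labels)


for n in (4, 10, 37, 250):
    step = tick_interval(n)
    labelled = list(range(n))[::step]  # the ticks that would receive a label
    print(n, step, len(labelled))
# 4 1 4
# 10 1 10
# 37 4 10
# 250 25 10
```

Rounding the interval up with `math.ceil` means the number of labels never exceeds the target, at the cost of occasionally showing slightly fewer.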
8. Now we can plot the data and set the title to the column name, wrapped into four-word lines
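    # plot the data
    # (the row is i // plot_columns for both calls; dividing by plot_rows
    # would put the title on the wrong axes)
    ax[i // plot_columns, i % plot_columns].bar(x_ticks, df_.values)
    ax[i // plot_columns, i % plot_columns].set_title(title)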
9. Finally, we need to replace the numeric x_ticks with what the actual index values were, and also adjust their position so they fit under the middle of the histogram bars.
    # use the original index data (e.g. including strings) as labels on the x_axis
    # but offset their position by 0.4 times the interval between ticks
    # (assumes bars are 0.8 wide and edge-aligned; bars are centre-aligned by
    # default since matplotlib 2.0, in which case no offset is needed)
    tick_interval = 1  # x_ticks are consecutive integers
    x_ticks = [x + tick_interval * 0.4 for x in x_ticks]
    ax[i // plot_columns, i % plot_columns].set_xticks(x_ticks[::x_label_count])
    ax[i // plot_columns, i % plot_columns].set_xticklabels(df_.index[::x_label_count], rotation=90)
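Putting the steps together, one possible way to wrap them in a single function is sketched below. This is not the code from the gist: the function name `facet_histograms`, the demo data, the `Agg` backend, `squeeze=False`, explicit `align='center'` (so no tick offset is needed) and hiding unused axes are all assumptions made for this sketch.

```python
import math

import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def facet_histograms(df, columns_to_plot, plot_columns=5, figsize=(15, 20)):
    """Draw one bar chart of value counts per column (hypothetical wrapper)."""
    plot_rows = math.ceil(len(columns_to_plot) / plot_columns)
    # squeeze=False keeps ax two-dimensional even when there is a single row
    fig, ax = plt.subplots(plot_rows, plot_columns, figsize=figsize, squeeze=False)
    fig.subplots_adjust(hspace=0.6)

    for i, column in enumerate(columns_to_plot):
        row, col = divmod(i, plot_columns)

        # split the title into four-word lines
        words = str(column).split(' ')
        title = '\n'.join(' '.join(words[k:k + 4]) for k in range(0, len(words), 4))

        # value counts ordered by value, for numeric and categorical data alike
        counts = df[column].value_counts().sort_index()
        x_ticks = list(range(len(counts)))
        step = 1 if len(x_ticks) < 10 else math.ceil(len(x_ticks) / 10)

        # align='center' puts each tick under the middle of its bar
        ax[row, col].bar(x_ticks, counts.values, width=0.8, align='center')
        ax[row, col].set_title(title)
        ax[row, col].set_xticks(x_ticks[::step])
        ax[row, col].set_xticklabels([str(v) for v in counts.index[::step]], rotation=90)

    # hide any unused axes in the last row
    for j in range(len(columns_to_plot), plot_rows * plot_columns):
        row, col = divmod(j, plot_columns)
        ax[row, col].set_visible(False)
    return fig, ax


# Made-up data for illustration; the original post used the SAS buytest dataset
df_demo = pd.DataFrame({
    'Customer ID': [1, 2, 3, 4],
    'Age': [25, 31, 25, 40],
    'Gender': ['F', 'M', 'F', 'F'],
})
fig, ax = facet_histograms(df_demo, df_demo.columns[df_demo.columns != 'Customer ID'])
```

In a Jupyter notebook the `matplotlib.use('Agg')` line can be dropped so that the figure renders inline.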
The full code is here: https://gist.github.com/quizzicol/b96ab6b3129e25a0ba47bb69237fdb48