Descriptive statistics in SAEPER
For a discussion of various descriptive statistics, see the corresponding chapter in Summary and Analysis of Extension Program Evaluation in R (rcompanion.org/handbook/C_02.html).
Importing packages in this chapter
The following commands will import the required packages used in this chapter and assign them their common aliases. You may need to install these packages first.
import io
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy.stats.mstats as mstats
from pandas.api.types import CategoricalDtype
Descriptive statistics for interval/ratio data
For this example, imagine that Ren and Stimpy have each held eight workshops educating the public about water conservation at home. They are interested in how many people showed up to the workshops.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South 14
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North 15
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
"""))
### Convert Instructor and Location to category type
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Location'] = Data['Location'].astype('category')
### Display some summary statistics for the data frame
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null category
1 Location 16 non-null category
2 Attendees 16 non-null int64
print(Data.describe())
Attendees
count 16.000000
mean 14.500000
std 4.830459
min 6.000000
25% 11.750000
50% 15.000000
75% 17.250000
max 23.000000
print(Data.describe(include='category'))
Instructor Location
count 16 16
unique 2 2
top Ren North
freq 8 8
Functions for sum and count
For a pandas data frame, the sum of a variable can be found with the sum function, and the number of observations can be found with the count function.
Because the data are housed in a data frame, we can use the convention Data['Attendees'] to access the variable Attendees within the data frame Data.
Data['Attendees'].sum()
232
Data['Attendees'].count()
16
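Note that count excludes missing values, whereas Python's built-in len function counts all elements, missing or not. With no missing values here, the two agree.

len(Data['Attendees'])

16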
Statistics of location for interval/ratio data
Mean
The mean is simply the sum of the values divided by the number of values.
Data['Attendees'].sum() / Data['Attendees'].count()
14.5
The mean function in pandas will return the mean.
Data['Attendees'].mean()
14.5
With skewed data, the mean may not be the best representative of the center of the data. Here, a single extreme value pulls the mean of family incomes in a town far above the typical income.
IncomeList = [49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000]
Income = pd.Series(IncomeList)
Income.mean()
233300.0
Median
The pandas package can also calculate the median for a variable in a data frame or for an array.
Data['Attendees'].median()
15.0
For individual arrays, we can use pd.Series to define the array.

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23])

Attendees2.median()

15.0

In the following example, we’ll change Stimpy’s last Attendees value from 23 to 1000 to show that this extreme value doesn’t affect the median.

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 1000])

Attendees2.median()

15.0
The median income for the town discussed above is $40,500. Half the families in the town have an income above this amount, and half have an income below this amount. Note that this differs considerably from the mean calculated above.
Income = pd.Series(
[49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])
Income.median()
40500.0
Mode
For our Ren and Stimpy example, the value 15 occurs three times in Attendees and so is the mode.
Data['Attendees'].mode()
0    15
dtype: int64
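The most frequent value can also be found by tabulating the values with the value_counts function and selecting the index of the largest count with idxmax.

Data['Attendees'].value_counts().idxmax()

15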
Statistics of variation for interval/ratio data
Standard deviation
Data['Attendees'].std()
4.83045891539648
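The sample standard deviation is the square root of the sum of squared deviations from the mean divided by n – 1. As a check, we can compute it directly.

SS = ((Data['Attendees'] - Data['Attendees'].mean())**2).sum()

print(np.sqrt(SS / (Data['Attendees'].count() - 1)))

4.83045891539648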
Income = pd.Series([49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])
Income.std()
620834.5727056981
Median absolute deviation
The median absolute deviation for a variable in a data frame or for an array can be calculated with the stats module in SciPy. The scale option adjusts the result; for example, scale='normal' returns a result comparable to the standard deviation for normally distributed samples. scale='normal' is equivalent to scale=0.67449.
stats.median_abs_deviation(Data['Attendees'], scale='normal')
4.447805008228439
stats.median_abs_deviation(Data['Attendees'])
3.0
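As a check, the unscaled result is simply the median of the absolute deviations from the median.

print((Data['Attendees'] - Data['Attendees'].median()).abs().median())

3.0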
Income = pd.Series([49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])
stats.median_abs_deviation(Income, scale='normal')
11119.512520571097
stats.median_abs_deviation(Income)
7500.0
Standard error of the mean
The standard error of the mean can be calculated as the standard deviation divided by the square root of the number of observations. It can also be calculated with the sem function in SciPy stats.
SD = Data['Attendees'].std()
Length = Data['Attendees'].count()
print(SD / np.sqrt(Length))
1.20761472884912
stats.sem(Data['Attendees'])
1.20761472884912
Five-number summary, quartiles, percentiles
The five-number summary, consisting of the minimum, 25th percentile, median, 75th percentile, and maximum, can be calculated with the describe function in pandas.
Summary = Data['Attendees'].describe()
print(Summary)
count 16.000000
mean 14.500000
std 4.830459
min 6.000000
25% 11.750000
50% 15.000000
75% 17.250000
max 23.000000
Quantiles can also be calculated with the quantile function in pandas and with the quantile function in NumPy. The function in NumPy has more options for estimation methods for quantiles.
Data['Attendees'].quantile(0.25)
11.75
### Note that the default method matches the result from the default in R.
Data['Attendees'].quantile(0.25, interpolation='midpoint')
11.5
### In this case the 'midpoint' option reports the midpoint between 11 and 12,
### the 4th and 5th observations when Attendees is sorted.
np.quantile(Data['Attendees'], 0.25)
11.75
np.quantile(Data['Attendees'], 0.25, method='midpoint')
11.5
Calculate the 95th percentile.
Data['Attendees'].quantile(0.95)
22.25
Statistics for grouped interval/ratio data
Continuing to examine the Ren and Stimpy data, summary statistics by group and by combination of groups can be obtained with the describe and groupby functions in pandas.
Summary = Data.groupby('Instructor')['Attendees'].describe()
print(Summary)
count mean std min 25% 50% 75% max
Instructor
Ren 8.0 13.125 5.083236 6.0 10.75 13.5 15.25 22.0
Stimpy 8.0 15.875 4.454131 9.0 14.00 16.0 18.25 23.0
Summary2 = Data.groupby(['Instructor', 'Location'])['Attendees'].describe()
print(Summary2)
count mean std min 25% 50% 75% max
Instructor Location
Ren North 4.0 12.50 7.505553 6.0 6.75 11.0 16.75 22.0
South 4.0 13.75 1.707825 12.0 12.75 13.5 14.50 16.0
Stimpy North 4.0 14.75 4.031129 9.0 13.50 16.0 17.25 18.0
South 4.0 17.00 5.163978 11.0 14.00 17.0 20.00 23.0
Adding variables to a data frame
For this example, we can add the standard error of the mean to summary statistics of the grouped data.
Summary2['sem'] = Summary2['std'] / np.sqrt(Summary2['count'])
print(Summary2)
count mean std ... 75% max sem
Instructor Location ...
Ren North 4.0 12.50 7.505553 ... 16.75 22.0 3.752777
South 4.0 13.75 1.707825 ... 14.50 16.0 0.853913
Stimpy North 4.0 14.75 4.031129 ... 17.25 18.0 2.015564
South 4.0 17.00 5.163978 ... 20.00 23.0 2.581989
Other statistics for grouped variables
The groupby function in pandas can be used to calculate summary statistics not included in describe. Here, the 95th percentile is calculated for each combination of Instructor and Location.
Summary3 = Data.groupby(['Instructor', 'Location'])['Attendees'].quantile(0.95)
print(Summary3)
Instructor Location
Ren North 20.95
South 15.70
Stimpy North 17.85
South 22.40
Summary3 in this case is a pandas series. We can change the name of the variable for convenience.
print(Summary3.name)

Attendees
Summary3.name = '95thPercentile'
print(Summary3)
Instructor Location
Ren North 20.95
South 15.70
Stimpy North 17.85
South 22.40
Name: 95thPercentile, dtype: float64
We can also convert Summary3 to a data frame.
Summary4 = pd.DataFrame(Summary3)
print(Summary4)
95thPercentile
Instructor Location
Ren North 20.95
South 15.70
Stimpy North 17.85
South 22.40
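Equivalently, the to_frame method in pandas performs the same conversion.

Summary4 = Summary3.to_frame()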
Summaries for data frames
Summary statistics for variables in a data frame can be obtained with the pandas functions info and describe.
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null category
1 Location 16 non-null category
2 Attendees 16 non-null int64
print(Data.describe())
Attendees
count 16.000000
mean 14.500000
std 4.830459
min 6.000000
25% 11.750000
50% 15.000000
75% 17.250000
max 23.000000
Dealing with missing values
For this example, we’ll change two values of Attendees to missing values. Here, the missing values are coded with NA, which is the convention in R. read_table will convert them to nan values, which is the usual convention in Python for missing values or other values that are “not a number”.
Note in the summary from the info call that there are 14 non-null values for Attendees.
Data2 = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South NA
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North NA
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
"""))
print(Data2.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null object
1 Location 16 non-null object
2 Attendees 14 non-null float64
By default, many pandas functions will remove nan values before calculating. If nan values are included in a calculation, the returned result will often be nan.
Data2['Attendees'].mean()
14.5
Data2['Attendees'].mean(skipna=False)
nan
Removing missing values
The pandas function dropna can remove rows with missing values. Observations with no missing values for any variable are sometimes called “complete cases”. The process is sometimes called “listwise deletion”.
Note here that the data frame DataComplete has 14 observations. The two rows with nan observations in Attendees have been removed from the data frame.
DataComplete = Data2.dropna()
print(DataComplete.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 14 non-null object
1 Location 14 non-null object
2 Attendees 14 non-null float64
The dropna function can also remove any columns, or variables, that have missing values. Note here that the variable Attendees has been removed entirely, but Instructor and Location retain 16 observations.
DataCompleteColumns = Data2.dropna(axis=1)
print(DataCompleteColumns.info())
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null object
1 Location 16 non-null object
Counting missing values
The isnull function will return True if there are any missing values in a data frame or column in a data frame.
Data2.isnull().values.any()
True
Data2['Attendees'].isnull().values.any()
True
Data2['Location'].isnull().values.any()
False
Likewise, the isnull and sum functions can be combined to count the missing values in a data frame or variable in a data frame.
Data2.isnull().sum()
Instructor 0
Location 0
Attendees 2
### Number of nan values in each variable
Data2['Attendees'].isnull().sum()
2
### Number of nan values in Attendees in Data2
Data2['Location'].isnull().sum()
0
### Number of nan values in Location in Data2
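Conversely, combining the notnull and sum functions counts the non-missing values.

Data2['Attendees'].notnull().sum()

14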
Missing values in arrays
For a pd.Series, missing values can be designated with None.
Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, None, 16, 18, 17, None, 9, 15, 11, 19, 23])
print(Attendees2.info())
RangeIndex: 16 entries, 0 to 15
Series name: None
Non-Null Count Dtype
-------------- -----
14 non-null float64
### 16 values, 14 of which are non-null
print(Attendees2.isnull().sum())
2
### 2 null values
print(Attendees2.mean())
14.5
print(Attendees2.mean(skipna=False))
nan
Statistics of shape for interval/ratio data
Returning to the original Ren and Stimpy data set, we can obtain the skewness and kurtosis for Attendees with the skew and kurtosis functions in pandas.
There are different methods to calculate skew and kurtosis, so different software may report different results by default. See the documentation for details. It appears that pandas calculates these statistics in the manner of type=2 in R with psych::describe.
print(Data['Attendees'].skew())
-0.0506986941557648
print(Data['Attendees'].kurtosis())
-0.32657277416461117
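For comparison, the skew and kurtosis functions in SciPy stats calculate bias-corrected statistics when bias=False is specified; these should match the pandas results, though it is worth verifying the calculation methods in the documentation for your installed versions.

print(stats.skew(Data['Attendees'], bias=False))

print(stats.kurtosis(Data['Attendees'], bias=False))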
Descriptive statistics for ordinal data
To obtain descriptive statistics for ordinal data, the data can be examined as either numeric data or nominal data. In Python, the data could also be coded as ordered category data.
Example of descriptive statistics for ordinal data
For this example, Arthur and Buster have each held a workshop educating the public about preventing water pollution at home. They are interested in the education level of people who attended the workshops. We can consider education level an ordinal variable. They also collected information on the gender and county of participants.
For these data, we have manually coded Education with numbers in a separate variable EdCode. These numbers list the education levels in order: High school < Associate’s < Bachelor’s < Master’s < Ph.D.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Date Instructor Student Gender County Education EdCode
2015-11-01 "Arthur Read" a female "Elwood" BA 3
2015-11-01 "Arthur Read" b female "Bear Lake" PHD 5
2015-11-01 "Arthur Read" c male "Elwood" BA 3
2015-11-01 "Arthur Read" d female "Elwood" MA 4
2015-11-01 "Arthur Read" e male "Elwood" HS 1
2015-11-01 "Arthur Read" f female "Bear Lake" MA 4
2015-11-01 "Arthur Read" g male "Elwood" HS 1
2015-11-01 "Arthur Read" h female "Elwood" BA 3
2015-11-01 "Arthur Read" i female "Elwood" BA 3
2015-11-01 "Arthur Read" j female "Elwood" BA 3
2015-12-01 "Buster Baxter" k male "Elwood" MA 4
2015-12-01 "Buster Baxter" l male "Bear Lake" MA 4
2015-12-01 "Buster Baxter" m female "Elwood" AA 2
2015-12-01 "Buster Baxter" n male "Elwood" AA 2
2015-12-01 "Buster Baxter" o other "Elwood" BA 3
2015-12-01 "Buster Baxter" p female "Elwood" BA 3
2015-12-01 "Buster Baxter" q female "Bear Lake" PHD 5
"""))
### Treat Date, Instructor, Gender, County, and Education as category type
Data['Date'] = Data['Date'].astype('category')
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Gender'] = Data['Gender'].astype('category')
Data['County'] = Data['County'].astype('category')
Data['Education'] = Data['Education'].astype('category')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
By default, the order of categories will be alphabetized.
print(Data['Education'].cat.categories)
Index(['AA', 'BA', 'HS', 'MA', 'PHD'], dtype='object')
The levels of categories can be re-ordered. This is useful for examining output and plots.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)
print(Data['Education'].cat.categories)
Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
Optional code to assign numeric values to a variable based on a category variable
Here, EdCode was already included as a numeric equivalent to Education. But we could create a numeric variable to reflect the numeric equivalent of Education if it weren’t already included in the data frame.
Here, we'll create a new variable EdCode2 with the numeric assignment for Education.
This can be accomplished by creating a dictionary, here called Mapper.
Mapper = {"HS":1, "AA":2, "BA":3, "MA":4, "PHD":5}
Data["EdCode2"] = Data["Education"].replace(Mapper)
Data['EdCode2'] = Data['EdCode2'].astype('int64')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
7 EdCode2 17 non-null int64
This could also be accomplished by manually defining a new variable, EdCode3, based on the value of the variable Education.
Data['EdCode3'] = 9999   ### Placeholder value, to be overwritten below
Data.loc[Data['Education'] == 'HS', 'EdCode3'] = 1
Data.loc[Data['Education'] == 'AA', 'EdCode3'] = 2
Data.loc[Data['Education'] == 'BA', 'EdCode3'] = 3
Data.loc[Data['Education'] == 'MA', 'EdCode3'] = 4
Data.loc[Data['Education'] == 'PHD', 'EdCode3'] = 5
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
7 EdCode2 17 non-null int64
8 EdCode3 17 non-null int64
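As a quick check, the hand-coded variable should match the original EdCode at every observation.

print((Data['EdCode'] == Data['EdCode3']).all())

True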
Optional code to change a category variable to an ordered category variable
Ordinal variables can be explicitly designated as ordered category variables.
Note that when the array for EducationOrdered is printed, the order is indicated with less than symbols, e.g. 'HS' < 'AA'.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['EducationOrdered'] = Data['Education'].astype(CategoricalDtype(
categories=EducationLevels,
ordered=True))
print(Data['EducationOrdered'])
Name: EducationOrdered, dtype: category
Categories (5, object): ['HS' < 'AA' < 'BA' < 'MA' < 'PHD']
Dropping unused variables
### Drop EdCode2 and EdCode3
Data = Data.drop(columns=['EdCode2', 'EdCode3', 'EducationOrdered'])
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
Statistics of location for ordinal data
The median can be used as a statistic for location for ordinal data. Other quantiles (25th percentile, and so on) also make sense for ordinal values. Usually, the mean and standard deviation shouldn’t be used.
Data['EdCode'].median()
3.0
### Remember that 3 meant Bachelor’s degree in our coding
Summary = Data['EdCode'].describe()
print(Summary)
count    17.000000
mean      3.117647
std       1.166316
min       1.000000
25%       3.000000
50%       3.000000
75%       4.000000
max       5.000000
Name: EdCode, dtype: float64

### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.
### Remember to ignore the mean and standard deviation values for ordinal data.
Statistics of variation for ordinal data
The interquartile range is the difference between the 25th percentile and the 75th percentile. For ordinal data, the interquartile range itself may not make sense, since it requires taking a difference of ordinal values. However, identifying the 25th and 75th percentiles is useful, as 50% of observations fall within these values.
In this example, the 25th percentile corresponds to Bachelor’s, and the 75th percentile corresponds to Master’s.
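For example, the 25th and 75th percentiles of EdCode can be calculated with the quantile function.

Data['EdCode'].quantile([0.25, 0.75])

0.25    3.0
0.75    4.0

### Remember that 3 meant Bachelor’s and 4 meant Master’s in our coding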
Statistics for grouped ordinal data
Summary = Data[['Gender', 'EdCode']].groupby('Gender').describe()
print(Summary)
EdCode
count mean std min 25% 50% 75% max
Gender
female 10.0 3.5 0.971825 2.0 3.00 3.0 4.00 5.0
male 6.0 2.5 1.378405 1.0 1.25 2.5 3.75 4.0
other 1.0 3.0 NaN 3.0 3.00 3.0 3.00 3.0
### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.
### Remember to ignore the mean and standard deviation values for ordinal data.
Summary = Data[['Gender', 'County', 'EdCode']].groupby(['Gender', 'County']).describe()
print(Summary)
EdCode
count mean std min 25% 50% 75% max
Gender County
female Bear Lake 3.0 4.666667 0.57735 4.0 4.5 5.0 5.0 5.0
Elwood 7.0 3.000000 0.57735 2.0 3.0 3.0 3.0 4.0
male Bear Lake 1.0 4.000000 NaN 4.0 4.0 4.0 4.0 4.0
Elwood 5.0 2.200000 1.30384 1.0 1.0 2.0 3.0 4.0
other Elwood 1.0 3.000000 NaN 3.0 3.0 3.0 3.0 3.0
### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.
### Remember to ignore the mean and standard deviation values for ordinal data.
Statistics for ordinal data treated as nominal data
Ordinal data can also be summarized with the methods used for nominal data.
If the levels of the categorical variables are not already in the correct order, it is useful to reorder the variables.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)
print(Data['Education'].cat.categories)
Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
Some basic information for categorical variables can be obtained with describe in pandas.
Summary = Data.describe(include='category')
print(Summary)
Date Instructor Gender County Education
count 17 17 17 17 17
unique 2 2 3 2 5
top 2015-11-01 Arthur Read female Elwood BA
freq 10 10 10 13 7
The counts for each level of a categorical variable can be found with the value_counts function.
Data['Education'].value_counts()
BA 7
MA 4
HS 2
AA 2
PHD 2
But note that the levels of Education are ordered by frequency in the output. We can impose the natural order of the categories with the reindex function, adding fill_value to supply the value for any level with no observations.
Data['Education'].value_counts().reindex(['HS','AA','BA','MA','PHD'], fill_value=0)
HS 2
AA 2
BA 7
MA 4
PHD 2
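The value_counts function can also report proportions directly with the normalize=True option.

Data['Education'].value_counts(normalize=True).reindex(['HS','AA','BA','MA','PHD'], fill_value=0)

HS     0.117647
AA     0.117647
BA     0.411765
MA     0.235294
PHD    0.117647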
The crosstab function in pandas can be used to tabulate counts for one categorical variable or for combinations of categorical variables.
For a single variable, the 'Count' in index='Count' is just a placeholder label for the single row of the table.
Table = pd.crosstab(columns=Data['Education'], index='Count')
print(Table)
Education HS AA BA MA PHD
row_0
Count 2 2 7 4 2
The normalize option can be used to report the proportions. Here, normalize='index' is used to report the proportion within rows.
Table = pd.crosstab(columns=Data['Education'], index='Count', normalize='index')
print(Table)
Education HS AA BA MA PHD
row_0
Count 0.117647 0.117647 0.411765 0.235294 0.117647
Grouped data
Table = pd.crosstab(columns=Data['Education'], index=Data['Gender'])
print(Table)
Education HS AA BA MA PHD
Gender
female 0 1 5 2 2
male 2 1 1 2 0
other 0 0 1 0 0
Note, here, that normalize='index' is used to report the proportion within rows. Other options are 'columns' and 'all' to determine the proportion within columns or within the table.
Table = pd.crosstab(columns=Data['Education'], index=Data['Gender'], normalize='index')
print(Table)
Education HS AA BA MA PHD
Gender
female 0.000000 0.100000 0.500000 0.200000 0.2
male 0.333333 0.166667 0.166667 0.333333 0.0
other 0.000000 0.000000 1.000000 0.000000 0.0
Two-way grouped data
When the resulting table would have three or more dimensions, pandas reports it in long format.
Table = Data[['County', 'Gender', 'Education']].groupby(['County', 'Gender']).value_counts()
print(Table)
County Gender Education
Bear Lake female PHD 2
MA 1
HS 0
AA 0
BA 0
male MA 1
PHD 0
BA 0
AA 0
HS 0
other PHD 0
MA 0
BA 0
AA 0
HS 0
Elwood female BA 5
MA 1
AA 1
PHD 0
HS 0
male HS 2
AA 1
BA 1
MA 1
PHD 0
other BA 1
HS 0
AA 0
MA 0
PHD 0
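If a two-dimensional layout is preferred, one index level of the long-format series can be pivoted into the columns with the unstack function in pandas. Here, the columns follow the category order of Education.

print(Table.unstack('Education'))

Education          HS  AA  BA  MA  PHD
County    Gender
Bear Lake female    0   0   0   1    2
          male      0   0   0   1    0
          other     0   0   0   0    0
Elwood    female    0   1   5   1    0
          male      2   1   1   1    0
          other     0   0   1   0    0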
Another approach would be to separate out the levels of one of the variables to produce multiple two-dimensional tables. Here, the data are subsetted to include only Bear Lake County. The dropna=False option retains the columns or rows with all 0 counts.
DataBear = Data.loc[Data['County'] == 'Bear Lake']
TableBear = pd.crosstab(columns=DataBear['Education'], index=DataBear['Gender'],
dropna=False)
print(TableBear)
Education HS AA BA MA PHD
Gender
female 0 0 0 1 2
male 0 0 0 1 0
other 0 0 0 0 0
DataElwood = Data.loc[Data['County'] == 'Elwood']
TableElwood = pd.crosstab(columns=DataElwood['Education'], index=DataElwood['Gender'],
dropna=False)
print(TableElwood)
Education HS AA BA MA PHD
Gender
female 0 1 5 1 0
male 2 1 1 1 0
other 0 0 1 0 0
Descriptive statistics for nominal data
Example of descriptive statistics for nominal data
This example will use the same Arthur and Buster data set used above. The analysis here is similar to that for treating ordinal data as nominal above; variants of some of these analyses can be found in that section.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Date Instructor Student Gender County Education EdCode
2015-11-01 "Arthur Read" a female "Elwood" BA 3
2015-11-01 "Arthur Read" b female "Bear Lake" PHD 5
2015-11-01 "Arthur Read" c male "Elwood" BA 3
2015-11-01 "Arthur Read" d female "Elwood" MA 4
2015-11-01 "Arthur Read" e male "Elwood" HS 1
2015-11-01 "Arthur Read" f female "Bear Lake" MA 4
2015-11-01 "Arthur Read" g male "Elwood" HS 1
2015-11-01 "Arthur Read" h female "Elwood" BA 3
2015-11-01 "Arthur Read" i female "Elwood" BA 3
2015-11-01 "Arthur Read" j female "Elwood" BA 3
2015-12-01 "Buster Baxter" k male "Elwood" MA 4
2015-12-01 "Buster Baxter" l male "Bear Lake" MA 4
2015-12-01 "Buster Baxter" m female "Elwood" AA 2
2015-12-01 "Buster Baxter" n male "Elwood" AA 2
2015-12-01 "Buster Baxter" o other "Elwood" BA 3
2015-12-01 "Buster Baxter" p female "Elwood" BA 3
2015-12-01 "Buster Baxter" q female "Bear Lake" PHD 5
"""))
### Treat Date, Instructor, Gender, County, and Education as category type
Data['Date'] = Data['Date'].astype('category')
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Gender'] = Data['Gender'].astype('category')
Data['County'] = Data['County'].astype('category')
Data['Education'] = Data['Education'].astype('category')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
dtypes: category(5), int64(1), object(1)
Some basic information for categorical variables can be obtained with describe in pandas.
Summary = Data.describe(include='category')
print(Summary)
Date Instructor Gender County Education
count 17 17 17 17 17
unique 2 2 3 2 5
top 2015-11-01 Arthur Read female Elwood BA
freq 10 10 10 13 7
One-sample data
Table = pd.crosstab(columns=Data['Gender'], index='Count')
print(Table)
Gender female male other
row_0
Count 10 6 1
### Counts of each level of Gender
Table = pd.crosstab(columns=Data['Gender'], index='Count', normalize='index')
print(Table)
Gender female male other
row_0
Count 0.588235 0.352941 0.058824
### Proportion of each level Gender for each row
One-way data
Table = pd.crosstab(columns=Data['Gender'], index=Data['Date'])
print(Table)
Gender female male other
Date
2015-11-01 7 3 0
2015-12-01 3 3 1
### Counts for Gender within each level of Date
TableProp = pd.crosstab(columns=Data['Gender'], index=Data['Date'], normalize='index')
print(TableProp)
Gender female male other
Date
2015-11-01 0.700000 0.300000 0.000000
2015-12-01 0.428571 0.428571 0.142857
### Proportion of each level of Gender for each row
sum(Table.sum())
17
### Sum of observations in the table
Table.sum(axis=1)
Date
2015-11-01 10
2015-12-01 7
### Sum of observations in each row of the table
Table.sum(axis=0)
Gender
female 10
male 6
other 1
### Sum of observations in each column of the table
Two-way data
Table = Data[['Date', 'County', 'Gender']].groupby(['Date', 'County']).value_counts()
print(Table)
### Note the placement of Date and County in the groupby call,
### and the resultant placement in the output.
Date County Gender
2015-11-01 Bear Lake female 2
male 0
other 0
Elwood female 5
male 3
other 0
2015-12-01 Bear Lake female 1
male 1
other 0
Elwood female 2
male 2
other 1
Levels for category variables
By default, Python will alphabetize the levels of a category variable. The order of the levels can be changed, for example, to change the order that groups are plotted in a plot, or which groups are at the top of the table.
Revisiting the Arthur and Buster example, we’ll explicitly change the category variables from type object to type category.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Date Instructor Student Gender County Education EdCode
2015-11-01 "Arthur Read" a female "Elwood" BA 3
2015-11-01 "Arthur Read" b female "Bear Lake" PHD 5
2015-11-01 "Arthur Read" c male "Elwood" BA 3
2015-11-01 "Arthur Read" d female "Elwood" MA 4
2015-11-01 "Arthur Read" e male "Elwood" HS 1
2015-11-01 "Arthur Read" f female "Bear Lake" MA 4
2015-11-01 "Arthur Read" g male "Elwood" HS 1
2015-11-01 "Arthur Read" h female "Elwood" BA 3
2015-11-01 "Arthur Read" i female "Elwood" BA 3
2015-11-01 "Arthur Read" j female "Elwood" BA 3
2015-12-01 "Buster Baxter" k male "Elwood" MA 4
2015-12-01 "Buster Baxter" l male "Bear Lake" MA 4
2015-12-01 "Buster Baxter" m female "Elwood" AA 2
2015-12-01 "Buster Baxter" n male "Elwood" AA 2
2015-12-01 "Buster Baxter" o other "Elwood" BA 3
2015-12-01 "Buster Baxter" p female "Elwood" BA 3
2015-12-01 "Buster Baxter" q female "Bear Lake" PHD 5
"""))
### Treat Date, Instructor, Gender, County, and Education as category type
Data['Date'] = Data['Date'].astype('category')
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Gender'] = Data['Gender'].astype('category')
Data['County'] = Data['County'].astype('category')
Data['Education'] = Data['Education'].astype('category')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
Below, note that the levels of the category variables were alphabetized by default. That is, even though Elwood was found in the data before Bear Lake, Python treats Bear Lake as the first level of the variable County.
print(Data['County'].cat.categories)
Index(['Bear Lake', 'Elwood'], dtype='object')
We can re-order the category levels in an order of our choosing.
print(Data['Education'].cat.categories)
Index(['AA', 'BA', 'HS', 'MA', 'PHD'], dtype='object')
EducationLevels = ["HS", "AA", "BA", "MA", "PHD"]
Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)
print(Data['Education'].cat.categories)
Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
We can also indicate a category variable should be treated as ordered categories.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['EducationOrdered'] = Data['Education'].astype(CategoricalDtype(
categories=EducationLevels,
ordered=True))
print(Data['EducationOrdered'])
Name: EducationOrdered, dtype: category
Categories (5, object): ['HS' < 'AA' < 'BA' < 'MA' < 'PHD']
Note that in the actions above, we are not changing the order of the rows in the data frame, simply which level is treated internally as the first category, and so on.
Optional analyses
Robust estimators: trimmed mean and Winsorized mean
In this example, the means and medians for Team A and Team B are identical (mean of 70 and median of 80). However, the trimmed mean for Team A is less than for Team B (77 vs. 80), and the Winsorized mean for A is less than for Team B (75 vs. 78). According to either the trimmed mean method or the Winsorized mean method, Team B had a higher score when extreme observations were removed or replaced.
TeamA = pd.Series([100, 90, 80, 60, 20])
TeamB = pd.Series([80, 80, 80, 80, 30])
print(np.mean(TeamA))
70.0
print(np.median(TeamA))
80.0
The trimmed mean can be calculated with the trim_mean function in SciPy stats. Here, both the largest 20% and smallest 20% of observations are trimmed from the array.
stats.trim_mean(TeamA, proportiontocut = 0.20)
76.66666666666667
The trimmed array can be displayed by subsetting the array.
TrimA = TeamA[(TeamA > np.percentile(TeamA, 20)) & (TeamA < np.percentile(TeamA, 80))]
print(TrimA)
1 90
2 80
3 60
print(np.mean(TrimA))
76.66666666666667
### The trimmed mean at 20%.
The clip function in NumPy winsorizes the data. Here, the lowest 20% of observations are replaced with the value of the 20th percentile, and the highest 20% of observations are replaced with the value of the 80th percentile. Different methods can be used for the calculation of the percentiles.
WinsorA = np.clip(TeamA, np.percentile(TeamA, 20), np.percentile(TeamA, 80))
print(WinsorA)
0 92
1 90
2 80
3 60
4 52
### This uses the same method for percentiles as does R by default.
print(np.mean(WinsorA))
74.8
### The winsorized mean at 20%.
scipy.stats.mstats.winsorize will winsorize the data with simpler syntax, but it doesn’t offer options for how the percentiles are calculated. So, here the result is different. For larger continuous samples, there won’t usually be large differences in the results of the two methods.
WinsorAAlt = mstats.winsorize(TeamA, limits=[0.2, 0.2])
print(WinsorAAlt)
[90 90 80 60 60]
print(np.mean(WinsorAAlt))
76.0
The following code replicates the analysis for Team B.
TeamB = pd.Series([80, 80, 80, 80, 30])
print(np.median(TeamB))
80.0
print(np.mean(TeamB))
70.0
stats.trim_mean(TeamB, proportiontocut = 0.20)
80.0
### Trimmed mean for Team B at 20%
WinsorB = np.clip(TeamB, np.percentile(TeamB, 20), np.percentile(TeamB, 80))
print(WinsorB)
0 80
1 80
2 80
3 80
4 70
print(np.mean(WinsorB))
78.0
### Winsorized mean for Team B at 20%
Geometric mean
The geometric mean can be calculated with the gmean function in SciPy stats. It can also be calculated with nested log, mean, and exp functions in NumPy.
Bacteria example
Bacteria = pd.Series([20, 40, 50, 60, 100, 120, 150, 200, 1000])
print(stats.gmean(Bacteria))
98.38886970015734
np.exp(np.mean(np.log(Bacteria)))
98.38886970015734
Annual return example
Note that for the annual return example, 1 must be added to all the values, and then subtracted at the end to return the result to a proportion of return.
Return = pd.Series([-0.80, 0.20, 0.30, 0.30, 0.20, 0.30])
print(stats.gmean(Return+1)-1)
-0.07344666385976306
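For comparison, the arithmetic mean of these returns is positive, which would misleadingly suggest an overall gain, even though an investment with these returns loses money overall.

print(Return.mean())

0.08333333333333333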
Harmonic mean
The harmonic mean can be calculated with the hmean function in SciPy stats. It can also be calculated by taking the inverse of each value, averaging these values, and reporting the inverse of this result.
Speed = pd.Series([20, 40, 50, 60, 100, 120, 150, 200, 1000])
print(stats.hmean(Speed))
63.08411214953271
print(1/np.mean(1/Speed))
63.08411214953271