Pandas dataset analysis example

Let’s show how can we use Pandas for data analysis. Let’s take a cardio dataset from here: https://github.com/iryndin/Machine-Learning/tree/master/data/01-cardio

Task 0. Import dataset

First, import necessary libs and read the data:

import pandas as pd
import numpy as np

df = pd.read_csv('cardio.csv', sep=';', index_col='id')

Now we have DataFrame (df), which is one of main entities Pandas work with.

This dataset contains following fields:

Field Description
id Patient ID
age Age in days
gender Male or female
age Age in days
height Height in cm
weight Weight in kg
ap_hi Arterial Pressure (high)
ap_lo Arterial Pressure (low)
cholesterol ?
gluc ?
smoke Smoking or not
alco Drinks alco or not
active ?
cardio Has cardio probem or not

Task 1. Count males and females

Let’s count total number of males and females. OK, we have a column gender which contains values 1 and 2. But which of them is male, and which is female? It is not clear.

We can determine this using reasonable assumption that males are in average higher than females. So, we need to filter those rows which have gender=1 and calculate mean height for them. Then we should calculate mean height value for rows with gender=2. Then we should check which of values if bigger, and hence assume that those are for males. Let’s do this:

>>> df[df['gender'] == 1]['height'].mean()
161.35561168460356
>>> df[df['gender'] == 2]['height'].mean()
169.94789538210054

Here we see that for rows with gender=2 mean height is greater that for rows with gender=1. So we can conclude that gender=2 is for males, while gender=1 is for females. Let’s create separate dataframes bot for males and females:

males = df[df['gender'] == 2]
females = df[df['gender'] == 1]

Now let’s answer this task’s question about number of males and females:

>>> males['age'].count()
24470
>>> females['age'].count()
45530

So the answer to Task 1 is:

  • 24470 males
  • 45530 females

Task 2. Calculate who drinks alcohol less (in average): males or females

Alcohol usage column is alco and it contains values 1 (in use) and 0 (not in use). So we simply need to calculate a portion of alco drinkes for males and females and compare these values.

We can do it in two ways:

  • filter by alco column value and calculate proportion
  • simply sum alco column values and divide by total number of males/females respectively.

Let’s do both. First one:

>>> males[males['alco']==1]['age'].count()/float(males['age'].count())
0.10637515324887617
>>> females[females['alco']==1]['age'].count()/float(females['age'].count())
0.025499670546892159

Second approach:

>>> males['alco'].sum()/float(males['age'].count())
0.10637515324887617
>>> females['alco'].sum()/float(females['age'].count())
0.025499670546892159

We see that portion of males drinking alco is 10.6%, while for females it is 2.5%.

So the answer to Task 2 is: females drink alco less.

Task 3. Calculate how many times greater percentage of smoking males is compared to that of females?

Let’s calculate this:

>>> smoking_males_percentage = 100*males[males['smoke']==1]['age'].count()/float(males['age'].count())
>>> smoking_females_percentage = 100*females[females['smoke']==1]['age'].count()/float(females['age'].count())
>>> round(smoking_males_percentage/smoking_females_percentage)
12.0

Answer to Task 3 is: 12

Task 4. In what units age is measured here? What is difference (in months) between median age of smokers and non-smokers?

OK, let’s look at age stats:

>>> df['age'].describe()
count    70000.000000
mean     19468.865814
std       2467.251667
min      10798.000000
25%      17664.000000
50%      19703.000000
75%      21327.000000
max      23713.000000
Name: age, dtype: float64

Ok, this is not similar to years, and even to months. May be in days? Let’s check:

>>> df['age'].mean()/365
53.339358395303321

This is better. So age here is measured in days. Now let’s calculate difference (in months) between median age of smokers and non-smokers:

>>> round((df[df['smoke']==0]['age'].median() - df[df['smoke']==1]['age'].median())/30)
20.0

The answer for Task 4 is:

  • age here is in days
  • difference (in months) between median age of smokers and non-smokers: 20

Task 5. Create 2 special datasets from initial one and calculate in how many times portion of sick people in one dataset is greater than that of another dataset?

Look at Wikipedia article about HeartScore, it gives clarification to following actions.

Do following:

  • Create new column - age_years, round its values to integers. Then for this task take only smoking males aged 60 to 64 (inclusive).
  • In the dataset cholesterol level 1 is equal to 4 mmol/liter, 2 -> 5-7 mmol/liter, 3 maps to 8 mmol/liter.
  • Create 2 datasets out of smoking males 60-64: ap_hi < 120 and cholesterol=1 , another: 160 <= ap_hi < 180 and cholesterol=3
  • For given 2 datasets calculate in how many times portion of cardio-sick people in one dataset is greater than that of another dataset

Do this:

>>> males['age_years'] = males.apply(lambda row: int(round(row.age/365.0)), axis=1)
>>> smoking_males = males[(males['smoke']==1) & (males['age_years']>=60) & (males['age_years']<65)]
>>> smoking_males_1 = smoking_males[(smoking_males['cholesterol']==1) & (smoking_males['ap_hi']<120)] 
>>> smoking_males_2 = smoking_males[(smoking_males['cholesterol']==3) & (smoking_males['ap_hi']<180) & (smoking_males['ap_hi']>=160)] 
>>> cardio_percentage_1 = 100*smoking_males_1[smoking_males_1['cardio']==1]['age'].count()/float(smoking_males_1['age'].count())
>>> cardio_percentage_2 = 100*smoking_males_2[smoking_males_2['cardio']==1]['age'].count()/float(smoking_males_2['age'].count())
>>> round(cardio_percentage_2/cardio_percentage_1)
3

Answer to Task 5: 3

Task 6. Assess some BMI statements

BMI (Body Mass index) is calculated as weight in kg divided by square of height in meters. Normal BMI values are between 18.5 and 25.

Assess following statements (is it true of false):

  • Median BMI is greater than its normal values
  • Females BMI is in average less than males BMI
  • Cardio-healthy persons have in average greater BMI than cardio-sick ones
  • For cardio-healthy non-alco males BMI values in average are closer to norm than for cardio-healthy non-alco females

Solution:

>>> df['bmi'] = df.apply(lambda row: row.weight/((row.height**2)/10000.0), axis=1)
>>> df['bmi'].median()
26.374068120774975
>>> df[df['gender']==1]['bmi'].mean()
27.98758344183285
>>> df[df['gender']==2]['bmi'].mean()
26.754442357289349
>>> df[df['cardio']==0]['bmi'].mean()
26.548175206794504
>>> df[df['cardio']==1]['bmi'].mean()
28.56606062701535
>>> df[(df['cardio']==0) & (df['alco']==0) & (df['gender']==2)]['bmi'].mean()
25.872638075460163
>>> df[(df['cardio']==0) & (df['alco']==0) & (df['gender']==1)]['bmi'].mean()
26.845406594131507

Answer to Task 6:

  • Median BMI is greater than its normal values. TRUE. (Because 26.374068120774975 > 25)
  • Females BMI is in average less than males BMI. FALSE (Because 27.98758344183285 > 26.754442357289349)
  • Cardio-healthy persons have in average greater BMI than cardio-sick ones. FALSE. (Because 26.548175206794504 < 28.56606062701535)
  • For cardio-healthy non-alco males BMI values in average are closer to norm than for cardio-healthy non-alco females. TRUE. (Because 25.872638075460163 is closer to 25 than 26.845406594131507)

Task 7. Clean up data.

Filter data according to the following (throw out rows matching following criteria):

  • ap_lo > ap_hi
  • height is less than 2.5 percentile or greater than 97.5 percentile
  • weight is less than 2.5 percentile or greater than 97.5 percentile

How many percents of data was thrown out?

>>> dfclean = df[df['ap_lo'] <= df['ap_hi']]
>>> h025 = df['height'].quantile(0.025)
>>> h975 = df['height'].quantile(0.975)
>>> dfclean = dfclean[(dfclean['height'] >= h025) & (dfclean['height'] <= h975)]
>>> w025 = df['weight'].quantile(0.025)
>>> w975 = df['weight'].quantile(0.975)
>>> dfclean = dfclean[(dfclean['weight'] >= w025) & (dfclean['weight'] <= w975)]
>>> size = df['age'].count()
>>> sizeclean = dfclean['age'].count()
>>> round(100.0*(1 - sizeclean/float(size)))
10

Answer to Task 7: 10