Data Visualization - Week 3
Data management is the topic of this third week. It involves making decisions about the data that will help answer the research questions; the main guides for this are the codebook and the frequency distributions.
After watching this week's videos and examining the codebook and frequency distributions again (this time in detail), I found that I have to make some changes:
- Rename Variables
- Parsing Numeric Values
- Missing Values
- Dropping Missing Values
- Frequency Distributions
- Categorical Variables
- Data Dictionary
- Categorical Frequency Distributions
- Full Code
Rename Variables
For convenience and clarity in the code, I renamed the variables (lines 14-15).
14 # Rename columns for clarity
15 data1.columns = ['country','income','alcohol','life']
Line 23 shows information about the dataset.
# Show info about dataset
23 data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 4 columns):
country 213 non-null object
income 213 non-null object
alcohol 213 non-null object
life 213 non-null object
dtypes: object(4)
memory usage: 6.7+ KB
Parsing Numeric Values
The function pd.to_numeric converts its argument to a numeric type, and the parameter errors='coerce' sets invalid parsings (blanks, in this case) to the NaN value. In lines 26-29 this function is applied to each of the three numeric variables, and the dataset information is printed again.
26 for dt in ('income','alcohol','life') :
27     data1[dt] = pd.to_numeric(data1[dt], errors='coerce')
28
29 print (data1.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 4 columns):
country 213 non-null object
income 190 non-null float64
alcohol 187 non-null float64
life 191 non-null float64
dtypes: float64(3), object(1)
memory usage: 6.7+ KB
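The effect of errors='coerce' can be seen in a minimal sketch with toy data (the series below is hypothetical, standing in for a raw gapminder column):

```python
import pandas as pd

# A small series mixing valid numbers and a blank string,
# similar to the raw gapminder columns read as object dtype.
s = pd.Series(['3.5', '', '10'])

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising a ValueError.
parsed = pd.to_numeric(s, errors='coerce')
print(parsed.isnull().sum())  # the blank becomes NaN -> 1
```

Without errors='coerce', the blank entry would make pd.to_numeric raise an exception instead of quietly producing a NaN that dropna can remove later.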
Missing Values
While the first info() output may suggest that there are no missing data (no null objects), missing values do exist: they are the blank entries, which the conversion above turned into numpy NaN values. Lines 32-33 select the rows where any of the four variables is null, and the first ten rows of the result are shown below.
32 xdata1 = data1[data1.income.isnull() | data1.life.isnull() | data1.alcohol.isnull() | data1.country.isnull()]
33 print (tabulate(xdata1.head(10), headers=['index','Country','Income','Alcohol','Life']))
index Country Income Alcohol Life
------- ------------------- -------- --------- -------
0 Afghanistan nan 0.03 48.673
3 Andorra 21943.3 10.17 nan
5 Antigua and Barbuda 11894.5 8.17 nan
8 Aruba nan nan 75.246
20 Bermuda 62682.1 nan nan
34 Cayman Islands nan nan nan
43 Cook Islands nan 3.23 nan
52 Dominica 6147.78 8.68 nan
61 Faeroe Islands nan nan nan
65 French Polynesia nan nan 75.133
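The chained isnull() conditions on line 32 can be written more compactly with any(axis=1), as this minimal sketch on a toy frame shows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 2.0, 3.0]})

# isnull().any(axis=1) flags a row if at least one column is NaN,
# scaling to any number of columns without listing them with |.
missing = df[df.isnull().any(axis=1)]
print(len(missing))  # rows 0 and 1 each contain a NaN -> 2
```

Both forms select the same rows; any(axis=1) just avoids repeating each column name.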
Dropping Missing Values
Since the objective of this work is to investigate the impact of alcohol consumption on life expectancy, it makes no sense to keep entries with missing data, so in lines 35-37 these entries are dropped.
36 data1 = data1.dropna(axis=0, how='any')
37 print (data1.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 1 to 212
Data columns (total 4 columns):
country 171 non-null object
income 171 non-null float64
alcohol 171 non-null float64
life 171 non-null float64
dtypes: float64(3), object(1)
memory usage: 6.7+ KB
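A minimal sketch of how='any' on a toy frame (the values are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'country': ['A', 'B', 'C'],
                   'income': [100.0, np.nan, 300.0],
                   'life': [70.0, 65.0, np.nan]})

# how='any' drops a row if at least one of its values is NaN;
# how='all' would drop it only if every value were NaN.
clean = df.dropna(axis=0, how='any')
print(len(clean))  # only row 'A' is complete -> 1
```

This is the same call used on line 36, which is why the dataset shrinks from 213 to 171 entries.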
Frequency Distributions
Lines 40-68 calculate and print the absolute and relative frequencies; for clarity, only the first 5 values of each are shown.
40 freq_life_n = data1.life.value_counts(sort=False)
41 freq_income_n = data1.income.value_counts(sort=False)
42 freq_alcohol_n = data1.alcohol.value_counts(sort=False)
43
44 # Relative Frequency distributions
45 freq_life_r = data1.life.value_counts(sort=False, normalize=True)
46 freq_income_r = data1.income.value_counts(sort=False, normalize=True)
47 freq_alcohol_r = data1.alcohol.value_counts(sort=False, normalize=True)
48
49 print ('********************************************************')
50 print ('* Absolute Frequencies original variables (first 5) *')
51 print ('********************************************************')
52 print ('\nlife variable ('+LIFE+'):')
53 print ( tabulate([freq_life_n.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_life_n.index])) )
54 print ('\nincome variable ('+INCOME+'):')
55 print ( tabulate([freq_income_n.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_income_n.index])) )
56 print ('\nalcohol variable ('+ALCOHOL+'):')
57 print ( tabulate([freq_alcohol_n.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_alcohol_n.index])) )
58
59 print ('\n********************************************************')
60 print ('* Relative Frequencies original variables (first 5) *')
61 print ('********************************************************')
62
63 print ('\nlife variable ('+LIFE+'):')
64 print ( tabulate([freq_life_r.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_life_n.index])) )
65 print ('\nincome variable ('+INCOME+'):')
66 print ( tabulate([freq_income_r.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_income_n.index])) )
67 print ('\nalcohol variable ('+ALCOHOL+'):')
68 print ( tabulate([freq_alcohol_r.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_alcohol_r.index])) )
********************************************************
* Absolute Frequencies original variables (first 5) *
********************************************************
life variable (2011 life expectancy at birth (years)):
╒══════════╤══════════╤══════════╤══════════╤══════════╕
│ 74.788 │ 57.937 │ 69.317 │ 73.456 │ 77.005 │
╞══════════╪══════════╪══════════╪══════════╪══════════╡
│ 1 │ 1 │ 1 │ 1 │ 1 │
╘══════════╧══════════╧══════════╧══════════╧══════════╛
income variable (2010 Gross Domestic Product per capita in constant 2000 US$):
╒════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╕
│ 561.70858483 │ 5634.00394802 │ 22275.7516608 │ 772.933344801 │ 1975.55190594 │
╞════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╡
│ 1 │ 1 │ 1 │ 1 │ 1 │
╘════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╛
alcohol variable (2008 alcohol consumption per adult (liters, age 15+)):
╒══════════╤══════════╤══════════╤══════════╤══════════╕
│ 74.788 │ 57.937 │ 69.317 │ 73.456 │ 77.005 │
╞══════════╪══════════╪══════════╪══════════╪══════════╡
│ 1 │ 1 │ 1 │ 1 │ 1 │
╘══════════╧══════════╧══════════╧══════════╧══════════╛
********************************************************
* Relative Frequencies original variables (first 5) *
********************************************************
life variable (2011 life expectancy at birth (years)):
╒════════════╤════════════╤════════════╤════════════╤════════════╕
│ 74.788 │ 57.937 │ 69.317 │ 73.456 │ 77.005 │
╞════════════╪════════════╪════════════╪════════════╪════════════╡
│ 0.00584795 │ 0.00584795 │ 0.00584795 │ 0.00584795 │ 0.00584795 │
╘════════════╧════════════╧════════════╧════════════╧════════════╛
income variable (2010 Gross Domestic Product per capita in constant 2000 US$):
╒════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╕
│ 561.70858483 │ 5634.00394802 │ 22275.7516608 │ 772.933344801 │ 1975.55190594 │
╞════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╡
│ 0.00584795 │ 0.00584795 │ 0.00584795 │ 0.00584795 │ 0.00584795 │
╘════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╛
alcohol variable (2008 alcohol consumption per adult (liters, age 15+)):
╒════════════╤════════════╤════════════╤════════════╤════════════╕
│ 74.788 │ 57.937 │ 69.317 │ 73.456 │ 77.005 │
╞════════════╪════════════╪════════════╪════════════╪════════════╡
│ 0.00584795 │ 0.00584795 │ 0.00584795 │ 0.00584795 │ 0.00584795 │
╘════════════╧════════════╧════════════╧════════════╧════════════╛
Creating Categorical Variables
Now it is time to create the categorical variables. For this, I calculate the min and max values of each variable. The code and result are presented below (lines 71-89):
71 min_max = OrderedDict()
72 dict1 = OrderedDict()
73
74 dict1['min'] = data1.life.min()
75 dict1['max'] = data1.life.max()
76 min_max['life'] = dict1
77
78 dict2 = OrderedDict()
79 dict2['min'] = data1.income.min()
80 dict2['max'] = data1.income.max()
81 min_max['income'] = dict2
82
83 dict3 = OrderedDict()
84 dict3['min'] = data1.alcohol.min()
85 dict3['max'] = data1.alcohol.max()
86 min_max['alcohol'] = dict3
87
88 df = pd.DataFrame([min_max['income'],min_max['life'],min_max['alcohol']], index = ['Income','Life','Alcohol'])
89 print (tabulate(df.sort_index(axis=1, ascending=False), headers=['Var','Min','Max']))
Var Min Max
------- ------- ----------
Income 103.776 105147
Life 47.794 83.394
Alcohol 0.03 23.01
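As an aside, the per-variable OrderedDicts above can be replaced by a single agg call; a minimal sketch (the values below are just the min/max from the table, used as toy data):

```python
import pandas as pd

df = pd.DataFrame({'income': [103.776, 105147.0],
                   'life': [47.794, 83.394],
                   'alcohol': [0.03, 23.01]})

# agg(['min', 'max']) computes both statistics for every column at once;
# .T puts the variables on the rows, matching the table above.
summary = df.agg(['min', 'max']).T
print(summary.loc['life', 'max'])  # 83.394
```

This produces the same min/max table in two lines instead of building three dictionaries by hand.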
Data Dictionary
Based on min and max values, I created ranges for each variable (k=1000):
| Income | Life | Alcohol |
|---|---|---|
| 1: >=100 <5k | 1: >=40 <50 | 1: >=0.5 <5 |
| 2: >=5k <10k | 2: >=50 <60 | 2: >=5 <10 |
| 3: >=10k <20k | 3: >=60 <70 | 3: >=10 <15 |
| 4: >=20K <30K | 4: >=70 <80 | 4: >=15 <20 |
| 5: >=30K <40K | 5: >=80 <90 | 5: >=20 <25 |
| 6: >=40K <50K | NA | NA |
In lines 100-107 the new categorical variables are created with pd.cut, and after each step the first 10 rows of the dataset are shown.
100 data2['income'] = pd.cut(data1.income,[100,5000,10000,20000,30000,40000,50000], labels=['1','2','3','4','5','6'])
101 print (tabulate(data2.head(10),headers='keys'))
country income alcohol life
-- ---------- -------- --------- ------
1 Albania 1 7.29 76.918
2 Algeria 1 0.69 73.131
4 Angola 1 5.57 51.093
6 Argentina 3 9.35 75.901
7 Armenia 1 13.66 74.241
9 Australia 4 10.21 81.907
10 Austria 4 12.4 80.854
11 Azerbaijan 1 13.34 70.739
12 Bahamas 3 8.65 75.62
13 Bahrain 3 4.19 75.057
103 data2['life'] = pd.cut(data1.life,[40,50,60,70,80,90], labels=['1','2','3','4','5'])
104 print (tabulate(data2.head(10),headers='keys'))
country income alcohol life
-- ---------- -------- --------- ------
1 Albania 1 7.29 4
2 Algeria 1 0.69 4
4 Angola 1 5.57 2
6 Argentina 3 9.35 4
7 Armenia 1 13.66 4
9 Australia 4 10.21 5
10 Austria 4 12.4 5
11 Azerbaijan 1 13.34 4
12 Bahamas 3 8.65 4
13 Bahrain 3 4.19 4
106 data2['alcohol'] = pd.cut(data1.alcohol,[0.5,5,10,15,20,25], labels=['1','2','3','4','5'])
107 print (tabulate(data2.head(10),headers='keys'))
country income alcohol life
-- ---------- -------- --------- ------
1 Albania 1 2 4
2 Algeria 1 1 4
4 Angola 1 2 2
6 Argentina 3 2 4
7 Armenia 1 3 4
9 Australia 4 3 5
10 Austria 4 3 5
11 Azerbaijan 1 3 4
12 Bahamas 3 2 4
13 Bahrain 3 1 4
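A minimal sketch of how pd.cut assigns labels, using three life-expectancy values taken from the output above:

```python
import pandas as pd

life = pd.Series([48.673, 76.918, 81.907])

# By default pd.cut bins are right-closed: (40, 50], (50, 60], ...
cats = pd.cut(life, [40, 50, 60, 70, 80, 90],
              labels=['1', '2', '3', '4', '5'])
print(list(cats))  # ['1', '4', '5']
```

Note one subtlety: the data dictionary describes the bins as ">= lower < upper", but pd.cut's default is "> lower <= upper"; passing right=False would make the code match the dictionary exactly. With these data no observation falls on a bin edge, so the results are the same either way.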
Categorical Frequency Distributions
The next lines (109-137) show the absolute and relative frequency distributions for the new categorical variables.
109 # absolute Frequency distributions
110 freq_life_n = data2.life.value_counts(sort=False)
111 freq_income_n = data2.income.value_counts(sort=False)
112 freq_alcohol_n = data2.alcohol.value_counts(sort=False)
113
114 # Relative Frequency distributions
115 freq_life_r = data2.life.value_counts(sort=False, normalize=True)
116 freq_income_r = data2.income.value_counts(sort=False, normalize=True)
117 freq_alcohol_r = data2.alcohol.value_counts(sort=False, normalize=True)
118
119 print ('************************')
120 print ('* Absolute Frequencies *')
121 print ('************************')
122 print ('\nlife variable ('+LIFE+'):')
123 print( tabulate([freq_life_n], tablefmt="fancy_grid", headers=(life_map.values())))
124 print ('\nincome variable ('+INCOME+'):')
125 print( tabulate([freq_income_n], tablefmt="fancy_grid", headers=(income_map.values())))
126 print ('\nalcohol variable ('+ALCOHOL+'):')
127 print( tabulate([freq_alcohol_n], tablefmt="fancy_grid", headers=(alcohol_map.values())))
128
129 print ('\n************************')
130 print ('* Relative Frequencies *')
131 print ('************************')
132 print ('\nlife variable ('+LIFE+'):')
133 print( tabulate([freq_life_r], tablefmt="fancy_grid", headers=(life_map.values())))
134 print ('\nincome variable ('+INCOME+'):')
135 print( tabulate([freq_income_r], tablefmt="fancy_grid", headers=(income_map.values())))
136 print ('\nalcohol variable ('+ALCOHOL+'):')
137 print( tabulate([freq_alcohol_r], tablefmt="fancy_grid", headers=(alcohol_map.values())))
************************
* Absolute Frequencies *
************************
life variable (Expectancy life):
╒════════════╤════════════╤════════════╤════════════╤════════════╕
│ >=40 <50 │ >=50 <60 │ >=60 <70 │ >=70 <80 │ >=80 <90 │
╞════════════╪════════════╪════════════╪════════════╪════════════╡
│ 8 │ 28 │ 35 │ 80 │ 20 │
╘════════════╧════════════╧════════════╧════════════╧════════════╛
income variable (Income):
╒══════════════╤═════════════╤══════════════╤══════════════╤══════════════╤══════════════╕
│ >=100 <5k │ >=5k <10k │ >=10k <20k │ >=20K <30K │ >=30K <40K │ >=40K <50K │
╞══════════════╪═════════════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ 110 │ 24 │ 15 │ 12 │ 9 │ 0 │
╘══════════════╧═════════════╧══════════════╧══════════════╧══════════════╧══════════════╛
alcohol variable (Alcohol):
╒════════════╤═══════════╤════════════╤════════════╤════════════╕
│ >=0.5 <5 │ >=5 <10 │ >=10 <15 │ >=15 <20 │ >=20 <25 │
╞════════════╪═══════════╪════════════╪════════════╪════════════╡
│ 63 │ 56 │ 31 │ 10 │ 1 │
╘════════════╧═══════════╧════════════╧════════════╧════════════╛
************************
* Relative Frequencies *
************************
life variable (Expectancy life):
╒════════════╤════════════╤════════════╤════════════╤════════════╕
│ >=40 <50 │ >=50 <60 │ >=60 <70 │ >=70 <80 │ >=80 <90 │
╞════════════╪════════════╪════════════╪════════════╪════════════╡
│ 0.0467836 │ 0.163743 │ 0.204678 │ 0.467836 │ 0.116959 │
╘════════════╧════════════╧════════════╧════════════╧════════════╛
income variable (Income):
╒══════════════╤═════════════╤══════════════╤══════════════╤══════════════╤══════════════╕
│ >=100 <5k │ >=5k <10k │ >=10k <20k │ >=20K <30K │ >=30K <40K │ >=40K <50K │
╞══════════════╪═════════════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ 0.647059 │ 0.141176 │ 0.0882353 │ 0.0705882 │ 0.0529412 │ 0 │
╘══════════════╧═════════════╧══════════════╧══════════════╧══════════════╧══════════════╛
alcohol variable (Alcohol):
╒════════════╤═══════════╤════════════╤════════════╤════════════╕
│ >=0.5 <5 │ >=5 <10 │ >=10 <15 │ >=15 <20 │ >=20 <25 │
╞════════════╪═══════════╪════════════╪════════════╪════════════╡
│ 0.391304 │ 0.347826 │ 0.192547 │ 0.0621118 │ 0.00621118 │
╘════════════╧═══════════╧════════════╧════════════╧════════════╛
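One useful property visible in the income table above: value_counts on a categorical column reports every category, even empty ones such as the >=40K <50K bin. A minimal sketch with toy data:

```python
import pandas as pd

# Three values binned into three labeled categories; nothing
# falls into the middle bin (5, 10].
cats = pd.cut(pd.Series([1.0, 2.0, 12.0]), [0, 5, 10, 15],
              labels=['low', 'mid', 'high'])

# value_counts on a Categorical keeps every category,
# reporting 0 for empty bins instead of dropping them.
counts = cats.value_counts(sort=False)
print(counts['mid'])  # no value fell in (5, 10] -> 0
```

This is why the 0 appears for the highest income category instead of that column vanishing from the table.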
Full Code
1 # coding: utf-8
2 import pandas as pd
3 import numpy as np
4 from collections import OrderedDict
5 from tabulate import tabulate, tabulate_formats
6
7 # bug fix for display formats to avoid run time errors
8 pd.set_option('display.float_format', lambda x:'%f'%x)
9
10 # Load from CSV
11 data1 = pd.read_csv('gapminder.csv', skip_blank_lines=True,
12 usecols=['country','incomeperperson', 'alcconsumption', 'lifeexpectancy'])
13
14 # Rename columns for clarity
15 data1.columns = ['country','income','alcohol','life']
16
17 # Variables Descriptions
18 ALCOHOL = "2008 alcohol consumption per adult (liters, age 15+)"
19 INCOME = "2010 Gross Domestic Product per capita in constant 2000 US$"
20 LIFE = "2011 life expectancy at birth (years)"
21
22 # Show info about dataset
23 data1.info()
24
25 # converting to numeric values and parsing
26 for dt in ('income','alcohol','life') :
27     data1[dt] = pd.to_numeric(data1[dt], errors='coerce')
28
29 print (data1.info())
30
31 # Missing values, just for curiosity
32 xdata1 = data1[data1.income.isnull() | data1.life.isnull() | data1.alcohol.isnull() | data1.country.isnull()]
33 print (tabulate(xdata1.head(10), headers=['index','Country','Income','Alcohol','Life']))
34
35 # remove rows with nan values
36 data1 = data1.dropna(axis=0, how='any')
37 print (data1.info())
38
39 # absolute Frequency distributions
40 freq_life_n = data1.life.value_counts(sort=False)
41 freq_income_n = data1.income.value_counts(sort=False)
42 freq_alcohol_n = data1.alcohol.value_counts(sort=False)
43
44 # Relative Frequency distributions
45 freq_life_r = data1.life.value_counts(sort=False, normalize=True)
46 freq_income_r = data1.income.value_counts(sort=False, normalize=True)
47 freq_alcohol_r = data1.alcohol.value_counts(sort=False, normalize=True)
48
49 print ('********************************************************')
50 print ('* Absolute Frequencies original variables (first 5) *')
51 print ('********************************************************')
52 print ('\nlife variable ('+LIFE+'):')
53 print ( tabulate([freq_life_n.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_life_n.index])) )
54 print ('\nincome variable ('+INCOME+'):')
55 print ( tabulate([freq_income_n.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_income_n.index])) )
56 print ('\nalcohol variable ('+ALCOHOL+'):')
57 print ( tabulate([freq_alcohol_n.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_alcohol_n.index])) )
58
59 print ('\n********************************************************')
60 print ('* Relative Frequencies original variables (first 5) *')
61 print ('********************************************************')
62
63 print ('\nlife variable ('+LIFE+'):')
64 print ( tabulate([freq_life_r.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_life_n.index])) )
65 print ('\nincome variable ('+INCOME+'):')
66 print ( tabulate([freq_income_r.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_income_n.index])) )
67 print ('\nalcohol variable ('+ALCOHOL+'):')
68 print ( tabulate([freq_alcohol_r.head(5)], tablefmt="fancy_grid", headers=([i for i in freq_alcohol_r.index])) )
69
70 # Min and Max continuous variables:
71 min_max = OrderedDict()
72 dict1 = OrderedDict()
73
74 dict1['min'] = data1.life.min()
75 dict1['max'] = data1.life.max()
76 min_max['life'] = dict1
77
78 dict2 = OrderedDict()
79 dict2['min'] = data1.income.min()
80 dict2['max'] = data1.income.max()
81 min_max['income'] = dict2
82
83 dict3 = OrderedDict()
84 dict3['min'] = data1.alcohol.min()
85 dict3['max'] = data1.alcohol.max()
86 min_max['alcohol'] = dict3
87
88 df = pd.DataFrame([min_max['income'],min_max['life'],min_max['alcohol']], index = ['Income','Life','Alcohol'])
89 print (tabulate(df.sort_index(axis=1, ascending=False), headers=['Var','Min','Max']))
90
91
92 data2 = data1.copy()
93
94 # Maps
95 income_map = {1: '>=100 <5k', 2: '>=5k <10k', 3: '>=10k <20k',
96 4: '>=20K <30K', 5: '>=30K <40K', 6: '>=40K <50K' }
97 life_map = {1: '>=40 <50', 2: '>=50 <60', 3: '>=60 <70', 4: '>=70 <80', 5: '>=80 <90'}
98 alcohol_map = {1: '>=0.5 <5', 2: '>=5 <10', 3: '>=10 <15', 4: '>=15 <20', 5: '>=20 <25'}
99
100 data2['income'] = pd.cut(data1.income,[100,5000,10000,20000,30000,40000,50000], labels=['1','2','3','4','5','6'])
101 print (tabulate(data2.head(10),headers='keys'))
102
103 data2['life'] = pd.cut(data1.life,[40,50,60,70,80,90], labels=['1','2','3','4','5'])
104 print (tabulate(data2.head(10),headers='keys'))
105
106 data2['alcohol'] = pd.cut(data1.alcohol,[0.5,5,10,15,20,25], labels=['1','2','3','4','5'])
107 print (tabulate(data2.head(10),headers='keys'))
108
109 # absolute Frequency distributions
110 freq_life_n = data2.life.value_counts(sort=False)
111 freq_income_n = data2.income.value_counts(sort=False)
112 freq_alcohol_n = data2.alcohol.value_counts(sort=False)
113
114 # Relative Frequency distributions
115 freq_life_r = data2.life.value_counts(sort=False, normalize=True)
116 freq_income_r = data2.income.value_counts(sort=False, normalize=True)
117 freq_alcohol_r = data2.alcohol.value_counts(sort=False, normalize=True)
118
119 print ('************************')
120 print ('* Absolute Frequencies *')
121 print ('************************')
122 print ('\nlife variable ('+LIFE+'):')
123 print( tabulate([freq_life_n], tablefmt="fancy_grid", headers=(life_map.values())))
124 print ('\nincome variable ('+INCOME+'):')
125 print( tabulate([freq_income_n], tablefmt="fancy_grid", headers=(income_map.values())))
126 print ('\nalcohol variable ('+ALCOHOL+'):')
127 print( tabulate([freq_alcohol_n], tablefmt="fancy_grid", headers=(alcohol_map.values())))
128
129 print ('\n************************')
130 print ('* Relative Frequencies *')
131 print ('************************')
132 print ('\nlife variable ('+LIFE+'):')
133 print( tabulate([freq_life_r], tablefmt="fancy_grid", headers=(life_map.values())))
134 print ('\nincome variable ('+INCOME+'):')
135 print( tabulate([freq_income_r], tablefmt="fancy_grid", headers=(income_map.values())))
136 print ('\nalcohol variable ('+ALCOHOL+'):')
137 print( tabulate([freq_alcohol_r], tablefmt="fancy_grid", headers=(alcohol_map.values())))