# KMeans example with the Titanic dataset

We use the Titanic dataset as a clustering example.
The idea is to organize the passengers into several groups based on shared characteristics. To do so, we use several columns of the dataset, in particular:

- Survived: survival (0 = No, 1 = Yes)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: sex
- Age: age in years
- SibSp: number of siblings / spouses aboard the Titanic
- Parch: number of parents / children aboard the Titanic


In [50]:
import pandas as pnd

from sklearn.cluster import KMeans


In [51]:
df_titanic = pnd.read_csv('titanic.csv', delimiter=',', header=[0])

df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Data cleaning

We start by cleaning the data: first we drop the columns we will not use, then we fill in the missing values.

In [52]:
df_titanic.drop(columns=['PassengerId','Name','Ticket', 'Fare','Cabin','Embarked'],
                inplace=True)

# fill missing ages with the mean age; assigning back avoids a chained in-place fillna
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())

df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 41.9+ KB


In [53]:
df_titanic.sample(15)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
437,1,2,female,24.0,2,3
343,0,2,male,25.0,0,0
226,1,2,male,19.0,0,0
383,1,1,female,35.0,1,0
309,1,1,female,30.0,0,0
119,0,3,female,2.0,4,2
607,1,1,male,27.0,0,0
41,0,2,female,27.0,1,0
310,1,1,female,24.0,0,0
249,0,2,male,54.0,1,0


The 'Sex' column still has the "object" dtype, so it has to be encoded to convert it from categorical to numeric.

In [54]:
df_titanic = pnd.get_dummies(df_titanic, columns=['Sex'], drop_first=True)

df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Age       891 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Sex_male  891 non-null    uint8  
dtypes: float64(1), int64(4), uint8(1)
memory usage: 35.8 KB
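As a minimal sketch of what this encoding does, the toy frame below (made-up values, not rows from the dataset) is passed through the same `get_dummies` call. The `dtype=int` argument is an addition here so the indicator comes out as 0/1 regardless of the pandas version; the notebook itself does not pass it.

```python
import pandas as pnd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pnd.DataFrame({"Sex": ["male", "female", "female", "male"],
                     "Age": [22.0, 38.0, 26.0, 35.0]})

# drop_first=True drops the first category in sorted order ("female"),
# leaving a single indicator column named Sex_male.
encoded = pnd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded["Sex_male"].tolist())
```

With only two categories, one indicator column carries all the information, which is why `drop_first=True` is a reasonable choice here.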


The *'SibSp'* and *'Parch'* columns give the number of siblings/spouses and of parents/children travelling with a passenger. We combine these values into a single new column, *'FamilyNb'*. We also create an *'Alone'* column that flags passengers travelling alone.

In [55]:
df_titanic['FamilyNb'] = df_titanic['SibSp'] + df_titanic['Parch']
df_titanic['Alone'] = ( df_titanic['FamilyNb'] == 0)

df_titanic.drop(columns=['SibSp','Parch'],inplace=True)

df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Age       891 non-null    float64
 3   Sex_male  891 non-null    uint8  
 4   FamilyNb  891 non-null    int64  
 5   Alone     891 non-null    bool   
dtypes: bool(1), float64(1), int64(3), uint8(1)
memory usage: 29.7 KB


In [56]:
df_titanic.sample(15)

Unnamed: 0,Survived,Pclass,Age,Sex_male,FamilyNb,Alone
699,0,3,42.0,1,0,True
206,0,3,32.0,1,1,False
592,0,3,47.0,1,0,True
817,0,2,31.0,1,2,False
430,1,1,28.0,1,0,True
762,1,3,20.0,1,0,True
427,1,2,19.0,0,0,True
393,1,1,23.0,0,1,False
31,1,1,29.699118,0,1,False
751,1,3,6.0,1,1,False


In [57]:
df_titanic.describe()

Unnamed: 0,Survived,Pclass,Age,Sex_male,FamilyNb
count,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.647587,0.904602
std,0.486592,0.836071,13.002015,0.47799,1.613459
min,0.0,1.0,0.42,0.0,0.0
25%,0.0,2.0,22.0,0.0,0.0
50%,0.0,3.0,29.699118,1.0,0.0
75%,1.0,3.0,35.0,1.0,1.0
max,1.0,3.0,80.0,1.0,10.0


## Building the model

We first build a model with ***k=4*** and then try to interpret the resulting clusters.

In [58]:
km = KMeans(n_clusters=4, random_state=42)

km.fit(df_titanic)

print(km.inertia_)

21081.28798483158
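The inertia printed above is mostly useful when compared across several values of k. The sketch below shows one possible way to make that comparison. It runs on synthetic data (three made-up blobs), not on `df_titanic`, and the range of k values and the `n_init=10` setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the cleaned feature table: three well-separated blobs.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in blob_centers])

inertias = {}
for k in range(1, 7):
    # n_init is set explicitly because its default changed across scikit-learn versions.
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# Inertia falls sharply until k reaches the number of blobs, then flattens out.
for k, value in inertias.items():
    print(k, round(value, 1))
```

On the real data, the same loop could be run with `df_titanic` in place of `X`. The k at which the curve stops dropping sharply is a common heuristic for choosing k, not a guarantee.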


In [59]:
print (df_titanic.columns)
print (km.cluster_centers_)

Index(['Survived', 'Pclass', 'Age', 'Sex_male', 'FamilyNb', 'Alone'], dtype='object')
[[3.60000000e-01 2.48000000e+00 2.08400000e+01 6.20000000e-01
  7.44000000e-01 6.32000000e-01]
 [3.73239437e-01 1.71126761e+00 5.16408451e+01 6.90140845e-01
  6.47887324e-01 6.05633803e-01]
 [5.79710145e-01 2.63768116e+00 4.77057971e+00 5.36231884e-01
  3.27536232e+00 2.89855072e-02]
 [3.69767442e-01 2.35348837e+00 3.16040554e+01 6.67441860e-01
  7.02325581e-01 6.76744186e-01]]
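The raw array above is hard to read next to a separate list of column names. One option, sketched here on a small made-up table rather than on `df_titanic`, is to wrap `cluster_centers_` in a DataFrame so that each value carries its feature name. The names `toy` and `km_toy` are hypothetical.

```python
import pandas as pnd
from sklearn.cluster import KMeans

# Small made-up table reusing two of the notebook's column names.
toy = pnd.DataFrame({"Age": [2.0, 4.0, 30.0, 32.0, 60.0, 62.0],
                     "FamilyNb": [3, 4, 0, 1, 0, 1]})

km_toy = KMeans(n_clusters=3, n_init=10, random_state=42).fit(toy)

# Wrap the raw centre array in a DataFrame: one row per cluster, one column per feature.
centers = pnd.DataFrame(km_toy.cluster_centers_, columns=toy.columns)
centers.index.name = "cluster"

print(centers.round(2))
```

In the notebook, the equivalent would be built from `km.cluster_centers_` and the columns the model was fitted on. This has to be done before any extra column is added to `df_titanic`, otherwise the column list no longer matches the centres.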


## Interpretation 

We now try to interpret the clusters, looking in particular at the mean value of each feature per cluster using `describe()`.

In [60]:
# add the cluster labels to our DataFrame
df_titanic['labels'] = km.labels_

df_titanic.groupby('labels').describe().transpose()

Unnamed: 0,labels,0,1,2,3
Survived,count,250.0,142.0,69.0,430.0
Survived,mean,0.36,0.373239,0.57971,0.369767
Survived,std,0.480963,0.485377,0.497222,0.483304
Survived,min,0.0,0.0,0.0,0.0
Survived,25%,0.0,0.0,0.0,0.0
Survived,50%,0.0,0.0,1.0,0.0
Survived,75%,1.0,1.0,1.0,1.0
Survived,max,1.0,1.0,1.0,1.0
Pclass,count,250.0,142.0,69.0,430.0
Pclass,mean,2.48,1.711268,2.637681,2.353488


### Observations
For example, cluster 2 corresponds to young passengers (mean age about 4.8 years, ± 3 years) travelling with family (mean `FamilyNb` of about 3.3), mostly in 2nd and 3rd class (mean `Pclass` of 2.64), a majority of whom survived (mean `Survived` of 0.58).
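A reading like this only needs the mean, the spread and the size of each cluster, so a more compact summary than the full `describe()` output may be easier to scan. The sketch below uses made-up rows that only mimic the notebook's column names. The choice of `mean` and `std` as aggregates is an assumption about what is worth showing.

```python
import pandas as pnd

# Made-up rows mimicking the notebook's columns after clustering.
toy = pnd.DataFrame({
    "labels":   [0, 0, 2, 2, 2],
    "Age":      [20.0, 22.0, 3.0, 5.0, 7.0],
    "FamilyNb": [0, 1, 3, 4, 2],
    "Survived": [0, 1, 1, 1, 0],
})

# Mean and standard deviation of every feature, per cluster.
summary = toy.groupby("labels").agg(["mean", "std"])

# Number of passengers in each cluster.
sizes = toy.groupby("labels").size()

print(summary.round(2))
print(sizes)
```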