statistics program
week 4. Statistical Inference _ 현주 본문

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2013.csv",encoding='latin-1')
data['Date'] = '2013-'+data['Mes de any'].apply(lambda x : str(x)) + '-' + data['Dia de mes'].apply(lambda x : str(x))
data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()
print(accidents.mean())
25.90958904109589
In [ ]:
# population
df = accidents.to_frame()
N_test = 10000
elements = 200
# mean array of samples
means = [0] * N_test
# sample generation
for i in range(N_test):
rows = np.random.choice(df.index.values, elements)
sampled_df = df.loc[rows]
means[i] = sampled_df.mean()
- 표본 추출을 통한 표준 오차 추출
In [ ]:
import math
rows = np.random.choice(df.index.values, 200)
sampled_df = df.loc[rows]
est_sigma_mean = sampled_df.std()/math.sqrt(200)
print('Direct estimation of SE from one sample of 200 elements:', est_sigma_mean[0])
print('Estimation of the SE by simulating 10000 samples of 200 elements:', np.array(means).std())
Direct estimation of SE from one sample of 200 elements: 0.6549076424202902
Estimation of the SE by simulating 10000 samples of 200 elements: 0.643482902958579
- 부트스트랩 방법을 사용하여 데이터의 평균을 추정하는 과정을 구현하고, 이를 통해 데이터에 대한 평균 값을 예측
In [ ]:
def meanBootstrap(X, numberb):
x = [0]*numberb
for i in range(numberb):
sample = [X[j] for j in np.random.randint(len(X), size=len(X))]
x[i] = np.mean(sample)
return x
m = meanBootstrap(accidents, 10000)
print("Mean estimate:", np.mean(m))
Mean estimate: 25.91052876712329
In [ ]:
m = accidents.mean()
se = accidents.std() / math.sqrt(len(accidents))
ci = [m - se*1.96, m + se*1.96]
print("Confidence interval: ", ci)
Confidence interval: [24.975156065800284, 26.8440220163915]
In [ ]:
m = meanBootstrap(accidents, 10000)
sample_mean = np.mean(m)
sample_se = np.std(m)
print("Mean estimate:", sample_mean)
print("SE of the estimate:", sample_se)
ci = [np.percentile(m, 2.5), np.percentile(m, 97.5)]
print("Confidence interval:", ci)
Mean estimate: 25.910285753424656
SE of the estimate: 0.47678386671159984
Confidence interval: [24.964383561643835, 26.83568493150685]
In [ ]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2010.csv",encoding='latin-1')
#Create a new column which is the date
data['Date'] = data['Dia de mes'].apply(lambda x: str(x)) + '-' + data['Mes de any'].apply(lambda x: str(x))
data2 = data['Date']
counts2010 = data['Date'].value_counts()
print('2010: Mean', counts2010.mean())
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2013.csv",encoding='latin-1')
#Create a new column which is the date
data['Date'] = data['Dia de mes'].apply(lambda x: str(x)) + '-' + data['Mes de any'].apply(lambda x: str(x))
data2 = data['Date']
counts2013 = data['Date'].value_counts()
print('2013: Mean', counts2013.mean())
2010: Mean 24.81095890410959
2013: Mean 25.90958904109589
In [ ]:
n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt(n)]
print('2010 accident rate estimate: ', counts2010.mean())
print('2013 accident rate estimate: ', counts2013.mean())
print('CI for 2013: ',ci)
2010 accident rate estimate: 24.81095890410959
2013 accident rate estimate: 25.90958904109589
CI for 2013: [24.975156065800284, 26.8440220163915]
In [ ]:
m = len(counts2010)
n = len(counts2013)
p = (counts2013.mean() - counts2010.mean())
print('m: ', m, 'n: ',n)
print('mean difference: ', p)
m: 365 n: 365
mean difference: 1.0986301369863014
In [ ]:
# pooling distributions
x = counts2010
y = counts2013
pool = np.concatenate([x, y])
np.random.shuffle(pool)
#sample generation
import random
N = 10000 #number of sample
diff = []
for i in range(N):
p1 = [random.choice(pool) for _ in range(n)]
p2 = [random.choice(pool) for _ in range(n)]
diff.append(np.mean(p1) - np.mean(p2))
In [ ]:
#counting differences larger than ths observed one
diff2 = np.array(diff)
w1 = np.where(diff2 > p) [0]
print('p-value (Simulation)= ', len(w1)/float(N), '(', len(w1)/float(N)*100,'%)', 'Difference =', p)
if (len(w1)/float(N)) <0.05:
print('The effect is likely')
else:
print('The effect is not likely')
p-value (Simulation)= 0.0461 ( 4.61 %) Difference = 1.0986301369863014
The effect is likely
'학습 정리 > 따봉콩쥐야고마워' 카테고리의 다른 글
| week 5. Regression Analysis _ 현주 (0) | 2024.06.29 |
|---|---|
| week 3. Descriptive Statistics _ 현주 (0) | 2024.06.29 |
| week 2. Toolboxes for Data Scientists _ 현주 (0) | 2024.06.29 |
| week 1 . 초기 환경 설정 _ 현주 (0) | 2024.05.23 |