Notice
Recent Posts
Recent Comments
Link
«   2026/06   »
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30
Tags
more
Archives
Today
Total
관리 메뉴

statistics program

week 4. Statistical Inference _ 현주 본문

학습 정리/따봉콩쥐야고마워

week 4. Statistical Inference _ 현주

따봉콩쥐야고마워 2024. 6. 29. 19:17

 

ch04.ipynb
0.01MB

 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2013.csv",encoding='latin-1')
data['Date'] = '2013-'+data['Mes de any'].apply(lambda x : str(x)) + '-' +  data['Dia de mes'].apply(lambda x : str(x))
data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()
print(accidents.mean())
25.90958904109589
In [ ]:
# population
df = accidents.to_frame()
N_test = 10000
elements = 200

# mean array of samples
means = [0] * N_test

# sample generation
for i in range(N_test):
    rows = np.random.choice(df.index.values, elements)
    sampled_df = df.loc[rows]
    means[i] = sampled_df.mean()
  • 표본 추출을 통한 표준 오차 추출
In [ ]:
import math

rows = np.random.choice(df.index.values, 200)
sampled_df = df.loc[rows]
est_sigma_mean = sampled_df.std()/math.sqrt(200)
print('Direct estimation of SE from one sample of 200 elements:', est_sigma_mean[0])
print('Estimation of the SE by simulating 10000 samples of 200 elements:', np.array(means).std())
Direct estimation of SE from one sample of 200 elements: 0.6549076424202902
Estimation of the SE by simulating 10000 samples of 200 elements: 0.643482902958579
  • 부트스트랩 방법을 사용하여 데이터의 평균을 추정하는 과정을 구현하고, 이를 통해 데이터에 대한 평균 값을 예측
In [ ]:
def meanBootstrap(X, numberb):
  x = [0]*numberb
  for i in range(numberb):
      sample = [X[j] for j in np.random.randint(len(X), size=len(X))]
      x[i] = np.mean(sample)
  return x

m = meanBootstrap(accidents, 10000)
print("Mean estimate:", np.mean(m))
Mean estimate: 25.91052876712329
In [ ]:
m = accidents.mean()
se = accidents.std() / math.sqrt(len(accidents))
ci = [m - se*1.96, m + se*1.96]
print("Confidence interval: ", ci)
Confidence interval:  [24.975156065800284, 26.8440220163915]
In [ ]:
m = meanBootstrap(accidents, 10000)
sample_mean = np.mean(m)
sample_se = np.std(m)

print("Mean estimate:", sample_mean)
print("SE of the estimate:", sample_se)

ci = [np.percentile(m, 2.5), np.percentile(m, 97.5)]
print("Confidence interval:", ci)
Mean estimate: 25.910285753424656
SE of the estimate: 0.47678386671159984
Confidence interval: [24.964383561643835, 26.83568493150685]
In [ ]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2010.csv",encoding='latin-1')

#Create a new column which is the date
data['Date'] = data['Dia de mes'].apply(lambda x: str(x)) + '-' + data['Mes de any'].apply(lambda x: str(x))
data2 = data['Date']
counts2010 = data['Date'].value_counts()
print('2010: Mean', counts2010.mean())

data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2013.csv",encoding='latin-1')
#Create a new column which is the date
data['Date'] = data['Dia de mes'].apply(lambda x: str(x)) + '-' + data['Mes de any'].apply(lambda x: str(x))
data2 = data['Date']
counts2013 = data['Date'].value_counts()
print('2013: Mean', counts2013.mean())
2010: Mean 24.81095890410959
2013: Mean 25.90958904109589
In [ ]:
n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt(n)]
print('2010 accident rate estimate: ', counts2010.mean())
print('2013 accident rate estimate: ', counts2013.mean())
print('CI for 2013: ',ci)
2010 accident rate estimate:  24.81095890410959
2013 accident rate estimate:  25.90958904109589
CI for 2013:  [24.975156065800284, 26.8440220163915]
In [ ]:
m = len(counts2010)
n = len(counts2013)
p = (counts2013.mean() - counts2010.mean())
print('m: ', m, 'n: ',n)
print('mean difference: ', p)
m:  365 n:  365
mean difference:  1.0986301369863014
In [ ]:
# pooling distributions
x = counts2010
y = counts2013
pool = np.concatenate([x, y])
np.random.shuffle(pool)

#sample generation
import random
N = 10000 #number of sample
diff = []
for i in range(N):
  p1 = [random.choice(pool) for _ in range(n)]
  p2 = [random.choice(pool) for _ in range(n)]
  diff.append(np.mean(p1) - np.mean(p2))
In [ ]:
#counting differences larger than ths observed one
diff2 = np.array(diff)
w1 = np.where(diff2 > p) [0]

print('p-value (Simulation)= ', len(w1)/float(N), '(', len(w1)/float(N)*100,'%)', 'Difference =', p)
if (len(w1)/float(N)) <0.05:
  print('The effect is likely')
else:
  print('The effect is not likely')
p-value (Simulation)=  0.0461 ( 4.61 %) Difference = 1.0986301369863014
The effect is likely