week 4. Statistical Inference

Notice

Recent Posts

Recent Comments

Link

« 2026/06 »
일	월	화	수	목	금	토
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Tags more

Archives

Today

Total

관리 메뉴

statistics program

week 4. Statistical Inference _ 현주 본문

학습 정리/따봉콩쥐야고마워

week 4. Statistical Inference _ 현주

따봉콩쥐야고마워 2024. 6. 29. 19:17

ch04.ipynb

0.01MB

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [ ]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [ ]:

data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2013.csv",encoding='latin-1')
data['Date'] = '2013-'+data['Mes de any'].apply(lambda x : str(x)) + '-' +  data['Dia de mes'].apply(lambda x : str(x))
data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()
print(accidents.mean())

25.90958904109589

In [ ]:

# population
df = accidents.to_frame()
N_test = 10000
elements = 200

# mean array of samples
means = [0] * N_test

# sample generation
for i in range(N_test):
    rows = np.random.choice(df.index.values, elements)
    sampled_df = df.loc[rows]
    means[i] = sampled_df.mean()

표본 추출을 통한 표준 오차 추출

In [ ]:

import math

rows = np.random.choice(df.index.values, 200)
sampled_df = df.loc[rows]
est_sigma_mean = sampled_df.std()/math.sqrt(200)
print('Direct estimation of SE from one sample of 200 elements:', est_sigma_mean[0])
print('Estimation of the SE by simulating 10000 samples of 200 elements:', np.array(means).std())

Direct estimation of SE from one sample of 200 elements: 0.6549076424202902
Estimation of the SE by simulating 10000 samples of 200 elements: 0.643482902958579

부트스트랩 방법을 사용하여 데이터의 평균을 추정하는 과정을 구현하고, 이를 통해 데이터에 대한 평균 값을 예측

In [ ]:

def meanBootstrap(X, numberb):
  x = [0]*numberb
  for i in range(numberb):
      sample = [X[j] for j in np.random.randint(len(X), size=len(X))]
      x[i] = np.mean(sample)
  return x

m = meanBootstrap(accidents, 10000)
print("Mean estimate:", np.mean(m))

Mean estimate: 25.91052876712329

In [ ]:

m = accidents.mean()
se = accidents.std() / math.sqrt(len(accidents))
ci = [m - se*1.96, m + se*1.96]
print("Confidence interval: ", ci)

Confidence interval:  [24.975156065800284, 26.8440220163915]

In [ ]:

m = meanBootstrap(accidents, 10000)
sample_mean = np.mean(m)
sample_se = np.std(m)

print("Mean estimate:", sample_mean)
print("SE of the estimate:", sample_se)

ci = [np.percentile(m, 2.5), np.percentile(m, 97.5)]
print("Confidence interval:", ci)

Mean estimate: 25.910285753424656
SE of the estimate: 0.47678386671159984
Confidence interval: [24.964383561643835, 26.83568493150685]

In [ ]:

data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2010.csv",encoding='latin-1')

#Create a new column which is the date
data['Date'] = data['Dia de mes'].apply(lambda x: str(x)) + '-' + data['Mes de any'].apply(lambda x: str(x))
data2 = data['Date']
counts2010 = data['Date'].value_counts()
print('2010: Mean', counts2010.mean())

data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ACCIDENTS_GU_BCN_2013.csv",encoding='latin-1')
#Create a new column which is the date
data['Date'] = data['Dia de mes'].apply(lambda x: str(x)) + '-' + data['Mes de any'].apply(lambda x: str(x))
data2 = data['Date']
counts2013 = data['Date'].value_counts()
print('2013: Mean', counts2013.mean())

2010: Mean 24.81095890410959
2013: Mean 25.90958904109589

In [ ]:

n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt(n)]
print('2010 accident rate estimate: ', counts2010.mean())
print('2013 accident rate estimate: ', counts2013.mean())
print('CI for 2013: ',ci)

2010 accident rate estimate:  24.81095890410959
2013 accident rate estimate:  25.90958904109589
CI for 2013:  [24.975156065800284, 26.8440220163915]

In [ ]:

m = len(counts2010)
n = len(counts2013)
p = (counts2013.mean() - counts2010.mean())
print('m: ', m, 'n: ',n)
print('mean difference: ', p)

m:  365 n:  365
mean difference:  1.0986301369863014

In [ ]:

# pooling distributions
x = counts2010
y = counts2013
pool = np.concatenate([x, y])
np.random.shuffle(pool)

#sample generation
import random
N = 10000 #number of sample
diff = []
for i in range(N):
  p1 = [random.choice(pool) for _ in range(n)]
  p2 = [random.choice(pool) for _ in range(n)]
  diff.append(np.mean(p1) - np.mean(p2))

In [ ]:

#counting differences larger than ths observed one
diff2 = np.array(diff)
w1 = np.where(diff2 > p) [0]

print('p-value (Simulation)= ', len(w1)/float(N), '(', len(w1)/float(N)*100,'%)', 'Difference =', p)
if (len(w1)/float(N)) <0.05:
  print('The effect is likely')
else:
  print('The effect is not likely')

p-value (Simulation)=  0.0461 ( 4.61 %) Difference = 1.0986301369863014
The effect is likely

'학습 정리 > 따봉콩쥐야고마워' 카테고리의 다른 글

week 5. Regression Analysis _ 현주 (0)	2024.06.29
week 3. Descriptive Statistics _ 현주 (0)	2024.06.29
week 2. Toolboxes for Data Scientists _ 현주 (0)	2024.06.29
week 1 . 초기 환경 설정 _ 현주 (0)	2024.05.23

'학습 정리/따봉콩쥐야고마워' Related Articles

statistics program

week 4. Statistical Inference _ 현주 본문

week 4. Statistical Inference _ 현주

'학습 정리 > 따봉콩쥐야고마워' 카테고리의 다른 글

티스토리툴바