Open Class Python

Python ITB
Makers Institute

13 February 2018

About Us

Since 12 January 2016
First open class: September 2016

About Me

Aris Budi Wibowo

about.me/arisbw | @arisbw

Outline

Introduction to Python
Let's use Pandas!
Introduction to Machine Learning
What the hell is KNN?

You can also view this repository from: https://github.com/arisbw/open-class-makers

Introduction to Python

Say hello to dat snake. 🐍

In [1]:

print("Hello Python!")

Hello Python!

Object types in Python

Number

In case you forgot what simple math looks like...

In [2]:

1 + 1

Out[2]:

2

In [3]:

8 - 1

Out[3]:

7

In [4]:

10 * 2

Out[4]:

20

In [5]:

35 / 5

Out[5]:

7.0

In [6]:

5 // 3

Out[6]:

1

In [7]:

5.0 // 3.0

Out[7]:

1.0
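The difference between `/` and `//` is easy to mix up, especially with negative numbers. The following is a small sketch of how the two operators relate to each other and to `%`; the variable names are only illustrative and do not come from the original notebook.

```python
# True division always returns a float, even when the result is whole.
true_div = 35 / 5         # 7.0

# Floor division rounds down (toward negative infinity).
floor_pos = 5 // 3        # 1
floor_neg = -5 // 3       # -2, not -1

# The remainder operator pairs with floor division.
remainder = 5 % 3         # 2

# divmod returns (quotient, remainder) in one call.
quotient, rest = divmod(5, 3)

print(true_div, floor_pos, floor_neg, remainder, quotient, rest)
```

For integers, `a == (a // b) * b + (a % b)` holds, which is why `//` and `%` are usually explained together.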

In [2]:

2**4

Out[2]:

16

String

In [28]:

s = "ayam"
len(s)

Out[28]:

4

In [29]:

s[0]

Out[29]:

'a'

Or, we can do this!

In [30]:

s[-1]

Out[30]:

'm'

In [31]:

s[-2]

Out[31]:

'a'

We can also slice that string

In [32]:

s[0:3]

Out[32]:

'aya'

In [33]:

s[1:]

Out[33]:

'yam'

In [34]:

s[:5]

Out[34]:

'ayam'

In [35]:

s[:-1]

Out[35]:

'aya'
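The slicing cells above rely on a few rules that are worth spelling out: the stop index is excluded, slices that run past the end are clipped, and an optional third number sets the step. Here is a minimal sketch of those rules using the same string; the variable names are only for illustration.

```python
s = "ayam"

# Slices are half-open: start is included, stop is excluded.
first_three = s[0:3]      # 'aya'

# A slice that runs past the end is clipped instead of raising an error...
clipped = s[:5]           # 'ayam'

# ...but indexing a single position past the end raises IndexError.
try:
    s[5]
    index_ok = True
except IndexError:
    index_ok = False

# A third number sets the step; a step of -1 walks backwards.
every_other = s[::2]      # 'aa'
reversed_s = s[::-1]      # 'maya'

print(first_three, clipped, index_ok, every_other, reversed_s)
```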

Or concatenate it with another string

In [36]:

s + ' ' +'euy'

Out[36]:

'ayam euy'

In [37]:

s*4

Out[37]:

'ayamayamayamayam'

Find and replace substring

In [39]:

s.find('yam')

Out[39]:

1

In [40]:

s.replace('ayam', 'bebek')

Out[40]:

'bebek'

In [42]:

line = 'aaa,bbb,ccccc,dd'

In [43]:

line.split(',')

Out[43]:

['aaa', 'bbb', 'ccccc', 'dd']

In [41]:

s

Out[41]:

'ayam'

In [44]:

s.upper()

Out[44]:

'AYAM'

In [50]:

s.isalpha()

Out[50]:

True
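One detail behind the string cells above: strings are immutable, so methods such as `replace` and `upper` return a new string and leave `s` unchanged. The sketch below illustrates this, together with `join` as the inverse of `split` and the return value of `find` when nothing matches. The word 'sapi' is just an arbitrary substring chosen for this example.

```python
s = "ayam"

# Strings are immutable: methods return a new string and leave s untouched.
replaced = s.replace('ayam', 'bebek')
shouted = s.upper()

# split breaks a string into a list; join is the inverse operation.
line = 'aaa,bbb,ccccc,dd'
parts = line.split(',')
rejoined = ','.join(parts)

# find returns the index of the first match, or -1 when there is none.
found = s.find('yam')
missing = s.find('sapi')

print(s, replaced, shouted, parts, rejoined, found, missing)
```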

List

In [1]:

L = [123, 'spam', 1.23]

In [2]:

L[0]

Out[2]:

123

In [3]:

L[:-1]

Out[3]:

[123, 'spam']

In [4]:

L + [4, 5, 6]

Out[4]:

[123, 'spam', 1.23, 4, 5, 6]

In [5]:

L

Out[5]:

[123, 'spam', 1.23]

In [6]:

L.append('hehe')

In [7]:

L

Out[7]:

[123, 'spam', 1.23, 'hehe']

In [8]:

L.pop(1)

Out[8]:

'spam'

In [9]:

L

Out[9]:

[123, 1.23, 'hehe']
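Unlike strings, lists are mutable. The cells above show that `+` leaves `L` unchanged while `append` and `pop` modify it in place. A short sketch of that distinction follows, with one extra point that is an addition of this example rather than something from the notebook: two names can refer to the same list, so use `copy()` when you need an independent one.

```python
L = [123, 'spam', 1.23]

# Concatenation builds a new list; L itself is unchanged.
longer = L + [4, 5, 6]

# append and pop modify the list in place.
L.append('hehe')
removed = L.pop(1)        # removes and returns the item at index 1

# Two names can refer to the same list, so copy when you need independence.
alias = L
copy_of_L = L.copy()
L.append('new')

print(longer, removed, alias, copy_of_L)
```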

Dictionaries

In [117]:

D = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}

In [118]:

D['food']

Out[118]:

'Spam'

In [119]:

D['quantity']

Out[119]:

4

In [120]:

D = {}
D['name'] = 'Bob'  # Create keys by assignment
D['job']  = 'dev'
D['age']  = 40

In [121]:

D

Out[121]:

{'age': 40, 'job': 'dev', 'name': 'Bob'}
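Beyond indexing and assignment, a few other dictionary operations come up constantly. The sketch below reuses the same `D` from the cells above; the `'salary'` key is a hypothetical missing key chosen for this example.

```python
D = {'name': 'Bob', 'job': 'dev', 'age': 40}

# Indexing a missing key raises KeyError; .get returns a default instead.
salary = D.get('salary', 0)

# Membership tests look at keys, not values.
has_job = 'job' in D

# Iterate over key/value pairs.
for key, value in D.items():
    print(key, '->', value)

# Update and delete entries in place.
D['age'] += 1
del D['job']

print(salary, has_job, D)
```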

What if...

In [8]:

a = 90

if a > 50:
    print("More than 50")
else:
    print("50 or less")

More than 50

In [9]:

a = 3
"seeep" if a > 2 else 2

Out[9]:

'seeep'

Now, what ifs...

In [10]:

a = 400

if a < 100:
    print("Less than 100")
elif a >= 100 and a <= 500:
    print("Between 100 and 500")
else:
    print("More than 500")

Between 100 and 500
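Python also allows chained comparisons, so a range check like the one in the `elif` above can be written the way it reads in math. The sketch below wraps the same three-way branch in a function; `bucket` is a hypothetical helper name used only here.

```python
def bucket(a):
    """Return a label for a, using a chained comparison for the middle range."""
    if a < 100:
        return "less than 100"
    elif 100 <= a <= 500:      # same as: a >= 100 and a <= 500
        return "between 100 and 500"
    else:
        return "more than 500"

# Check the boundaries as well as a value in the middle.
for value in (99, 100, 400, 500, 501):
    print(value, bucket(value))
```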

... say that again?

In [11]:

a = 5

for i in range(a):
    if a > 10:
        print("More than 10")
    else:
        print("10 or less")

10 or less
10 or less
10 or less
10 or less
10 or less

In [12]:

a = 0

while a < 5:
    print("Easy peasy")
    a += 1

Easy peasy
Easy peasy
Easy peasy
Easy peasy
Easy peasy

Functions!

In [13]:

def tambah(a=0, b=0):  # "tambah" is Indonesian for "add"
    c = a + b
    return c

tambah(2, 3)

Out[13]:

5
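Because both parameters of `tambah` have default values, the function can be called with zero, one, or two arguments, and the arguments can also be passed by name. A brief sketch of those call styles:

```python
def tambah(a=0, b=0):
    """Add two numbers; both default to 0 ("tambah" means "add")."""
    return a + b

no_args = tambah()              # both defaults are used
one_arg = tambah(2)             # b falls back to 0
by_keyword = tambah(b=3, a=2)   # order does not matter with keywords
with_floats = tambah(1.5, 2)    # works for any types that support +

print(no_args, one_arg, by_keyword, with_floats)
```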

Pandas!

Import the libraries that we will use later.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

First, let's download the data we need.

In [2]:

!curl -O https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 66316  100 66316    0     0   121k      0 --:--:-- --:--:-- --:--:--  121k

Alternatively, use wget if you are on Linux.

In [15]:

df = pd.read_csv("elements-by-episode.csv")

Let's peek at the data.

In [16]:

df.head()

Out[16]:

	EPISODE	TITLE	BUSHES	...	TREE	TREES	WINTER
0	S01E01	"A WALK IN THE WOODS"	1	...	1	1	0
1	S01E02	"MT. MCKINLEY"	0	...	1	1	1
2	S01E03	"EBONY SUNSET"	0	...	1	1	1
3	S01E04	"WINTER MIST"	1	...	1	1	0
4	S01E05	"QUIET STREAM"	0	...	1	1	0

5 rows × 69 columns

In [17]:

df.columns

Out[17]:

Index(['EPISODE', 'TITLE', 'APPLE_FRAME', 'AURORA_BOREALIS', 'BARN', 'BEACH',
       'BOAT', 'BRIDGE', 'BUILDING', 'BUSHES', 'CABIN', 'CACTUS',
       'CIRCLE_FRAME', 'CIRRUS', 'CLIFF', 'CLOUDS', 'CONIFER', 'CUMULUS',
       'DECIDUOUS', 'DIANE_ANDRE', 'DOCK', 'DOUBLE_OVAL_FRAME', 'FARM',
       'FENCE', 'FIRE', 'FLORIDA_FRAME', 'FLOWERS', 'FOG', 'FRAMED', 'GRASS',
       'GUEST', 'HALF_CIRCLE_FRAME', 'HALF_OVAL_FRAME', 'HILLS', 'LAKE',
       'LAKES', 'LIGHTHOUSE', 'MILL', 'MOON', 'MOUNTAIN', 'MOUNTAINS', 'NIGHT',
       'OCEAN', 'OVAL_FRAME', 'PALM_TREES', 'PATH', 'PERSON', 'PORTRAIT',
       'RECTANGLE_3D_FRAME', 'RECTANGULAR_FRAME', 'RIVER', 'ROCKS',
       'SEASHELL_FRAME', 'SNOW', 'SNOWY_MOUNTAIN', 'SPLIT_FRAME', 'STEVE_ROSS',
       'STRUCTURE', 'SUN', 'TOMB_FRAME', 'TREE', 'TREES', 'TRIPLE_FRAME',
       'WATERFALL', 'WAVES', 'WINDMILL', 'WINDOW_FRAME', 'WINTER',
       'WOOD_FRAMED'],
      dtype='object')

In [18]:

df.shape

Out[18]:

(403, 69)

Summary of the data:

In [19]:

df.describe()

Out[19]:

	APPLE_FRAME	AURORA_BOREALIS	BARN	BEACH	BOAT	BRIDGE	BUILDING	BUSHES	CABIN	CACTUS	...	TOMB_FRAME	TREE	TREES	TRIPLE_FRAME	WATERFALL	WAVES	WINDMILL	WINDOW_FRAME	WINTER	WOOD_FRAMED
count	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	...	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000	403.000000
mean	0.002481	0.004963	0.042184	0.066998	0.004963	0.017370	0.002481	0.297767	0.171216	0.009926	...	0.002481	0.895782	0.836228	0.002481	0.096774	0.084367	0.002481	0.002481	0.171216	0.002481
std	0.049814	0.070359	0.201258	0.250328	0.070359	0.130807	0.049814	0.457845	0.377166	0.099255	...	0.049814	0.305923	0.370528	0.049814	0.296018	0.278283	0.049814	0.049814	0.377166	0.049814
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	...	0.000000	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

8 rows × 67 columns

Transpose the data

In [20]:

df.T.head()

Out[20]:

	0	1	2	3	4	5	6	7	8	9	...	393	394	395	396	397	398	399	400	401	402
EPISODE	S01E01	S01E02	S01E03	S01E04	S01E05	S01E06	S01E07	S01E08	S01E09	S01E10	...	S31E04	S31E05	S31E06	S31E07	S31E08	S31E09	S31E10	S31E11	S31E12	S31E13
TITLE	"A WALK IN THE WOODS"	"MT. MCKINLEY"	"EBONY SUNSET"	"WINTER MIST"	"QUIET STREAM"	"WINTER MOON"	"AUTUMN MOUNTAINS"	"PEACEFUL VALLEY"	"SEASCAPE"	"MOUNTAIN LAKE"	...	"TRANQUILITY COVE"	"CABIN IN THE HOLLOW"	"VIEW FROM CLEAR CREEK"	"BRIDGE TO AUTUMN"	"TRAIL'S END"	"EVERGREEN VALLEY"	"BALMY BEACH"	"LAKE AT THE RIDGE"	"IN THE MIDST OF WINTER"	"WILDERNESS DAY"
APPLE_FRAME	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
AURORA_BOREALIS	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
BARN	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0

5 rows × 403 columns

Sort the data

In [26]:

df.sort_values(by='TREES',ascending=False).head()

Out[26]:

	EPISODE	TITLE	BUSHES	...	TREE	TREES	WATERFALL	WINTER
0	S01E01	"A WALK IN THE WOODS"	1	...	1	1	0	0
263	S21E04	"SERENITY"	0	...	1	1	1	0
261	S21E02	"TRANQUIL DAWN"	0	...	1	1	0	0
260	S21E01	"VALLEY VIEW"	0	...	1	1	0	0
259	S20E13	"DOUBLE TAKE"	1	...	1	1	0	1

5 rows × 69 columns
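Sorting by a single 0/1 column leaves many ties, because most rows share the same value. Two related ideas may help here: sort by more than one column to break ties, or sum each indicator column to count how often an element appears. The sketch below uses a tiny made-up frame in the same layout as the episode table; the values and the `toy` name are assumptions for illustration, not data from the CSV.

```python
import pandas as pd

# Toy data in the same 0/1 layout as the episode table (values are made up).
toy = pd.DataFrame({
    'EPISODE': ['E1', 'E2', 'E3', 'E4'],
    'TREES':   [1, 1, 0, 1],
    'CABIN':   [0, 1, 0, 0],
    'LAKE':    [1, 1, 0, 0],
})

# Summing each indicator column counts the episodes that contain that element.
counts = toy.drop(columns='EPISODE').sum().sort_values(ascending=False)
print(counts)

# Sorting rows by several columns breaks ties in the first column.
ranked = toy.sort_values(by=['TREES', 'CABIN'], ascending=False)
print(ranked)
```

On the real data, the same `drop(...).sum().sort_values(...)` chain would need both text columns (`EPISODE` and `TITLE`) dropped first.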

Select and Slice

In [21]:

df.EPISODE[1:5]

Out[21]:

1    S01E02
2    S01E03
3    S01E04
4    S01E05
Name: EPISODE, dtype: object

In [22]:

df['EPISODE'][1:5]

Out[22]:

1    S01E02
2    S01E03
3    S01E04
4    S01E05
Name: EPISODE, dtype: object

In [23]:

df.loc[0:2]

Out[23]:

	EPISODE	TITLE	BUSHES	...	TREE	TREES	WINTER
0	S01E01	"A WALK IN THE WOODS"	1	...	1	1	0
1	S01E02	"MT. MCKINLEY"	0	...	1	1	1
2	S01E03	"EBONY SUNSET"	0	...	1	1	1

3 rows × 69 columns

In [29]:

df.loc[100:106,['WATERFALL','TREE']]

Out[29]:

	WATERFALL	TREE
100	0	0
101	0	1
102	0	1
103	0	1
104	0	1
105	0	0
106	0	1

In [37]:

df[(df.BRIDGE>0) & (df.WINTER>0)]

Out[37]:

	EPISODE	TITLE	APPLE_FRAME	AURORA_BOREALIS	BARN	BEACH	BOAT	BRIDGE	BUILDING	BUSHES	...	TOMB_FRAME	TREE	TREES	TRIPLE_FRAME	WATERFALL	WAVES	WINDMILL	WINDOW_FRAME	WINTER	WOOD_FRAMED
240	S19E07	"COVERED BRIDGE OVAL"	0	0	0	0	0	1	0	0	...	0	1	1	0	0	0	0	0	1	0

1 rows × 69 columns
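The selections above mix three mechanisms: label-based `.loc`, position-based `.iloc`, and boolean masks. A small sketch on a made-up frame shows how they differ; the index labels 10 to 13 are chosen on purpose so that labels and positions do not coincide, and none of the values come from the episode data.

```python
import pandas as pd

# Toy frame with a non-default index so the two indexers differ visibly.
toy = pd.DataFrame(
    {'BRIDGE': [0, 1, 1, 0], 'WINTER': [1, 1, 0, 0]},
    index=[10, 11, 12, 13],
)

# .loc selects by label and includes both endpoints.
by_label = toy.loc[10:12]

# .iloc selects by position and excludes the stop, like list slicing.
by_position = toy.iloc[0:2]

# A boolean mask keeps the rows where the condition is True.
# Each condition needs its own parentheses because & binds tighter than >.
mask = (toy.BRIDGE > 0) & (toy.WINTER > 0)
both = toy[mask]

print(by_label, by_position, both, sep='\n\n')
```

This is also why `df.loc[0:2]` above returns three rows rather than two.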

In [82]:

df1 = df.iloc[:,3:6]

In [83]:

df1.head()

Out[83]:

	AURORA_BOREALIS	BARN	BEACH
0	0	0	0
1	0	0	0
2	0	0	0
3	0	0	0
4	0	0	0

In [84]:

df1.plot.hist()

Out[84]:

<matplotlib.axes._subplots.AxesSubplot at 0x1feccad0cc0>

In [95]:

df1.iloc[30:100,:].plot(figsize=(8, 8))

Out[95]:

<matplotlib.axes._subplots.AxesSubplot at 0x1fecde85748>

A Bit of Intro to Machine Learning

What is Machine Learning?

Machine Learning: ... a method of teaching computers to make and improve predictions or behaviors based on some data.

Basically, there are three main types of task in machine learning:

Supervised learning: the training examples are labeled

Unsupervised learning: the training examples are not labeled

Reinforcement learning: learning from rewards and punishments
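To make the first two categories concrete, here is a minimal sketch with scikit-learn on six made-up points. The choice of `KNeighborsClassifier` and `KMeans` is an assumption of this example (any classifier and any clustering method would illustrate the same contrast), and reinforcement learning is left out because it does not fit in a few lines.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of points (made-up values).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

# Supervised: labels are given, and the model learns to reproduce them.
y = np.array(['small', 'small', 'small', 'large', 'large', 'large'])
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, y)
predicted = classifier.predict([[0.1, 0.1], [5.1, 5.1]])

# Unsupervised: no labels, the model only groups similar points together.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = clusterer.labels_

print(predicted, groups)
```

The classifier returns the label names it was given, while the clustering only returns arbitrary group numbers, since it never saw any labels.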

Why Machine Learning?

I mean... why now?

Many algorithms perform really well right now:

Tree-based algorithms: Random Forest, XGBoost, CatBoost

Deep learning

etc.

A Bit of KNN

DISCLAIMER: This is an oversimplified version.

KNN (k-nearest neighbors) is a method for both classification and regression. Here, we will explore it on a classification problem.

In [97]:

from sklearn import neighbors
import numpy as np
%matplotlib inline
import seaborn

Create dataset:

In [98]:

training_data = pd.DataFrame()

training_data['test_1'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
training_data['test_2'] = [0.5846,0.2654,0.2615,0.4538,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]
training_data['outcome'] = ['win','win','win','win','win','loss','loss','loss','loss','loss']

training_data.head()

Out[98]:

	test_1	test_2	outcome
0	0.3051	0.5846	win
1	0.4949	0.2654	win
2	0.6974	0.2615	win
3	0.3769	0.4538	win
4	0.2231	0.4615	win

Plot the data

In [99]:

seaborn.lmplot(x='test_1', y='test_2', data=training_data, fit_reg=False, hue="outcome", scatter_kws={"marker": "D", "s": 100})

Out[99]:

<seaborn.axisgrid.FacetGrid at 0x1fece93a748>

Convert the data to NumPy arrays

In [100]:

# as_matrix() was removed in pandas 1.0; select the columns and call to_numpy() instead
X = training_data[['test_1', 'test_2']].to_numpy()
y = np.array(training_data['outcome'])

Training!

In [101]:

clf = neighbors.KNeighborsClassifier(n_neighbors=3, weights='uniform')
trained_model = clf.fit(X, y)

drum rolls 🥁

In [102]:

trained_model.score(X, y)

Out[102]:

0.80000000000000004

Check the model's prediction for a new data point

In [103]:

x_test = np.array([[.4,.6]])

In [104]:

trained_model.predict(x_test)

Out[104]:

array(['loss'], dtype=object)

Check the predicted class probabilities

In [105]:

trained_model.predict_proba(x_test)

Out[105]:

array([[ 0.66666667,  0.33333333]])
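To see where these probabilities come from, the prediction can be reproduced by hand with NumPy. The sketch below assumes Euclidean distance and an unweighted majority vote, which matches `weights='uniform'` and scikit-learn's default metric; the variable names are illustrative.

```python
import numpy as np
from collections import Counter

# Training data copied from the cells above.
X_train = np.array([
    [0.3051, 0.5846], [0.4949, 0.2654], [0.6974, 0.2615], [0.3769, 0.4538],
    [0.2231, 0.4615], [0.3410, 0.8308], [0.4436, 0.4962], [0.5897, 0.3269],
    [0.6308, 0.5346], [0.5000, 0.6731],
])
y_train = np.array(['win'] * 5 + ['loss'] * 5)
query = np.array([0.4, 0.6])
k = 3

# 1. Euclidean distance from the query to every training point.
distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))

# 2. Indices of the k closest training points.
nearest = np.argsort(distances)[:k]

# 3. Majority vote among their labels.
votes = Counter(y_train[nearest])
prediction = votes.most_common(1)[0][0]

print(nearest, votes, prediction)
```

The three nearest points are rows 0, 6, and 9, giving two 'loss' votes and one 'win' vote. That is consistent with the 0.67 and 0.33 reported by `predict_proba`, whose columns follow the alphabetical class order ('loss', then 'win').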

Best k?

Load libraries

In [109]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

Load Iris Flower Data

In [110]:

iris = datasets.load_iris()
X = iris.data
y = iris.target

Standardize Data

In [111]:

standardizer = StandardScaler()

X_std = standardizer.fit_transform(X)
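Standardizing matters for KNN because distances are otherwise dominated by whichever feature has the largest scale. What `StandardScaler` computes can be checked by hand: subtract each column's mean and divide by its standard deviation. A minimal sketch, with `X_manual` and `X_sklearn` as illustrative names:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data

# Subtract each column's mean and divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point error.
print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0).round(6), X_manual.std(axis=0).round(6))
```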

Create knn model

In [112]:

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', n_jobs=-1).fit(X_std, y)

Create a search space of candidate values for k

In [113]:

pipe = Pipeline([('standardizer', standardizer), ('knn', knn)])

search_space = [{'knn__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

Search!

In [115]:

# Pass the raw X: the pipeline already standardizes inside each CV fold
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(X, y)

Show best k

In [116]:

clf.best_estimator_.get_params()['knn__n_neighbors']

Out[116]:
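Conceptually, the grid search tries every candidate k, scores each one with cross-validation, and keeps the best. The loop below is a rough sketch of that idea using `cross_val_score`; it is not how `GridSearchCV` is implemented internally, and the names `mean_scores` and `best_k` are illustrative. No expected result is shown, since the exact scores can vary with the scikit-learn version.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

mean_scores = {}
for k in range(1, 11):
    # A fresh, unfitted pipeline per k; scaling is refit inside every fold.
    pipe = Pipeline([
        ('standardizer', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k)),
    ])
    mean_scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

# Pick the k with the highest mean accuracy (ties go to the smallest k).
best_k = max(mean_scores, key=mean_scores.get)

print(mean_scores)
print(best_k)
```

Keeping the scaler inside the pipeline means each fold is standardized using only its own training portion, which avoids leaking information from the held-out part.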

Thank you! 🙇‍🙏

Misc.

In [1]:

%mkdir image\paintings
%cd image\paintings

D:\Google Drive\Python ITB\Kelas Makers\image\paintings

In [2]:

# Credit to: Jeff Thompson (https://gist.github.com/jeffThompson/6d4c45f89dc907925775972e72d9cf00)

import urllib.request
from tqdm import tqdm

num_images = 403

for i in tqdm(range(1, num_images + 1)):
    url = 'http://www.twoinchbrush.com/images/painting' + str(i) + '.png'
    filename = '%03d.png' % i  # zero-padded, e.g. 001.png
    try:
        urllib.request.urlretrieve(url, filename)
    except Exception as err:
        # Report which download failed instead of a bare error message
        print('- ERROR downloading', url, ':', err)

100%|████████████████████████████████████████████████████████████████████████████████| 403/403 [14:12<00:00,  2.37s/it]
