pythonのpandasパッケージを使ったpivot tabelのご紹介です。Excelのピポットテーブルと似たような感覚で使えると思います。
データ準備
>>> import pandas as pd
>>> d = pd.read_csv('https://raw.githubusercontent.com/maruko-rosso/datasciencehenomiti/master/data/ShopSales.csv')
>>> d.head()
date shop staff item price quantity
0 2018/2/1 New York Kory itemA 290000 1
1 2018/2/1 New York Kory itemB 160000 1
2 2018/2/1 New York Kory itemC 14000 1
3 2018/2/1 San Francisco Rose itemD 570000 3
4 2018/2/1 New York Kory itemE 11100 1
pivot table の作成:単純集計
columnsを指定することで列側のデータを指定することもできます。
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Andale Mono'; color: #2fff12; background-color: #000000; background-color: rgba(0, 0, 0, 0.9)} span.s1 {font-variant-ligatures: no-common-ligatures}
>>> pd.pivot_table(d, index = 'staff', values = 'price', aggfunc = sum)
price
staff
Dawn 38911860
Donna 26332230
Hank 6425110
Kory 28116970
Rachel 270288250
Richard 36471850
Rose 12366200
pivot table の作成:クロス集計
aggfuncにmeanやmedian、countを設定することも可能です。
>>> pd.pivot_table(d, index = 'staff', values = 'price', columns = 'shop', aggfunc = sum)
shop Chicago New York San Francisco
staff
Dawn 1721100.0 37190760.0 NaN
Donna 580000.0 25752230.0 NaN
Hank NaN NaN 6425110.0
Kory NaN 28116970.0 NaN
Rachel 28601900.0 146487370.0 95198980.0
Richard NaN NaN 36471850.0
Rose NaN 2995100.0 9371100.0
上記で発生していたNaNは以下のようにすることで入れ替えできます。
>>> pd.pivot_table(d, index = 'staff', values = 'price', columns = 'shop', aggfunc = sum,fill_value = 0)
shop Chicago New York San Francisco
staff
Dawn 1721100 37190760 0
Donna 580000 25752230 0
Hank 0 0 6425110
Kory 0 28116970 0
Rachel 28601900 146487370 95198980
Richard 0 0 36471850
Rose 0 2995100 9371100
クロス集計表の軸を増やして対応することができます。
>>> pd.pivot_table(d,values = 'price', index = ['shop', 'staff'],
... columns = ['item'], aggfunc = sum)
item itemA itemB itemC itemD \
shop staff
Chicago Dawn NaN NaN NaN 1710000.0
Donna 290000.0 160000.0 NaN NaN
Rachel 6960000.0 2880000.0 56000.0 17100000.0
New York Dawn 8410000.0 4320000.0 210000.0 21090000.0
Donna 3480000.0 5120000.0 70000.0 14250000.0
Kory 6090000.0 2720000.0 28000.0 16530000.0
Rachel 35090000.0 23040000.0 616000.0 74100000.0
Rose 870000.0 960000.0 14000.0 1140000.0
San Francisco Hank 2030000.0 1760000.0 126000.0 1710000.0
Rachel 20590000.0 20480000.0 630000.0 44460000.0
Richard 9280000.0 5600000.0 140000.0 18240000.0
Rose 1740000.0 800000.0 14000.0 5700000.0
item itemE itemF itemG itemH itemI
shop staff
Chicago Dawn 11100.0 NaN NaN NaN NaN
Donna NaN NaN 130000.0 NaN NaN
Rachel 543900.0 737000.0 325000.0 NaN NaN
New York Dawn 444000.0 1474000.0 1235000.0 7760.0 NaN
Donna 499500.0 1742000.0 585000.0 NaN 5730.0
Kory 477300.0 1742000.0 520000.0 7760.0 1910.0
Rachel 1254300.0 9112000.0 3250000.0 15520.0 9550.0
Rose 11100.0 NaN NaN NaN NaN
San Francisco Hank 133200.0 469000.0 195000.0 NaN 1910.0
Rachel 1365300.0 5360000.0 2275000.0 31040.0 7640.0
Richard 477300.0 1876000.0 845000.0 11640.0 1910.0
Rose 122100.0 670000.0 325000.0 NaN NaN
クロス集計②:複数軸と複数計算
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Andale Mono'; color: #2fff12; background-color: #000000; background-color: rgba(0, 0, 0, 0.9)} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Andale Mono'; color: #2fff12; background-color: #000000; background-color: rgba(0, 0, 0, 0.9); min-height: 14.0px} span.s1 {font-variant-ligatures: no-common-ligatures}
>>> pd.pivot_table(d,values = 'price', index = ['shop', 'staff'],
... columns = ['item'], aggfunc = {'price':[sum,min]})
min \
item itemA itemB itemC itemD itemE
shop staff
Chicago Dawn NaN NaN NaN 570000.0 11100.0
Donna 290000.0 160000.0 NaN NaN NaN
Rachel 290000.0 160000.0 14000.0 570000.0 11100.0
New York Dawn 290000.0 160000.0 14000.0 570000.0 11100.0
Donna 290000.0 160000.0 14000.0 570000.0 11100.0
Kory 290000.0 160000.0 14000.0 570000.0 11100.0
Rachel 290000.0 160000.0 14000.0 570000.0 11100.0
Rose 290000.0 160000.0 14000.0 570000.0 11100.0
San Francisco Hank 290000.0 160000.0 14000.0 570000.0 11100.0
Rachel 290000.0 160000.0 14000.0 570000.0 11100.0
Richard 290000.0 160000.0 14000.0 570000.0 11100.0
Rose 290000.0 160000.0 14000.0 570000.0 11100.0
sum \
item itemF itemG itemH itemI itemA
shop staff
Chicago Dawn NaN NaN NaN NaN NaN
Donna NaN 65000.0 NaN NaN 290000.0
Rachel 67000.0 65000.0 NaN NaN 6960000.0
New York Dawn 67000.0 65000.0 3880.0 NaN 8410000.0
Donna 67000.0 65000.0 NaN 1910.0 3480000.0
Kory 67000.0 65000.0 3880.0 1910.0 6090000.0
Rachel 67000.0 65000.0 3880.0 1910.0 35090000.0
Rose NaN NaN NaN NaN 870000.0
San Francisco Hank 67000.0 65000.0 NaN 1910.0 2030000.0
Rachel 67000.0 65000.0 3880.0 1910.0 20590000.0
Richard 67000.0 65000.0 3880.0 1910.0 9280000.0
Rose 67000.0 65000.0 NaN NaN 1740000.0
\
item itemB itemC itemD itemE itemF
shop staff
Chicago Dawn NaN NaN 1710000.0 11100.0 NaN
Donna 160000.0 NaN NaN NaN NaN
Rachel 2880000.0 56000.0 17100000.0 543900.0 737000.0
New York Dawn 4320000.0 210000.0 21090000.0 444000.0 1474000.0
Donna 5120000.0 70000.0 14250000.0 499500.0 1742000.0
Kory 2720000.0 28000.0 16530000.0 477300.0 1742000.0
Rachel 23040000.0 616000.0 74100000.0 1254300.0 9112000.0
Rose 960000.0 14000.0 1140000.0 11100.0 NaN
San Francisco Hank 1760000.0 126000.0 1710000.0 133200.0 469000.0
Rachel 20480000.0 630000.0 44460000.0 1365300.0 5360000.0
Richard 5600000.0 140000.0 18240000.0 477300.0 1876000.0
Rose 800000.0 14000.0 5700000.0 122100.0 670000.0
item itemG itemH itemI
shop staff
Chicago Dawn NaN NaN NaN
Donna 130000.0 NaN NaN
Rachel 325000.0 NaN NaN
New York Dawn 1235000.0 7760.0 NaN
Donna 585000.0 NaN 5730.0
Kory 520000.0 7760.0 1910.0
Rachel 3250000.0 15520.0 9550.0
Rose NaN NaN NaN
San Francisco Hank 195000.0 NaN 1910.0
Rachel 2275000.0 31040.0 7640.0
Richard 845000.0 11640.0 1910.0
Rose 325000.0 NaN NaN