Note: this extension has only been validated in a test environment. If you intend to use it in production, assess the risks yourself.
References:
https://yq.aliyun.com/articles/157
https://blog.csdn.net/wzy0623/article/details/78845020
★ Download the RPM package for the appropriate version
http://madlib.apache.org/
★ RPM installation
1. Check whether a madlib package is already installed; if it is, perform step 2
># rpm -qa | grep madlib
2. Remove the existing madlib package
># rpm -e madlib-1.14-1.x86_64
3. Install the madlib package
># rpm -ivh apache-madlib-1.14-bin-Linux.rpm
★ Confirm the platform information
># su - demo
>$ psql postgres
postgres=# select version();
                                                                    version
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.23 (Greenplum Database 5.0.0 build dev) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit compiled on Jul 2 2018 17:14:09 (with assert checking)
(1 row)
postgres=#
★ Install MADlib into the database
># su - demo
>$ /usr/local/madlib/bin/madpack install -s madlib -p greenplum -c demo@127.0.0.1:5432/postgres
madpack.py: INFO : Detected Greenplum DB version 5.0.0.
madpack.py: INFO : *** Installing MADlib ***
madpack.py: INFO : MADlib tools version = 1.14 (/usr/local/madlib/Versions/1.14/bin/../madpack/madpack.py)
madpack.py: INFO : MADlib database version = None (host=127.0.0.1:5432, db=postgres, schema=madlib)
madpack.py: INFO : Testing PL/Python environment...
madpack.py: INFO : > Creating language PL/Python...
madpack.py: INFO : > PL/Python environment OK (version: 2.7.5)
madpack.py: INFO : Installing MADlib into MADLIB schema...
madpack.py: INFO : > Creating MADLIB schema
madpack.py: INFO : > Creating MADLIB.MigrationHistory table
madpack.py: INFO : > Writing version info in MigrationHistory table
madpack.py: INFO : > Creating objects for modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
madpack.py: INFO : > - crf
madpack.py: INFO : > - elastic_net
madpack.py: INFO : > - linalg
madpack.py: INFO : > - pmml
madpack.py: INFO : > - prob
madpack.py: INFO : > - sketch
madpack.py: INFO : > - svec
madpack.py: INFO : > - svm
madpack.py: INFO : > - tsa
madpack.py: INFO : > - stemmer
madpack.py: INFO : > - conjugate_gradient
madpack.py: INFO : > - knn
madpack.py: INFO : > - lda
madpack.py: INFO : > - stats
madpack.py: INFO : > - svec_util
madpack.py: INFO : > - utilities
madpack.py: INFO : > - assoc_rules
madpack.py: INFO : > - convex
madpack.py: INFO : > - glm
madpack.py: INFO : > - graph
madpack.py: INFO : > - linear_systems
madpack.py: INFO : > - recursive_partitioning
madpack.py: INFO : > - regress
madpack.py: INFO : > - sample
madpack.py: INFO : > - summary
madpack.py: INFO : > - kmeans
madpack.py: INFO : > - pca
madpack.py: INFO : > - validation
madpack.py: INFO : MADlib 1.14 installed successfully in MADLIB schema.
Check the installation
>$ /usr/local/madlib/bin/madpack install-check -p greenplum -c demo@127.0.0.1:5432/postgres
madpack.py: INFO : Detected Greenplum DB version 5.0.0.
TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 3645 milliseconds
TEST CASE RESULT|Module: bayes|gaussian_naive_bayes.sql_in|PASS|Time: 2175 milliseconds
TEST CASE RESULT|Module: bayes|bayes.sql_in|PASS|Time: 6984 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.sql_in|PASS|Time: 2600 milliseconds
TEST CASE RESULT|Module: crf|crf_train_large.sql_in|PASS|Time: 3662 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.sql_in|PASS|Time: 3070 milliseconds
TEST CASE RESULT|Module: crf|crf_test_large.sql_in|PASS|Time: 2886 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net_install_check.sql_in|PASS|Time: 10679 milliseconds
TEST CASE RESULT|Module: linalg|svd.sql_in|PASS|Time: 15686 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.sql_in|PASS|Time: 11672 milliseconds
TEST CASE RESULT|Module: linalg|linalg.sql_in|PASS|Time: 645 milliseconds
TEST CASE RESULT|Module: prob|prob.sql_in|PASS|Time: 2617 milliseconds
TEST CASE RESULT|Module: sketch|support.sql_in|PASS|Time: 140 milliseconds
TEST CASE RESULT|Module: sketch|mfv.sql_in|PASS|Time: 449 milliseconds
TEST CASE RESULT|Module: sketch|fm.sql_in|PASS|Time: 3138 milliseconds
TEST CASE RESULT|Module: sketch|cm.sql_in|PASS|Time: 6968 milliseconds
TEST CASE RESULT|Module: svm|svm.sql_in|PASS|Time: 47427 milliseconds
TEST CASE RESULT|Module: tsa|arima_train.sql_in|PASS|Time: 8105 milliseconds
TEST CASE RESULT|Module: tsa|arima.sql_in|PASS|Time: 8881 milliseconds
TEST CASE RESULT|Module: conjugate_gradient|conj_grad.sql_in|PASS|Time: 1023 milliseconds
TEST CASE RESULT|Module: knn|knn.sql_in|PASS|Time: 4071 milliseconds
TEST CASE RESULT|Module: lda|lda.sql_in|PASS|Time: 13918 milliseconds
TEST CASE RESULT|Module: stats|wsr_test.sql_in|PASS|Time: 483 milliseconds
TEST CASE RESULT|Module: stats|t_test.sql_in|PASS|Time: 535 milliseconds
TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.sql_in|PASS|Time: 3388 milliseconds
TEST CASE RESULT|Module: stats|pred_metrics.sql_in|PASS|Time: 3386 milliseconds
TEST CASE RESULT|Module: stats|mw_test.sql_in|PASS|Time: 313 milliseconds
TEST CASE RESULT|Module: stats|ks_test.sql_in|PASS|Time: 434 milliseconds
TEST CASE RESULT|Module: stats|f_test.sql_in|PASS|Time: 253 milliseconds
TEST CASE RESULT|Module: stats|cox_prop_hazards.sql_in|PASS|Time: 7802 milliseconds
TEST CASE RESULT|Module: stats|correlation.sql_in|PASS|Time: 4900 milliseconds
TEST CASE RESULT|Module: stats|chi2_test.sql_in|PASS|Time: 639 milliseconds
TEST CASE RESULT|Module: stats|anova_test.sql_in|PASS|Time: 372 milliseconds
TEST CASE RESULT|Module: svec_util|svec_test.sql_in|PASS|Time: 2495 milliseconds
TEST CASE RESULT|Module: svec_util|gp_sfv_sort_order.sql_in|PASS|Time: 454 milliseconds
TEST CASE RESULT|Module: utilities|text_utilities.sql_in|PASS|Time: 651 milliseconds
TEST CASE RESULT|Module: utilities|sessionize.sql_in|PASS|Time: 552 milliseconds
TEST CASE RESULT|Module: utilities|pivot.sql_in|PASS|Time: 4107 milliseconds
TEST CASE RESULT|Module: utilities|path.sql_in|PASS|Time: 1948 milliseconds
TEST CASE RESULT|Module: utilities|minibatch_preprocessing.sql_in|PASS|Time: 7324 milliseconds
TEST CASE RESULT|Module: utilities|encode_categorical.sql_in|PASS|Time: 1764 milliseconds
TEST CASE RESULT|Module: utilities|drop_madlib_temp.sql_in|PASS|Time: 390 milliseconds
TEST CASE RESULT|Module: assoc_rules|assoc_rules.sql_in|PASS|Time: 6334 milliseconds
TEST CASE RESULT|Module: convex|mlp.sql_in|PASS|Time: 13578 milliseconds
TEST CASE RESULT|Module: convex|lmf.sql_in|PASS|Time: 5267 milliseconds
TEST CASE RESULT|Module: glm|poisson.sql_in|PASS|Time: 4348 milliseconds
TEST CASE RESULT|Module: glm|ordinal.sql_in|PASS|Time: 2583 milliseconds
TEST CASE RESULT|Module: glm|multinom.sql_in|PASS|Time: 3669 milliseconds
TEST CASE RESULT|Module: glm|inverse_gaussian.sql_in|PASS|Time: 4148 milliseconds
TEST CASE RESULT|Module: glm|gaussian.sql_in|PASS|Time: 4124 milliseconds
TEST CASE RESULT|Module: glm|gamma.sql_in|PASS|Time: 8138 milliseconds
TEST CASE RESULT|Module: glm|binomial.sql_in|PASS|Time: 6418 milliseconds
TEST CASE RESULT|Module: graph|wcc.sql_in|PASS|Time: 4071 milliseconds
TEST CASE RESULT|Module: graph|sssp.sql_in|PASS|Time: 7463 milliseconds
TEST CASE RESULT|Module: graph|pagerank.sql_in|PASS|Time: 31027 milliseconds
TEST CASE RESULT|Module: graph|measures.sql_in|PASS|Time: 3481 milliseconds
TEST CASE RESULT|Module: graph|hits.sql_in|PASS|Time: 3985 milliseconds
TEST CASE RESULT|Module: graph|bfs.sql_in|PASS|Time: 7882 milliseconds
TEST CASE RESULT|Module: graph|apsp.sql_in|PASS|Time: 3741 milliseconds
TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.sql_in|PASS|Time: 985 milliseconds
TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.sql_in|PASS|Time: 955 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|random_forest.sql_in|PASS|Time: 24188 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|decision_tree.sql_in|PASS|Time: 10239 milliseconds
TEST CASE RESULT|Module: regress|robust.sql_in|PASS|Time: 14052 milliseconds
TEST CASE RESULT|Module: regress|multilogistic.sql_in|PASS|Time: 4775 milliseconds
TEST CASE RESULT|Module: regress|marginal.sql_in|PASS|Time: 20274 milliseconds
TEST CASE RESULT|Module: regress|logistic.sql_in|PASS|Time: 13970 milliseconds
TEST CASE RESULT|Module: regress|linear.sql_in|PASS|Time: 1114 milliseconds
TEST CASE RESULT|Module: regress|clustered.sql_in|PASS|Time: 8617 milliseconds
TEST CASE RESULT|Module: sample|train_test_split.sql_in|PASS|Time: 2707 milliseconds
TEST CASE RESULT|Module: sample|stratified_sample.sql_in|PASS|Time: 1659 milliseconds
TEST CASE RESULT|Module: sample|sample.sql_in|PASS|Time: 332 milliseconds
TEST CASE RESULT|Module: sample|balance_sample.sql_in|PASS|Time: 3749 milliseconds
TEST CASE RESULT|Module: summary|summary.sql_in|PASS|Time: 2799 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.sql_in|PASS|Time: 18560 milliseconds
TEST CASE RESULT|Module: pca|pca_project.sql_in|PASS|Time: 28497 milliseconds
TEST CASE RESULT|Module: pca|pca.sql_in|PASS|Time: 127319 milliseconds
TEST CASE RESULT|Module: validation|cross_validation.sql_in|PASS|Time: 7857 milliseconds
Note: the general form of the command is /usr/local/madlib/bin/madpack install -s SCHEMA -p PLATFORM -c USER@IP:PORT/DATABASE
★ Verify the schemas
>$ psql postgres
postgres=# \dn madlib*
List of schemas
Name | Owner
--------------------------------+-------
madlib | demo
madlib_installcheck_tsa | demo
(2 rows)
postgres=#
★ Set the search_path
postgres=# set search_path="$user",madlib,public;
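The SET above applies only to the current session. To make the madlib schema visible in every new session (an optional extra, not part of the original notes), one option is to set it at the role level; this assumes the role is named demo as above:
postgres=# alter role demo set search_path = "$user", madlib, public;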
★ Check the version information
postgres=# \dt
List of relations
Schema | Name | Type | Owner | Storage
-----------+----------------------+--------+---------+---------
madlib | migrationhistory | table | demo | heap
(1 row)
postgres=# select * from migrationhistory;
id | version | applied
-----+---------+----------------------------
1 | 1.14 | 2018-07-02 18:02:41.384696
(1 row)
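MADlib also exposes its version through a function; with the madlib schema on the search_path (or by schema-qualifying the call), you can run:
postgres=# select madlib.version();
which returns a one-line description of the installed MADlib build.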
★ A simple usage example
Reference: https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Users
Sample Problem Using Logistic Regression
**1. The sample data set and an introduction to logistic regression are described below.**
The MADlib function used in this example is described in the MADlib logistic regression documentation.
Suppose that we are working with doctors on a project related to heart failure. The dependent variable in the data set is whether the patient has had a second heart attack within 1 year (yes=1). We have two independent variables: one is whether the patient completed a treatment on anger control (yes=1), and the other is a score on a trait anxiety scale (higher score means more anxious).
The idea is to train a model using labeled data, then use this model to predict second heart attack occurrence for other patients.
**2. To interact with the data using MADlib, use the standard psql terminal provided by the database. You could also use a tool like pgAdmin.**
DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;
CREATE TABLE patients( id INTEGER NOT NULL,
second_attack INTEGER,
treatment INTEGER,
trait_anxiety INTEGER);
INSERT INTO patients VALUES
(1, 1, 1, 70),
(3, 1, 1, 50),
(5, 1, 0, 40),
(7, 1, 0, 75),
(9, 1, 0, 70),
(11, 0, 1, 65),
(13, 0, 1, 45),
(15, 0, 1, 40),
(17, 0, 0, 55),
(19, 0, 0, 50),
(2, 1, 1, 80),
(4, 1, 0, 60),
(6, 1, 0, 65),
(8, 1, 0, 80),
(10, 1, 0, 60),
(12, 0, 1, 50),
(14, 0, 1, 35),
(16, 0, 1, 50),
(18, 0, 0, 45),
(20, 0, 0, 60);
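Before training, it can be worth a quick look at the class balance of the loaded data (this check is not part of the original guide):
SELECT second_attack, count(*) AS n
FROM patients
GROUP BY second_attack
ORDER BY second_attack;
With the rows inserted above, each class should contain 10 patients.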
**3. Call MADlib built-in function to train a classification model using the training data table as input:**
SELECT madlib.logregr_train(
'patients', -- source table
'patients_logregr', -- output table
'second_attack', -- labels
'ARRAY[1, treatment, trait_anxiety]', -- features
NULL, -- grouping columns
20, -- max number of iterations
'irls' -- optimizer
);
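Besides the coefficient table patients_logregr, the call also creates a companion table patients_logregr_summary (already listed in the DROP TABLE at the top), which records how the model was trained (source table, variable names, and so on); it can be inspected like any other table:
SELECT * FROM patients_logregr_summary;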
**4. View the model that has just been trained:**
-- Set extended display on for easier reading of output (\x is for psql only)
\x on
SELECT * from patients_logregr;
-- ************ --
-- Result --
-- ************ --
coef | [-6.36346994178187, -1.02410605239327, 0.119044916668606]
log_likelihood | -9.41018298389
std_err | [3.21389766375094, 1.17107844860319, 0.0549790458269309]
z_stats | [-1.97998524145759, -0.874498248699549, 2.16527796868918]
p_values | [0.0477051870698128, 0.38184697353045, 0.0303664045046168]
odds_ratios | [0.0017233763092323, 0.359117354054954, 1.12642051220895]
condition_no | 326.081922792
num_rows_processed | 20
num_missing_rows_skipped | 0
num_iterations | 5
variance_covariance | [[10.3291381930637, -0.47430466519573, -0.171995901260052], [-0.47430466519573, 1.37142473278285, -0.00119520703381598], [-0.171995901260052, -0.00119520703381598, 0.00302269548003977]]
-- Alternatively, unnest the arrays in the results for easier reading of output (\x is for psql only)
\x off
SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) as attribute,
unnest(coef) as coefficient,
unnest(std_err) as standard_error,
unnest(z_stats) as z_stat,
unnest(p_values) as pvalue,
unnest(odds_ratios) as odds_ratio
FROM patients_logregr;
-- ************ --
-- Result --
-- ************ --
+---------------+---------------+------------------+-----------+-----------+--------------+
| attribute | coefficient | standard_error | z_stat | pvalue | odds_ratio |
|---------------+---------------+------------------+-----------+-----------+--------------|
| intercept | -6.36347 | 3.2139 | -1.97999 | 0.0477052 | 0.00172338 |
| treatment | -1.02411 | 1.17108 | -0.874498 | 0.381847 | 0.359117 |
| trait_anxiety | 0.119045 | 0.054979 | 2.16528 | 0.0303664 | 1.12642 |
+---------------+---------------+------------------+-----------+-----------+--------------+
**5. Now use the model to predict the dependent variable (second heart attack within 1 year) using the logistic regression model. For the purpose of demonstration, we will use the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used for prediction.**
-- Display prediction value along with the original value
SELECT p.id, madlib.logregr_predict(coef, ARRAY[1, treatment, trait_anxiety]),
p.second_attack
FROM patients p, patients_logregr m
ORDER BY p.id;
-- ************ --
-- Result --
-- ************ --
+------+-------------------+-----------------+
| id | logregr_predict | second_attack |
|------+-------------------+-----------------|
| 1 | True | 1 |
| 2 | True | 1 |
| 3 | False | 1 |
| 4 | True | 1 |
| 5 | False | 1 |
| 6 | True | 1 |
| 7 | True | 1 |
| 8 | True | 1 |
| 9 | True | 1 |
| 10 | True | 1 |
| 11 | True | 0 |
| 12 | False | 0 |
| 13 | False | 0 |
| 14 | False | 0 |
| 15 | False | 0 |
| 16 | False | 0 |
| 17 | True | 0 |
| 18 | False | 0 |
| 19 | False | 0 |
| 20 | True | 0 |
+------+-------------------+-----------------+
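As a rough check of fit (computed on the training data itself, so it is optimistic; this query is not part of the original guide), you can count how many rows the model classifies correctly:
SELECT sum(CASE WHEN madlib.logregr_predict(coef, ARRAY[1, treatment, trait_anxiety]) = (p.second_attack = 1)
                THEN 1 ELSE 0 END) AS correct,
       count(*) AS total
FROM patients p, patients_logregr m;
With the predictions shown above, 15 of the 20 rows are classified correctly.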
-- Predicting the probability of the dependent variable being TRUE.
-- Display prediction value along with the original value
SELECT p.id, madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])
FROM patients p, patients_logregr m
ORDER BY p.id;
-- ************ --
-- Result --
-- ************ --
+------+------------------------+
| id | logregr_predict_prob |
|------+------------------------|
| 1 | 0.720223 |
| 2 | 0.894355 |
| 3 | 0.19227 |
| 4 | 0.685513 |
| 5 | 0.167748 |
| 6 | 0.798098 |
| 7 | 0.928568 |
| 8 | 0.959306 |
| 9 | 0.877576 |
| 10 | 0.685513 |
| 11 | 0.586701 |
| 12 | 0.19227 |
| 13 | 0.116032 |
| 14 | 0.0383829 |
| 15 | 0.0674976 |
| 16 | 0.19227 |
| 17 | 0.545871 |
| 18 | 0.267675 |
| 19 | 0.398619 |
| 20 | 0.685513 |
+------+------------------------+
The 1 entry in the ARRAY denotes an additional bias term in the model in the standard way, to allow for a non-zero intercept value.
If the probability is greater than 0.5, the prediction is given as True. Otherwise it is given as False.
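If a different cutoff is needed, the probability can be compared against an explicit threshold instead; for example, using 0.7 (an illustrative value, not from the original guide):
SELECT p.id,
       madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety]) > 0.7 AS predicted_second_attack
FROM patients p, patients_logregr m
ORDER BY p.id;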
Next Steps
• For details on all of the machine learning functions provided by MADlib, please refer to the User Documentation
• Try out the available Jupyter notebooks for many MADlib algorithms
• To contribute new modules to MADlib, please refer to the Quick Start Guide for Developers