注意: 此插件只是在测试环境论证过,如若在生产环境使用,请自行评估风险
参考:
https://yq.aliyun.com/articles/157
https://blog.csdn.net/wzy0623/article/details/78845020
★下载对应的版本的rpm包
http://madlib.apache.org/
★RPM安装
1.确认madlib是否存在,存在则执行步骤2># rpm -qa | grep madlib2. 卸载madlib包># rpm -e madlib-1.14-1.x86_643. 安装madlib包>#rpm -ivh apache-madlib-1.14-bin-Linux.rpm
★确认安装的平台信息
># su - demo>$ psql postgrespostgres=# select version();version------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------PostgreSQL 8.3.23 (Greenplum Database 5.0.0 build dev) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit compiled on Jul 2 201817:14:09 (with assert checking)(1 row)postgres=#
★安装
># su - demo>$ /usr/local/madlib/bin/madpack install -s madlib -p greenplum -c demo@127.0.0.1:5432/postgresmadpack.py: INFO : Detected Greenplum DB version 5.0.0.madpack.py: INFO : *** Installing MADlib ***madpack.py: INFO : MADlib tools version = 1.14 (/usr/local/madlib/Versions/1.14/bin/../madpack/madpack.py)madpack.py: INFO : MADlib database version = None (host=127.0.0.1:5432, db=postgres, schema=madlib)madpack.py: INFO : Testing PL/Python environment...madpack.py: INFO : > Creating language PL/Python...madpack.py: INFO : > PL/Python environment OK (version: 2.7.5)madpack.py: INFO : Installing MADlib into MADLIB schema...madpack.py: INFO : > Creating MADLIB schemamadpack.py: INFO : > Creating MADLIB.MigrationHistory tablemadpack.py: INFO : > Writing version info in MigrationHistory tablemadpack.py: INFO : > Creating objects for modules:madpack.py: INFO : > - array_opsmadpack.py: INFO : > - bayesmadpack.py: INFO : > - crfmadpack.py: INFO : > - elastic_netmadpack.py: INFO : > - linalgmadpack.py: INFO : > - pmmlmadpack.py: INFO : > - probmadpack.py: INFO : > - sketchmadpack.py: INFO : > - svecmadpack.py: INFO : > - svmmadpack.py: INFO : > - tsamadpack.py: INFO : > - stemmermadpack.py: INFO : > - conjugate_gradientmadpack.py: INFO : > - knnmadpack.py: INFO : > - ldamadpack.py: INFO : > - statsmadpack.py: INFO : > - svec_utilmadpack.py: INFO : > - utilitiesmadpack.py: INFO : > - assoc_rulesmadpack.py: INFO : > - convexmadpack.py: INFO : > - glmmadpack.py: INFO : > - graphmadpack.py: INFO : > - linear_systemsmadpack.py: INFO : > - recursive_partitioningmadpack.py: INFO : > - regressmadpack.py: INFO : > - samplemadpack.py: INFO : > - summarymadpack.py: INFO : > - kmeansmadpack.py: INFO : > - pcamadpack.py: INFO : > - validationmadpack.py: INFO : MADlib 1.14 installed successfully in MADLIB schema.
检查安装
>$ /usr/local/madlib/bin/madpack install-check -p greenplum -c demo@127.0.0.1:5432/postgresmadpack.py: INFO : Detected Greenplum DB version 5.0.0.TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 3645 millisecondsTEST CASE RESULT|Module: bayes|gaussian_naive_bayes.sql_in|PASS|Time: 2175 millisecondsTEST CASE RESULT|Module: bayes|bayes.sql_in|PASS|Time: 6984 millisecondsTEST CASE RESULT|Module: crf|crf_train_small.sql_in|PASS|Time: 2600 millisecondsTEST CASE RESULT|Module: crf|crf_train_large.sql_in|PASS|Time: 3662 millisecondsTEST CASE RESULT|Module: crf|crf_test_small.sql_in|PASS|Time: 3070 millisecondsTEST CASE RESULT|Module: crf|crf_test_large.sql_in|PASS|Time: 2886 millisecondsTEST CASE RESULT|Module: elastic_net|elastic_net_install_check.sql_in|PASS|Time: 10679 millisecondsTEST CASE RESULT|Module: linalg|svd.sql_in|PASS|Time: 15686 millisecondsTEST CASE RESULT|Module: linalg|matrix_ops.sql_in|PASS|Time: 11672 millisecondsTEST CASE RESULT|Module: linalg|linalg.sql_in|PASS|Time: 645 millisecondsTEST CASE RESULT|Module: prob|prob.sql_in|PASS|Time: 2617 millisecondsTEST CASE RESULT|Module: sketch|support.sql_in|PASS|Time: 140 millisecondsTEST CASE RESULT|Module: sketch|mfv.sql_in|PASS|Time: 449 millisecondsTEST CASE RESULT|Module: sketch|fm.sql_in|PASS|Time: 3138 millisecondsTEST CASE RESULT|Module: sketch|cm.sql_in|PASS|Time: 6968 millisecondsTEST CASE RESULT|Module: svm|svm.sql_in|PASS|Time: 47427 millisecondsTEST CASE RESULT|Module: tsa|arima_train.sql_in|PASS|Time: 8105 millisecondsTEST CASE RESULT|Module: tsa|arima.sql_in|PASS|Time: 8881 millisecondsTEST CASE RESULT|Module: conjugate_gradient|conj_grad.sql_in|PASS|Time: 1023 millisecondsTEST CASE RESULT|Module: knn|knn.sql_in|PASS|Time: 4071 millisecondsTEST CASE RESULT|Module: lda|lda.sql_in|PASS|Time: 13918 millisecondsTEST CASE RESULT|Module: stats|wsr_test.sql_in|PASS|Time: 483 millisecondsTEST CASE RESULT|Module: stats|t_test.sql_in|PASS|Time: 535 millisecondsTEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.sql_in|PASS|Time: 3388 millisecondsTEST CASE RESULT|Module: stats|pred_metrics.sql_in|PASS|Time: 3386 millisecondsTEST CASE RESULT|Module: stats|mw_test.sql_in|PASS|Time: 313 millisecondsTEST CASE RESULT|Module: stats|ks_test.sql_in|PASS|Time: 434 millisecondsTEST CASE RESULT|Module: stats|f_test.sql_in|PASS|Time: 253 millisecondsTEST CASE RESULT|Module: stats|cox_prop_hazards.sql_in|PASS|Time: 7802 millisecondsTEST CASE RESULT|Module: stats|correlation.sql_in|PASS|Time: 4900 millisecondsTEST CASE RESULT|Module: stats|chi2_test.sql_in|PASS|Time: 639 millisecondsTEST CASE RESULT|Module: stats|anova_test.sql_in|PASS|Time: 372 millisecondsTEST CASE RESULT|Module: svec_util|svec_test.sql_in|PASS|Time: 2495 millisecondsTEST CASE RESULT|Module: svec_util|gp_sfv_sort_order.sql_in|PASS|Time: 454 millisecondsTEST CASE RESULT|Module: utilities|text_utilities.sql_in|PASS|Time: 651 millisecondsTEST CASE RESULT|Module: utilities|sessionize.sql_in|PASS|Time: 552 millisecondsTEST CASE RESULT|Module: utilities|pivot.sql_in|PASS|Time: 4107 millisecondsTEST CASE RESULT|Module: utilities|path.sql_in|PASS|Time: 1948 millisecondsTEST CASE RESULT|Module: utilities|minibatch_preprocessing.sql_in|PASS|Time: 7324 millisecondsTEST CASE RESULT|Module: utilities|encode_categorical.sql_in|PASS|Time: 1764 millisecondsTEST CASE RESULT|Module: utilities|drop_madlib_temp.sql_in|PASS|Time: 390 millisecondsTEST CASE RESULT|Module: assoc_rules|assoc_rules.sql_in|PASS|Time: 6334 millisecondsTEST CASE RESULT|Module: convex|mlp.sql_in|PASS|Time: 13578 millisecondsTEST CASE RESULT|Module: convex|lmf.sql_in|PASS|Time: 5267 millisecondsTEST CASE RESULT|Module: glm|poisson.sql_in|PASS|Time: 4348 millisecondsTEST CASE RESULT|Module: glm|ordinal.sql_in|PASS|Time: 2583 millisecondsTEST CASE RESULT|Module: glm|multinom.sql_in|PASS|Time: 3669 millisecondsTEST CASE RESULT|Module: glm|inverse_gaussian.sql_in|PASS|Time: 4148 millisecondsTEST CASE RESULT|Module: glm|gaussian.sql_in|PASS|Time: 4124 millisecondsTEST CASE RESULT|Module: glm|gamma.sql_in|PASS|Time: 8138 millisecondsTEST CASE RESULT|Module: glm|binomial.sql_in|PASS|Time: 6418 millisecondsTEST CASE RESULT|Module: graph|wcc.sql_in|PASS|Time: 4071 millisecondsTEST CASE RESULT|Module: graph|sssp.sql_in|PASS|Time: 7463 millisecondsTEST CASE RESULT|Module: graph|pagerank.sql_in|PASS|Time: 31027 millisecondsTEST CASE RESULT|Module: graph|measures.sql_in|PASS|Time: 3481 millisecondsTEST CASE RESULT|Module: graph|hits.sql_in|PASS|Time: 3985 millisecondsTEST CASE RESULT|Module: graph|bfs.sql_in|PASS|Time: 7882 millisecondsTEST CASE RESULT|Module: graph|apsp.sql_in|PASS|Time: 3741 millisecondsTEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.sql_in|PASS|Time: 985 millisecondsTEST CASE RESULT|Module: linear_systems|dense_linear_sytems.sql_in|PASS|Time: 955 millisecondsTEST CASE RESULT|Module: recursive_partitioning|random_forest.sql_in|PASS|Time: 24188 millisecondsTEST CASE RESULT|Module: recursive_partitioning|decision_tree.sql_in|PASS|Time: 10239 millisecondsTEST CASE RESULT|Module: regress|robust.sql_in|PASS|Time: 14052 millisecondsTEST CASE RESULT|Module: regress|multilogistic.sql_in|PASS|Time: 4775 millisecondsTEST CASE RESULT|Module: regress|marginal.sql_in|PASS|Time: 20274 millisecondsTEST CASE RESULT|Module: regress|logistic.sql_in|PASS|Time: 13970 millisecondsTEST CASE RESULT|Module: regress|linear.sql_in|PASS|Time: 1114 millisecondsTEST CASE RESULT|Module: regress|clustered.sql_in|PASS|Time: 8617 millisecondsTEST CASE RESULT|Module: sample|train_test_split.sql_in|PASS|Time: 2707 millisecondsTEST CASE RESULT|Module: sample|stratified_sample.sql_in|PASS|Time: 1659 millisecondsTEST CASE RESULT|Module: sample|sample.sql_in|PASS|Time: 332 millisecondsTEST CASE RESULT|Module: sample|balance_sample.sql_in|PASS|Time: 3749 millisecondsTEST CASE RESULT|Module: summary|summary.sql_in|PASS|Time: 2799 millisecondsTEST CASE RESULT|Module: kmeans|kmeans.sql_in|PASS|Time: 18560 millisecondsTEST CASE RESULT|Module: pca|pca_project.sql_in|PASS|Time: 28497 millisecondsTEST CASE RESULT|Module: pca|pca.sql_in|PASS|Time: 127319 millisecondsTEST CASE RESULT|Module: validation|cross_validation.sql_in|PASS|Time: 7857 milliseconds
说明: /usr/local/madlib/bin/madpack install -s SCHEMA -p 平台-c 用户@IP:PORT/DATABASE
★SCHEMA验证
postgres=# psqlpostgres=# \dn madlib*List of schemasName | Owner--------------------------------+-------madlib | demomadlib_installcheck_tsa | demo(2 rows)postgres=#
★设置search_path
postgres=# set search_path="$user",madlib,public;
★版本信息检查
postgres=# \dtList of relationsSchema | Name | Type | Owner | Storage-----------+----------------------+--------+---------+---------madlib | migrationhistory | table | demo | heap(1 row)postgres=# select * from migrationhistory;id | version | applied-----+---------+----------------------------1 | 1.14 | 2018-07-02 18:02:41.384696(1 row)
★简单使用举例
参考: https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+UsersSample Problem Using Logistic Regression**1. The sample data set and an introduction to logistic regression are described here.**The MADlib function used in this example is described in the MADlib logistic regression documentation.Suppose that we are working with doctors on a project related to heart failure. The dependent variable in the data set is whether the patient has had a second heart attack within 1 year (yes=1). We have two independent variables: one is whether the patient completed a treatment on anger control (yes=1), and the other is a score on a trait anxiety scale (higher score means more anxious).The idea is to train a model using labeled data, then use this model to predict second heart attack occurrence for other patients.**2. To interact with the data using MADlib, use the standard psql terminal provided by the database. You could also use a tool like pgAdmin.**DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;CREATE TABLE patients( id INTEGER NOT NULL,second_attack INTEGER,treatment INTEGER,trait_anxiety INTEGER);INSERT INTO patients VALUES(1, 1, 1, 70),(3, 1, 1, 50),(5, 1, 0, 40),(7, 1, 0, 75),(9, 1, 0, 70),(11, 0, 1, 65),(13, 0, 1, 45),(15, 0, 1, 40),(17, 0, 0, 55),(19, 0, 0, 50),(2, 1, 1, 80),(4, 1, 0, 60),(6, 1, 0, 65),(8, 1, 0, 80),(10, 1, 0, 60),(12, 0, 1, 50),(14, 0, 1, 35),(16, 0, 1, 50),(18, 0, 0, 45),(20, 0, 0, 60);**3. Call MADlib built-in function to train a classification model using the training data table as input:**SELECT madlib.logregr_train('patients', -- source table'patients_logregr', -- output table'second_attack', -- labels'ARRAY[1, treatment, trait_anxiety]', -- featuresNULL, -- grouping columns20, -- max number of iteration'irls' -- optimizer);**4. View the model that has just been trained:**-- Set extended display on for easier reading of output (\x is for psql only)\x onSELECT * from patients_logregr;-- ************ ---- Result ---- ************ --coef | [-6.36346994178187, -1.02410605239327, 0.119044916668606]log_likelihood | -9.41018298389std_err | [3.21389766375094, 1.17107844860319, 0.0549790458269309]z_stats | [-1.97998524145759, -0.874498248699549, 2.16527796868918]p_values | [0.0477051870698128, 0.38184697353045, 0.0303664045046168]odds_ratios | [0.0017233763092323, 0.359117354054954, 1.12642051220895]condition_no | 326.081922792num_rows_processed | 20num_missing_rows_skipped | 0num_iterations | 5variance_covariance | [[10.3291381930637, -0.47430466519573, -0.171995901260052], [-0.47430466519573, 1.37142473278285, -0.00119520703381598], [-0.171995901260052, -0.00119520703381598, 0.00302269548003977]]-- Alternatively, unnest the arrays in the results for easier reading of output (\x is for psql only)\x offSELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) as attribute,unnest(coef) as coefficient,unnest(std_err) as standard_error,unnest(z_stats) as z_stat,unnest(p_values) as pvalue,unnest(odds_ratios) as odds_ratioFROM patients_logregr;-- ************ ---- Result ---- ************ --+---------------+---------------+------------------+-----------+-----------+--------------+| attribute | coefficient | standard_error | z_stat | pvalue | odds_ratio ||---------------+---------------+------------------+-----------+-----------+--------------|| intercept | -6.36347 | 3.2139 | -1.97999 | 0.0477052 | 0.00172338 || treatment | -1.02411 | 1.17108 | -0.874498 | 0.381847 | 0.359117 || trait_anxiety | 0.119045 | 0.054979 | 2.16528 | 0.0303664 | 1.12642 |+---------------+---------------+------------------+-----------+-----------+--------------+**5. Now use the model to predict the dependent variable (second heart attack within 1 year) using the logistic regression model. For the purpose of demonstration, we will use the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used for prediction.**-- Display prediction value along with the original valueSELECT p.id, madlib.logregr_predict(coef, ARRAY[1, treatment, trait_anxiety]),p.second_attackFROM patients p, patients_logregr mORDER BY p.id;-- ************ ---- Result ---- ************ --+------+-------------------+-----------------+| id | logregr_predict | second_attack ||------+-------------------+-----------------|| 1 | True | 1 || 2 | True | 1 || 3 | False | 1 || 4 | True | 1 || 5 | False | 1 || 6 | True | 1 || 7 | True | 1 || 8 | True | 1 || 9 | True | 1 || 10 | True | 1 || 11 | True | 0 || 12 | False | 0 || 13 | False | 0 || 14 | False | 0 || 15 | False | 0 || 16 | False | 0 || 17 | True | 0 || 18 | False | 0 || 19 | False | 0 || 20 | True | 0 |+------+-------------------+-----------------+-- Predicting the probability of the dependent variable being TRUE.-- Display prediction value along with the original valueSELECT p.id, madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])FROM patients p, patients_logregr mORDER BY p.id;-- ************ ---- Result ---- ************ --+------+------------------------+| id | logregr_predict_prob ||------+------------------------|| 1 | 0.720223 || 2 | 0.894355 || 3 | 0.19227 || 4 | 0.685513 || 5 | 0.167748 || 6 | 0.798098 || 7 | 0.928568 || 8 | 0.959306 || 9 | 0.877576 || 10 | 0.685513 || 11 | 0.586701 || 12 | 0.19227 || 13 | 0.116032 || 14 | 0.0383829 || 15 | 0.0674976 || 16 | 0.19227 || 17 | 0.545871 || 18 | 0.267675 || 19 | 0.398619 || 20 | 0.685513 |+------+------------------------+The 1 entry in the ARRAY denotes an additional bias term in the model in the standard way, to allow for a non-zero intercept value.If the probability is greater than 0.5, the prediction is given as True. Otherwise it is given as False.Next Steps• For details on all of the machine learning functions provided by MADlib, please refer to User Documentation• Try out the available Jupyter notebooks for many MADlib algorithms• To contribute new modules to MADlib, please refer to the Quick Start Guide for Developers
本站文章,未经作者同意,请勿转载,如需转载,请邮件customer@csudata.com.
                
                        0 评论  
                    
                    
                
                        添加一条新评论