
pyml cluster notes (2024-08)

Status: 📝 Draft

Created: 2024-08-01 Last updated: 2026-02-28


References


Git operations

Single-branch clone

# Initialize
git clone -b feature/cluster_20240728 --single-branch git@bitbucket.org:fredxiang/pyml.git

# Update
git pull origin feature/cluster_20240728

# Force update (discards local changes)
git fetch origin && git reset --hard origin/feature/cluster_20240728

# Force update (GitHub)
git fetch origin && git reset --hard remotes/origin/dev_20241105

Server information

cab4~6  root  pass2X67gcU8OB4C
XIN-    gcU8OB2X67gcU

MongoDB operations

Connecting to the two databases from ca12

mongosh "mongodb://rniRead:A6Nsr058J1yF@ca8:27037/rni?authSource=admin&tls=true&tlsCAFile=%2Fetc%2FmongoRni%2Fca.crt&tlsCertificateKeyFile=%2Fetc%2FmongoRni%2Fserver.pem"

mongosh "mongodb://d1:d1@ca12:27017/listing?authSource=admin&ssl=true&tlsInsecure=true"

Other query operations

// Check the statistics results
db.ml_stats_sp.find({ model: { $in: ["sp_mn_result", "sp_md_result"] } }).sort({ _id: -1 }).limit(12)

// Trigger trans_n by bumping mt forward by 1 ms
db.getSiblingDB('tmp').properties.updateOne(
  { "_id": "TRBN12229940" },
  [{ "$set": { "mt": { "$dateAdd": { "startDate": "$mt", "unit": "millisecond", "amount": 1 } } } }]
);

// Update the trans or predict resume token (here: copy the trans_n token to predict_n)
db.ml_sysdata.updateOne(
  { _id: 'resume_predict_n' },
  { $set: { 'data.resume_token': db.ml_sysdata.findOne({ _id: 'resume_trans_n' }).data.resume_token } }
);

Log in to cab6:

mongosh "mongodb://cab6user:PvUZy5kwqrfN@127.0.0.1:17017/listing?authSource=admin" \
  --tls \
  --tlsCAFile /home/pengxin/etc/mongodb/cert/ca.pem \
  --tlsCertificateKeyFile /home/pengxin/etc/mongodb/cert/client.pem

Exporting from PRE to the TEST environment

Main collection: properties

PRE export:

mongodump -c properties -d tmp -h localhost --port 27017 \
  -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
  -q='{"prov":"ON","onD":{"$gte":20240201},"ptype":"r"}'

mongodump -c properties -d tmp -h localhost --port 27017 \
  -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
  -q='{"prov":"ON","city":"Toronto","onD":{"$gte":20240201},"ptype":"r"}'
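The `-q` value has to be one valid JSON object, and a stray brace is easy to introduce when typing it by hand. A minimal sketch of generating the filter instead; `buildDumpQuery` and its option names are hypothetical, not part of pyml or the MongoDB tools:

```javascript
// Build the JSON string passed to `mongodump -q`.
// `buildDumpQuery` and its option names are hypothetical helpers for these notes.
function buildDumpQuery({ prov, city, sinceOnD, ptype }) {
  const filter = {};
  if (prov !== undefined) filter.prov = prov;
  if (city !== undefined) filter.city = city;
  if (sinceOnD !== undefined) filter.onD = { $gte: sinceOnD };
  if (ptype !== undefined) filter.ptype = ptype;
  // JSON.stringify always emits balanced braces, unlike a hand-typed string.
  return JSON.stringify(filter);
}

const query = buildDumpQuery({ prov: "ON", city: "Toronto", sinceOnD: 20240201, ptype: "r" });
console.log(`-q='${query}'`);
```

The printed flag can be pasted onto the end of the mongodump command line.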

TEST import:

scpdb tmp

cd ~/db/tmp && gunzip -kf properties.metadata.json.gz && gunzip -kf properties.bson.gz

mongorestore "mongodb://r9:r9@100.70.132.131:17017/listing?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
  --db listing --collection properties --drop properties.bson

mongorestore "mongodb://cab6user:PvUZy5kwqrfN@127.0.0.1:17017/tmp?authSource=admin" \
  --db tmp --collection properties --drop properties.bson

Other rni collections

PRE export (ml_props_cleaned collection):

mongodump -c ml_props_cleaned -d rni -h localhost --port 27017 \
  -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
  -q='{"prov":"ON","onD":{"$gte":20240201}}'

mongodump -c ml_props_cleaned -d rni -h localhost --port 27017 \
  -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
  -q='{"prov":"ON","city":"ON:Toronto","onD":{"$gte":20240201}}'

mongodump -c ml_sp_history -d rni -h localhost --port 27017 \
  -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip

PRE export (all collections):

for collection in ml_fs.chunks ml_fs.files ml_props_cleaned ml_sysdata; do
  mongodump -c "$collection" -d rni -h localhost --port 27017 \
    -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip
done

TEST import (ml_props_cleaned collection):

scpdb rni

cd ~/db/rni && gunzip -kf ml_props_cleaned.metadata.json.gz && gunzip -kf ml_props_cleaned.bson.gz

mongorestore "mongodb://r9:r9@100.70.132.131:17017/rni?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
  --db rni --collection ml_props_cleaned --drop ml_props_cleaned.bson

mongorestore "mongodb://cab6user:PvUZy5kwqrfN@127.0.0.1:17017/rni?authSource=admin" \
  --db rni --collection ml_props_cleaned --drop ml_props_cleaned.bson

mongorestore "mongodb://r9:r9@100.70.132.131:17017/rni?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
  --db rni --collection ml_sp_history --drop ml_sp_history.bson

TEST import (all collections):

cd ~/db/rni && gunzip -kf *.gz

for collection in ml_fs.chunks ml_fs.files ml_props_cleaned ml_sysdata; do
  mongorestore "mongodb://r9:r9@100.70.132.131:17017/rni?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
    --db rni --collection "$collection" --drop "$collection".bson
done

Exporting MongoDB data for selected cities

# Export a subset
mongodump -c properties -d listing -h ca12 --port 27017 \
  -u d1 -p d1 --authenticationDatabase admin --gzip \
  --ssl --tlsInsecure --sslCAFile '/etc/mongo/ca.crt' \
  -q='{"city":{"$in":["Peel","Richmond"]},"onD":{"$gte":20240601}, "ptype": "r"}'

# Export everything
mongodump -c properties -d listing -h ca12 --port 27017 \
  -u d1 -p d1 --authenticationDatabase admin --gzip \
  --ssl --tlsInsecure --sslCAFile '/etc/mongo/ca.crt' \
  -q='{"prov":{"$in":["BC","ON","AB"]},"onD":{"$gte":20230201}, "ptype": "r"}'

# Transfer the data to the GPU server
scprlt /mnt/mlwork/pengxin /opt/pengxin/dump/listing

# Decompress
gunzip properties.bson.gz && gunzip properties.metadata.json.gz

# Import the subset
mongorestore --host 127.0.0.1 --port 27017 \
  --db listing --collection properties --drop \
  ~/Downloads/listing/properties.bson

# Import everything
mongorestore "mongodb://r9:r9@100.70.132.131:17017/listing?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
  --db listing --collection properties --drop properties.bson

# Import from the local machine to the remote server
mongorestore "mongodb://r9:r9@100.70.132.131:17017/pyml?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/Users/pengxin/data/real/pyml/cert/ca.pem&tlsCertificateKeyFile=/Users/pengxin/data/real/pyml/cert/client.pem" \
  --db pyml --collection ml2_fs.files --drop \
  /Users/pengxin/Downloads/print/ml2_fs.files.bson

Marking the last week's data as unsold

// Result: 5254 documents updated
db.properties.updateMany(
  { 'ptype': 'r', 'status': 'U', 'prov': { '$in': ['BC', 'ON'] }, 'offD': { '$gt': 20240810 }, 'sp': { '$exists': true } },
  [
    { $set: { status: "A", sp_m0: "$sp", offD_m0: "$offD", 'lst': 'Pnd' } },
    { $unset: ["sp", "offD"] }
  ]
);
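To make the effect of the two pipeline stages concrete, here is a small local simulation on a plain object. It does not touch MongoDB; `markAsUnsold` and the sample document are made up for illustration:

```javascript
// Local simulation of the two pipeline stages; does not touch MongoDB.
// `markAsUnsold` is a hypothetical helper, not part of pyml.
function markAsUnsold(doc) {
  // Stage 1 ($set): back up sp/offD into the *_m0 fields and flip status/lst.
  const updated = { ...doc, status: "A", sp_m0: doc.sp, offD_m0: doc.offD, lst: "Pnd" };
  // Stage 2 ($unset): drop the original sp and offD fields.
  delete updated.sp;
  delete updated.offD;
  return updated;
}

// Made-up sample document for illustration only.
const sold = { _id: "EXAMPLE1", ptype: "r", status: "U", prov: "ON", sp: 750000, offD: 20240815 };
console.log(markAsUnsold(sold));
```

Because the original values survive in `sp_m0` and `offD_m0`, the change can in principle be reversed by copying them back.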

// Most recent record
db.properties.find({ 'city': 'Toronto' }, { 'onD': 1 }).sort({ onD: -1 }).limit(1)

// Count documents
db.properties.countDocuments({ 'ptype': 'r', 'status': 'A', 'prov': { '$in': ['BC', 'ON'] }, 'sp_m0': { '$exists': true } })

PyCharm query filter and projection:

{ 'ptype': 'r', 'status': 'A', 'prov': { '$in': ['BC', 'ON'] }, 'sp_m0': { '$exists': true } },
{ 'sp_m11': 1, 'sp_m12': 1, 'sp_m13': 1, 'sp_m19': 1, 'lp': 1, 'sp': 1, 'offD': 1, 'sp_m0': 1 }

Resetting prediction results

db.properties.updateMany({}, { $unset: {
  sp_m1: "", sp_m1_acu: "", sp_m2: "", sp_m2_acu: "",
  sp_m3: "", sp_m3_acu: "", sp_m4: "", sp_m4_acu: "",
  sp_m9: "", sp_m9_acu: ""
}})

db.properties.updateMany({}, { $unset: {
  sp_m11: "", sp_m11_acu: "", sp_m12: "", sp_m12_acu: "",
  sp_m13: "", sp_m13_acu: "", sp_m14: "", sp_m14_acu: "",
  sp_m19: "", sp_m19_acu: ""
}})

db.properties.updateMany({}, { $unset: {
  sp_m1: "", sp_m1_acu: "", sp_m2: "", sp_m2_acu: "",
  sp_m3: "", sp_m3_acu: "", sp_m4: "", sp_m4_acu: "",
  sp_m9: "", sp_m9_acu: "",
  sp_m11: "", sp_m11_acu: "", sp_m12: "", sp_m12_acu: "",
  sp_m13: "", sp_m13_acu: "", sp_m14: "", sp_m14_acu: "",
  sp_m19: "", sp_m19_acu: "",
  sp_m6: "", sp_m6_acu: "", sp_m10: "",
  sp_m15: "", sp_m16: "", sp_m17: "", sp_m18: ""
}})
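The field lists above follow a pattern (`sp_m<N>`, optionally paired with `sp_m<N>_acu`), so the `$unset` document could be generated rather than typed out. A sketch under that assumption; `buildUnset` is a hypothetical helper:

```javascript
// Hypothetical helper: build a $unset document for the given model numbers.
function buildUnset(modelsWithAcu, modelsWithoutAcu = []) {
  const unset = {};
  for (const m of modelsWithAcu) {
    unset[`sp_m${m}`] = "";
    unset[`sp_m${m}_acu`] = "";
  }
  for (const m of modelsWithoutAcu) {
    unset[`sp_m${m}`] = "";
  }
  return unset;
}

// Same field set as the full reset: 11 models with an _acu field, 5 without.
const fullReset = buildUnset([1, 2, 3, 4, 9, 11, 12, 13, 14, 19, 6], [10, 15, 16, 17, 18]);
console.log(Object.keys(fullReset).length); // 27
```

In mongosh the result could then be passed as `db.properties.updateMany({}, { $unset: fullReset })`.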

Finding listings whose estimate deviates strongly from the sold price

db.ml_props_cleaned.aggregate([
  {
    $match: {
      prov: "ON",
      sp: { $gt: 0 },
      spcts: { $gte: ISODate("2025-12-20T00:00:00Z") },
      status: "U",
      "saletp-b": "S"
    }
  },
  {
    $project: {
      _id: 1, offD: 1, spcts: 1, sp: 1, sp_md_result: 1,
      sp_ratio: { $divide: ["$sp_md_result", "$sp"] }
    }
  },
  {
    $match: {
      $or: [{ sp_ratio: { $gt: 1.2 } }, { sp_ratio: { $lt: 0.8 } }]
    }
  }
]).forEach(doc => {
  print(`_id:${doc._id}, offD:${doc.offD}, spcts:${doc.spcts}, sp:${doc.sp}, sp_md_result:${doc.sp_md_result}, sp_ratio:${doc.sp_ratio}`);
});
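The ratio filter (estimate divided by sold price, flagged outside 0.8 to 1.2) can be checked on plain objects before running the aggregation. A minimal sketch; `findOutliers`, the numeric type guard, and the sample values are assumptions for illustration:

```javascript
// Hypothetical helper mirroring the $project + $match stages on plain objects.
function findOutliers(docs, low = 0.8, high = 1.2) {
  return docs
    // Skip documents without a usable sold price or estimate (an assumption of this sketch).
    .filter((d) => d.sp > 0 && typeof d.sp_md_result === "number")
    .map((d) => ({ ...d, sp_ratio: d.sp_md_result / d.sp }))
    .filter((d) => d.sp_ratio > high || d.sp_ratio < low);
}

// Made-up sample values.
const sample = [
  { _id: "A", sp: 1000000, sp_md_result: 1250000 }, // ratio 1.25: flagged
  { _id: "B", sp: 1000000, sp_md_result: 950000 },  // ratio 0.95: not flagged
  { _id: "C", sp: 1000000, sp_md_result: 700000 },  // ratio 0.70: flagged
];
console.log(findOutliers(sample).map((d) => d._id));
```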

ES (Elasticsearch) tips

Debugging: substitute the parameters into the template and print only the rendered query

POST /_render/template
{
  "id": "knn_search_template",
  "params": {
    "city_c": 1024,
    "distance": 2000000,
    "exclude_id": "DDF2749832100006",
    "lat": 43.6638422,
    "lon": -81.0236597,
    "ptype2_l_c": 3,
    "query_vector": [0.1, 0.1, 0.0, 1.0, 0.2, 0.8],
    "type_is_sold": true,
    "type_is_unsold": false
  }
}
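The same render call could be issued from a script. The sketch below assumes Node 18+ (built-in `fetch`) and an unauthenticated cluster; the helper names and the `baseUrl` value are hypothetical:

```javascript
// Hypothetical helper: describe the render-template call without sending it.
function buildRenderRequest(templateId, params) {
  return {
    method: "POST",
    path: "/_render/template",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ id: templateId, params }),
  };
}

// Sends the request; baseUrl (for example "http://localhost:9200") is an assumption.
async function renderTemplate(baseUrl, templateId, params) {
  const request = buildRenderRequest(templateId, params);
  const res = await fetch(baseUrl + request.path, {
    method: request.method,
    headers: request.headers,
    body: request.body,
  });
  if (!res.ok) throw new Error(`render failed: HTTP ${res.status}`);
  return res.json(); // the rendered query is returned under `template_output`
}

const example = buildRenderRequest("knn_search_template", { city_c: 1024, type_is_sold: true });
console.log(example.body);
```

Only `buildRenderRequest` runs here; `renderTemplate` needs a reachable cluster and, in most deployments, credentials.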

Other