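pyml cluster notes (2024-08)¶
Status: 📝 Draft
Created: 2024-08-01 Last updated: 2026-02-28
References¶
- Reference for property image recognition quality: https://restb.ai/solutions/property-condition/
- Use htop instead of top; use nvtop to view NVIDIA GPU information
- Internal property preview: https://realmaster.com/1.5/prop/detail/inapp?id=EDME4021039 (requires a cookie with name=apsv, value=appDebug; give the cookie a long expiry)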
Git operations¶
Dedicated git branch¶
# Initial clone
git clone -b feature/cluster_20240728 --single-branch git@bitbucket.org:fredxiang/pyml.git
# Update
git pull origin feature/cluster_20240728
# Force update (discards local commits and uncommitted changes)
git fetch origin && git reset --hard origin/feature/cluster_20240728
# Force update (GitHub)
git fetch origin && git reset --hard remotes/origin/dev_20241105
Server information¶
MongoDB operations¶
ca12: connecting to the two databases¶
mongosh "mongodb://rniRead:A6Nsr058J1yF@ca8:27037/rni?authSource=admin&tls=true&tlsCAFile=%2Fetc%2FmongoRni%2Fca.crt&tlsCertificateKeyFile=%2Fetc%2FmongoRni%2Fserver.pem"
mongosh "mongodb://d1:d1@ca12:27017/listing?authSource=admin&ssl=true&tlsInsecure=true"
Other queries¶
// Check the statistics results
db.ml_stats_sp.find({ model: { $in: ["sp_mn_result", "sp_md_result"] } }).sort({ _id: -1 }).limit(12)
// Trigger trans_n by bumping mt forward by 1 ms
db.getSiblingDB('tmp').properties.updateOne(
{ "_id": "TRBN12229940" },
[{ "$set": { "mt": { "$dateAdd": { "startDate": "$mt", "unit": "millisecond", "amount": 1 } } } }]
);
// Update the trans or predict resume token (here: copy the resume_trans_n token into resume_predict_n)
db.ml_sysdata.updateOne(
{ _id: 'resume_predict_n' },
{ $set: { 'data.resume_token': db.ml_sysdata.findOne({ _id: 'resume_trans_n' }).data.resume_token } }
);
Log in to cab6:
mongosh "mongodb://cab6user:PvUZy5kwqrfN@127.0.0.1:17017/listing?authSource=admin" \
--tls \
--tlsCAFile /home/pengxin/etc/mongodb/cert/ca.pem \
--tlsCertificateKeyFile /home/pengxin/etc/mongodb/cert/client.pem
Exporting from PRE to the TEST environment¶
Main collection: properties¶
PRE export:
mongodump -c properties -d tmp -h localhost --port 27017 \
-u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
-q='{"prov":"ON","onD":{"$gte":20240201},"ptype":"r"}'
mongodump -c properties -d tmp -h localhost --port 27017 \
-u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
-q='{"prov":"ON","city":"Toronto","onD":{"$gte":20240201},"ptype":"r"}'
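A malformed `-q` filter only fails once mongodump starts, so it can be worth checking the string first. The sketch below is a hypothetical helper (`checkQueryFilter` is not part of pyml or of the MongoDB tools) that only verifies the filter parses as a single JSON object; it does not validate field names or Extended JSON types.

```javascript
// Hypothetical pre-flight check for a mongodump -q string.
// It only confirms that the text is one complete JSON object.
function checkQueryFilter(text) {
  try {
    const parsed = JSON.parse(text);
    const isObject =
      parsed !== null && typeof parsed === "object" && !Array.isArray(parsed);
    return isObject
      ? { ok: true, keys: Object.keys(parsed) }
      : { ok: false, error: "filter must be a JSON object" };
  } catch (err) {
    // A stray closing brace, for example, surfaces here as a SyntaxError.
    return { ok: false, error: err.message };
  }
}

// Well-formed filter: all conditions live inside one object.
const goodFilter = '{"prov":"ON","onD":{"$gte":20240201},"ptype":"r"}';
// Malformed filter: the object is closed right after "prov".
const badFilter = '{"prov":"ON"},"onD":{"$gte":20240201},"ptype":"r"}';

console.log(checkQueryFilter(goodFilter).ok); // true
console.log(checkQueryFilter(badFilter).ok);  // false
```

Running this in node or mongosh before a long export catches brace mistakes early.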
TEST import:
scpdb tmp
cd ~/db/tmp && gunzip -kf properties.metadata.json.gz && gunzip -kf properties.bson.gz
mongorestore "mongodb://r9:r9@100.70.132.131:17017/listing?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
--db listing --collection properties --drop properties.bson
mongorestore "mongodb://cab6user:PvUZy5kwqrfN@127.0.0.1:17017/tmp?authSource=admin" \
--db tmp --collection properties --drop properties.bson
Other rni collections¶
PRE export (ml_props_cleaned collection):
mongodump -c ml_props_cleaned -d rni -h localhost --port 27017 \
-u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
-q='{"prov":"ON","onD":{"$gte":20240201}}'
mongodump -c ml_props_cleaned -d rni -h localhost --port 27017 \
-u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip \
-q='{"prov":"ON","city":"ON:Toronto","onD":{"$gte":20240201}}'
mongodump -c ml_sp_history -d rni -h localhost --port 27017 \
-u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip
PRE export (all collections):
for collection in ml_fs.chunks ml_fs.files ml_props_cleaned ml_sysdata; do
  mongodump -c "$collection" -d rni -h localhost --port 27017 \
    -u cab4user -p PvUZy5kwqrfN --authenticationDatabase admin --gzip
done
TEST import (ml_props_cleaned collection):
scpdb rni
cd ~/db/rni && gunzip -kf ml_props_cleaned.metadata.json.gz && gunzip -kf ml_props_cleaned.bson.gz
mongorestore "mongodb://r9:r9@100.70.132.131:17017/rni?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
--db rni --collection ml_props_cleaned --drop ml_props_cleaned.bson
mongorestore "mongodb://cab6user:PvUZy5kwqrfN@127.0.0.1:17017/rni?authSource=admin" \
--db rni --collection ml_props_cleaned --drop ml_props_cleaned.bson
mongorestore "mongodb://r9:r9@100.70.132.131:17017/rni?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
--db rni --collection ml_sp_history --drop ml_sp_history.bson
TEST import (all collections):
cd ~/db/rni && gunzip -kf *.gz
for collection in ml_fs.chunks ml_fs.files ml_props_cleaned ml_sysdata; do
  mongorestore "mongodb://r9:r9@100.70.132.131:17017/rni?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
    --db rni --collection "$collection" --drop "$collection.bson"
done
Exporting MongoDB data for selected cities¶
# Export a subset
mongodump -c properties -d listing -h ca12 --port 27017 \
-u d1 -p d1 --authenticationDatabase admin --gzip \
--ssl --tlsInsecure --sslCAFile '/etc/mongo/ca.crt' \
-q='{"city":{"$in":["Peel","Richmond"]},"onD":{"$gte":20240601}, "ptype": "r"}'
# Export everything
mongodump -c properties -d listing -h ca12 --port 27017 \
-u d1 -p d1 --authenticationDatabase admin --gzip \
--ssl --tlsInsecure --sslCAFile '/etc/mongo/ca.crt' \
-q='{"prov":{"$in":["BC","ON","AB"]},"onD":{"$gte":20230201}, "ptype": "r"}'
# Copy the data to the GPU server
scprlt /mnt/mlwork/pengxin /opt/pengxin/dump/listing
# Decompress
gunzip properties.bson.gz && gunzip properties.metadata.json.gz
# Import a subset
mongorestore --host 127.0.0.1 --port 27017 \
--db listing --collection properties --drop \
~/Downloads/listing/properties.bson
# Import everything
mongorestore "mongodb://r9:r9@100.70.132.131:17017/listing?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/mnt/mlwork/pengxin/etc/mongodb/cert/ca.pem&tlsCertificateKeyFile=/mnt/mlwork/pengxin/etc/mongodb/cert/client.pem" \
--db listing --collection properties --drop properties.bson
# Import from the local machine to the remote server
mongorestore "mongodb://r9:r9@100.70.132.131:17017/pyml?authSource=admin&authMechanism=SCRAM-SHA-1&tls=true&tlsCAFile=/Users/pengxin/data/real/pyml/cert/ca.pem&tlsCertificateKeyFile=/Users/pengxin/data/real/pyml/cert/client.pem" \
--db pyml --collection ml2_fs.files --drop \
/Users/pengxin/Downloads/print/ml2_fs.files.bson
Marking the last week's data as unsold¶
// Result of the update: 5254 documents modified
db.properties.updateMany(
{ 'ptype': 'r', 'status': 'U', 'prov': { '$in': ['BC', 'ON'] }, 'offD': { '$gt': 20240810 }, 'sp': { '$exists': true } },
[
{ $set: { status: "A", sp_m0: "$sp", offD_m0: "$offD", 'lst': 'Pnd' } },
{ $unset: ["sp", "offD"] }
]
);
// Most recent listing
db.properties.find({ 'city': 'Toronto' }, { 'onD': 1 }).sort({ onD: -1 }).limit(1)
// Count matching documents
db.properties.countDocuments({ 'ptype': 'r', 'status': 'A', 'prov': { '$in': ['BC', 'ON'] }, 'sp_m0': { '$exists': true } })
PyCharm query filter and projection:
{ 'ptype': 'r', 'status': 'A', 'prov': { '$in': ['BC', 'ON'] }, 'sp_m0': { '$exists': true } },
{ 'sp_m11': 1, 'sp_m12': 1, 'sp_m13': 1, 'sp_m19': 1, 'lp': 1, 'sp': 1, 'offD': 1, 'sp_m0': 1 }
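To make the effect of the updateMany pipeline in this section easier to reason about, here is a minimal plain-JavaScript model of what it does to one document. `matchesFilter` and `markUnsold` are hypothetical names used only for illustration; the real work is done server-side by the `$set` and `$unset` stages, and the sample document is made up.

```javascript
// Mirrors the updateMany filter: sold residential listings in BC/ON
// with offD after 20240810 and an sp field present.
function matchesFilter(doc) {
  return (
    doc.ptype === "r" &&
    doc.status === "U" &&
    ["BC", "ON"].includes(doc.prov) &&
    doc.offD > 20240810 &&
    "sp" in doc
  );
}

// Mirrors the two stages: $set copies sp/offD into sp_m0/offD_m0 and flips
// status and lst, then $unset removes the original sp and offD fields.
function markUnsold(doc) {
  const { sp, offD, ...rest } = doc;
  return { ...rest, status: "A", sp_m0: sp, offD_m0: offD, lst: "Pnd" };
}

// Made-up sample document, not real data.
const sample = { _id: "EXAMPLE1", ptype: "r", status: "U", prov: "ON", offD: 20240812, sp: 950000 };
if (matchesFilter(sample)) {
  console.log(markUnsold(sample));
}
```

Keeping the original values in `sp_m0` and `offD_m0` is what allows predictions to be compared against the real sale price afterwards.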
Resetting prediction results¶
db.properties.updateMany({}, { $unset: {
sp_m1: "", sp_m1_acu: "", sp_m2: "", sp_m2_acu: "",
sp_m3: "", sp_m3_acu: "", sp_m4: "", sp_m4_acu: "",
sp_m9: "", sp_m9_acu: ""
}})
db.properties.updateMany({}, { $unset: {
sp_m11: "", sp_m11_acu: "", sp_m12: "", sp_m12_acu: "",
sp_m13: "", sp_m13_acu: "", sp_m14: "", sp_m14_acu: "",
sp_m19: "", sp_m19_acu: ""
}})
db.properties.updateMany({}, { $unset: {
sp_m1: "", sp_m1_acu: "", sp_m2: "", sp_m2_acu: "",
sp_m3: "", sp_m3_acu: "", sp_m4: "", sp_m4_acu: "",
sp_m9: "", sp_m9_acu: "",
sp_m11: "", sp_m11_acu: "", sp_m12: "", sp_m12_acu: "",
sp_m13: "", sp_m13_acu: "", sp_m14: "", sp_m14_acu: "",
sp_m19: "", sp_m19_acu: "",
sp_m6: "", sp_m6_acu: "", sp_m10: "",
sp_m15: "", sp_m16: "", sp_m17: "", sp_m18: ""
}})
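The field lists above are easy to get out of sync when models are added. As a sketch, the `$unset` document could be generated instead of typed by hand; `buildUnset` is a hypothetical helper, and the model ids below are simply the ones that appear in the last command of this section.

```javascript
// Build an $unset document for prediction fields.
// idsWithAcu:  models that have both sp_m<N> and sp_m<N>_acu
// idsPlain:    models that only have sp_m<N>
function buildUnset(idsWithAcu, idsPlain = []) {
  const unset = {};
  for (const id of idsWithAcu) {
    unset[`sp_m${id}`] = "";
    unset[`sp_m${id}_acu`] = "";
  }
  for (const id of idsPlain) {
    unset[`sp_m${id}`] = "";
  }
  return unset;
}

const unsetAll = buildUnset(
  [1, 2, 3, 4, 9, 11, 12, 13, 14, 19, 6],
  [10, 15, 16, 17, 18]
);
console.log(Object.keys(unsetAll).length); // 27 fields

// In mongosh this could then be applied as:
// db.properties.updateMany({}, { $unset: unsetAll })
```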
Finding listings with a large gap between the estimate and the sold price¶
db.ml_props_cleaned.aggregate([
  {
    $match: {
      prov: "ON",
      sp: { $gt: 0 },
      spcts: { $gte: ISODate("2025-12-20T00:00:00Z") },
      status: "U",
      "saletp-b": "S"
    }
  },
  {
    $project: {
      _id: 1, offD: 1, spcts: 1, sp: 1, sp_md_result: 1,
      sp_ratio: { $divide: ["$sp_md_result", "$sp"] }
    }
  },
  {
    $match: {
      $or: [{ sp_ratio: { $gt: 1.2 } }, { sp_ratio: { $lt: 0.8 } }]
    }
  }
]).forEach(doc => {
  print(`_id:${doc._id}, offD:${doc.offD}, spcts:${doc.spcts}, sp:${doc.sp}, sp_md_result:${doc.sp_md_result}, sp_ratio:${doc.sp_ratio}`);
});
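The selection rule in the pipeline above (predicted price more than 20% above or below the sold price) can be sketched as a small standalone function, which is handy for spot-checking a single listing. `spRatio` and `isOutlier` are hypothetical names, and treating a missing prediction as "not an outlier" is an assumption of this sketch.

```javascript
// Ratio of predicted price to sold price; null when it cannot be computed.
function spRatio(sp, predicted) {
  if (typeof sp !== "number" || typeof predicted !== "number" || !(sp > 0)) {
    return null; // same guard as the pipeline's sp > 0 match
  }
  return predicted / sp;
}

// True when the ratio falls outside the (low, high) band used above.
function isOutlier(sp, predicted, low = 0.8, high = 1.2) {
  const ratio = spRatio(sp, predicted);
  if (ratio === null) return false; // assumption: skip listings without a prediction
  return ratio > high || ratio < low;
}

console.log(isOutlier(1000000, 1300000)); // true: predicted 30% above sold
console.log(isOutlier(1000000, 1100000)); // false: within the band
```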
ES (Elasticsearch) tips¶
Debugging: pass the params into the template and print only the compiled query¶
{
  "id": "knn_search_template",
  "params": {
    "city_c": 1024,
    "distance": 2000000,
    "exclude_id": "DDF2749832100006",
    "lat": 43.6638422,
    "lon": -81.0236597,
    "ptype2_l_c": 3,
    "query_vector": [0.1, 0.1, 0.0, 1.0, 0.2, 0.8],
    "type_is_sold": true,
    "type_is_unsold": false
  }
}
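A body like the one above is what Elasticsearch's render search template endpoint (`POST _render/template`) accepts: it substitutes the params and returns the compiled query without executing it. The sketch below only builds the request; `buildRenderRequest` is a hypothetical helper and `http://localhost:9200` is an assumed address, so adjust host and authentication for the real cluster.

```javascript
// Build the URL and fetch options for a render-template call.
function buildRenderRequest(baseUrl, templateId, params) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/_render/template`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ id: templateId, params }),
    },
  };
}

// Assumed host; the params are a shortened version of the body above.
const renderReq = buildRenderRequest("http://localhost:9200", "knn_search_template", {
  city_c: 1024,
  distance: 2000000,
  type_is_sold: true,
  type_is_unsold: false,
});
console.log(renderReq.url);

// To actually send it (Node 18+ has a global fetch):
// fetch(renderReq.url, renderReq.options).then(r => r.json()).then(console.log);
```

The response carries the rendered query under `template_output`, which can be pasted into a normal `_search` call for further debugging.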