# crawler

## Releaser page crawler

  1. Deployed on BJ-GM-Prod-Cos-faiss001 under /srv/apps/; scheduled via crontab -e (see the sketch after this list)
  2. sudo su - gmuser
  3. workon litao
  4. Fetch task: nohup python /srv/apps/crawler/crawler_sys/framework/update_data_in_target_releasers_multi_process_by_date_from_redis.py > /data/log/fect_task.log &
  5. Write the URLs to crawl into Redis: python /srv/apps/crawler/crawler_sys/framework/write_releasers_to_redis.py -p weibo -d 1 -proxies 2
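
The README gives the commands but not the actual crontab schedule. A minimal sketch of what the entry might look like, assuming the Redis URL-writing step (step 5) runs daily; the 01:00 schedule, the interpreter path for the `litao` virtualenv, and the log file name are assumptions, not taken from this repository:

```bash
# Hypothetical crontab entry (installed with `crontab -e` as gmuser).
# Schedule, venv python path, and log name are assumptions.
0 1 * * * /srv/envs/litao/bin/python /srv/apps/crawler/crawler_sys/framework/write_releasers_to_redis.py -p weibo -d 1 -proxies 2 >> /data/log/write_releasers.log 2>&1
```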

## Search page crawler

python /srv/apps/crawler/crawler_sys/framework/search_page_single_process.py
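
Like the releaser-page fetch task above, this presumably runs in the background; a sketch of such an invocation, where the nohup wrapping and the log path are assumptions rather than something this README states:

```bash
# Assumed background invocation, mirroring step 4 of the releaser crawler;
# the log file name /data/log/search_page.log is an assumption.
nohup python /srv/apps/crawler/crawler_sys/framework/search_page_single_process.py > /data/log/search_page.log 2>&1 &
```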

## Weekly data report

Server: airflow002

  1. Switch to the service user: sudo su - gmuser
  2. source /srv/envs/esmm/bin/activate
  3. python crawler/crawler_sys/utils/get_query_result.py
  4. Submit the report job (reflowed as a script below): /opt/spark/bin/spark-submit --master yarn --deploy-mode client --queue root.strategy --driver-memory 16g --executor-memory 1g --executor-cores 1 --num-executors 70 --conf spark.default.parallelism=100 --conf spark.storage.memoryFraction=0.5 --conf spark.shuffle.memoryFraction=0.3 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/java/jdk1.8.0_181/jre/lib/amd64/server:/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib64" --conf spark.locality.wait=0 --jars /srv/apps/tispark-core-2.1-SNAPSHOT-jar-with-dependencies.jar,/srv/apps/spark-connector_2.11-1.9.0-rc2.jar,/srv/apps/mysql-connector-java-5.1.38.jar /srv/apps/crawler/tasks/crawler_week_report.py
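
The spark-submit invocation in step 4 is hard to read on one line. A hedged wrapper sketch that chains steps 2-4; the script itself (its name and `set` flags) is an assumption, while the commands are verbatim from the list above:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for the weekly data report on airflow002.
set -euo pipefail

# Step 2: activate the esmm virtualenv.
source /srv/envs/esmm/bin/activate

# Step 3: pull the query results.
python crawler/crawler_sys/utils/get_query_result.py

# Step 4: submit the report job to YARN.
/opt/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue root.strategy \
  --driver-memory 16g \
  --executor-memory 1g \
  --executor-cores 1 \
  --num-executors 70 \
  --conf spark.default.parallelism=100 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \
  --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/java/jdk1.8.0_181/jre/lib/amd64/server:/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib64" \
  --conf spark.locality.wait=0 \
  --jars /srv/apps/tispark-core-2.1-SNAPSHOT-jar-with-dependencies.jar,/srv/apps/spark-connector_2.11-1.9.0-rc2.jar,/srv/apps/mysql-connector-java-5.1.38.jar \
  /srv/apps/crawler/tasks/crawler_week_report.py
```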