[2018-11-29]
1 crawler/crawler_sys/framework/scrap_list_page_async.py
1.1 Move lst_page_conf.ini to crawler/crawler_sys/config/sites/list_page_urls.ini;
1.2 In list_page_urls.ini, each site gets its own [section] header; every site name must match the one in crawler/crawler_sys/framework/platform_crawler_register.py;
1.3 Change args.platform to default='' (currently '腾讯视频', i.e. Tencent Video); when parsing args, exit immediately if platform == '';
1.4 If args.platform is not empty, check whether it is registered in platform_crawler_register.py; if it is not, exit the program.
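
A minimal sketch of items 1.2 to 1.4, assuming the registry can be viewed as a set of site names. REGISTERED_PLATFORMS, parse_platform, unregistered_sections and the 'example_site' entry are hypothetical names used for illustration only; the real contents of platform_crawler_register.py are not shown in this note.

```python
import argparse
import configparser
import sys

# Hypothetical stand-in for the names registered in platform_crawler_register.py.
REGISTERED_PLATFORMS = {"腾讯视频", "example_site"}


def parse_platform(argv, registered=REGISTERED_PLATFORMS):
    """Parse --platform; exit if it is empty (1.3) or not registered (1.4)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--platform", default="",
                        help="site name, as registered")
    args = parser.parse_args(argv)
    if args.platform == "":
        sys.exit("platform is required")
    if args.platform not in registered:
        sys.exit("unknown platform: %s" % args.platform)
    return args.platform


def unregistered_sections(ini_text, registered=REGISTERED_PLATFORMS):
    """Return the [section] headers that do not match a registered name (1.2)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    return [name for name in config.sections() if name not in registered]
```

Running unregistered_sections over the text of list_page_urls.ini at startup would be one way to catch a section header that has drifted from the registry.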

2 Naming conventions, covering both file names and function names (lowest priority; can be changed last, if time allows)
lst_page -> list_page

[2018-12-25]
1 For a releaser_page crawler, the function must be named releaser_page so that the framework can import it.
2 The releaser_page function takes releaserUrl as its input variable; other functions such as get_releaser_id and get_releaser_uk must be included in this function.
3 es_index and doc_type must be given so that we can remove some if/else branches from the output process. At the beginning, if es_index is None, es_index defaults to crawler-data-raw.
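
A rough sketch of the three rules above, not the actual crawler code. It assumes "included in this function" means the helper is called from inside releaser_page, and that es_index and doc_type are keyword arguments. The body of get_releaser_id and the returned dict are placeholders invented for illustration.

```python
DEFAULT_ES_INDEX = "crawler-data-raw"


def get_releaser_id(releaserUrl):
    # Placeholder logic: treat the last path segment as the id.
    # A real crawler would use site-specific parsing here.
    return releaserUrl.rstrip("/").rsplit("/", 1)[-1]


def releaser_page(releaserUrl, es_index=None, doc_type=None):
    """Entry point the framework looks up by its fixed name."""
    # Resolve the default once, at the beginning, so the output step
    # needs no further if/else on es_index.
    if es_index is None:
        es_index = DEFAULT_ES_INDEX
    # The helper is called here, so releaserUrl is the only required input.
    releaser_id = get_releaser_id(releaserUrl)
    return {"releaser_id": releaser_id,
            "es_index": es_index,
            "doc_type": doc_type}
```

Because every crawler module would expose the same function name, the framework could fetch it with something like getattr(module, 'releaser_page') without per-site special cases.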