319 Event Data Collection
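Purpose
For cultural (historical) and intelligence reasons, collect as much event-related data as possible.
Data with privacy concerns must be processed or skipped.
Scope (tentative)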
- PTT post
Simon Pai: can be scraped with tkirby's tool
- Facebook post
- news report
- blog article
- video stream record
UStream recorded videos have timestamps
- video transcripts
- misc video clips
- videos of legislators' interpellation sessions
- https://hackpad.com/323-vA3xcpQnSCB
- https://www.youtube.com/user/marktwaingroup
no timestamps found for YouTube yet orz
- g0v IRC
- padnews: results of parsing the hackpad live text coverage:
- http://padnews.linode.caasigd.org/
- API
- latest: http://padnews.linode.caasigd.org/json/
- all: http://padnews.linode.caasigd.org/json/all/
- single entry: http://padnews.linode.caasigd.org/json/0/
- repos
- parser: https://github.com/g0v/padnews
- cli: https://github.com/g0v/padnews-cli
- web: https://github.com/g0v/padnews-web
- physical newspapers
Very bare-bones for now; if there is a better way to handle this, patches or suggestions are welcome.
caasi ++ a very readable interface
- photo albums?
- rumors and hearsay?
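The padnews JSON endpoints listed above (latest, all, single entry) could be queried roughly as follows. This is only a sketch: the function names `padnews_url` and `fetch_padnews` are made up for illustration, the response is assumed to be UTF-8 encoded JSON, and its schema is not described in this pad, so the result is returned as-is without interpreting any fields. The server may also no longer be reachable.

```python
import json
from urllib.request import urlopen

# Base URL taken from the API list in this pad.
BASE_URL = "http://padnews.linode.caasigd.org/json/"


def padnews_url(entry=None):
    """Build a padnews JSON URL (hypothetical helper).

    entry=None  -> latest
    entry="all" -> all entries
    entry=<int> -> a single entry, e.g. 0
    """
    if entry is None:
        return BASE_URL
    if entry == "all":
        return BASE_URL + "all/"
    if isinstance(entry, int) and not isinstance(entry, bool) and entry >= 0:
        return f"{BASE_URL}{entry}/"
    raise ValueError(f"unsupported entry selector: {entry!r}")


def fetch_padnews(entry=None, timeout=10):
    """Fetch and decode one endpoint.

    Assumes the body is UTF-8 JSON; the structure of the decoded
    object is not documented here, so it is returned unchanged.
    """
    with urlopen(padnews_url(entry), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping URL construction separate from the network call makes it possible to check the selectors offline, for example `padnews_url("all")` or `padnews_url(0)`, before running `fetch_padnews()` against the live server.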
Format requirements
- content
- source
- timestamp or time range
Video: tag/comment by timestamp?
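One possible way to model the three required fields (content, source, timestamp or time range) is a small record type like the one below. The class name `CollectedItem` and the field names `start` and `end` are assumptions for illustration, not an agreed format; a single timestamp is represented here as a range with no `end`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class CollectedItem:
    """Hypothetical record for one collected piece of data."""

    content: str                     # the text, or a path/URL to the media
    source: str                      # where it came from, e.g. a URL
    start: datetime                  # timestamp, or start of the time range
    end: Optional[datetime] = None   # None means a single timestamp

    def __post_init__(self):
        # Reject ranges that run backwards in time.
        if self.end is not None and self.end < self.start:
            raise ValueError("end must not be earlier than start")

    def covers(self, moment: datetime) -> bool:
        """Return True if `moment` falls on the timestamp or inside the range.

        All datetimes should be either naive or timezone-aware consistently;
        mixing the two raises TypeError on comparison.
        """
        if self.end is None:
            return moment == self.start
        return self.start <= moment <= self.end
```

With a shape like this, a tag or comment on a video could be stored as its own item whose `start` points at the moment in the recording it refers to, which would only work for sources that expose timestamps.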
Tools
- https://github.com/zbryikt/ptt-crawler
- https://www.npmjs.org/package/streamy-data
- https://github.com/g0v/padnews
Known Source Sites
- http://taiwan0314.s3-website-ap-northeast-1.amazonaws.com/
- http://www.appledaily.com.tw/realtimenews/article/new/20140329/369121/
- http://time-fumao.rhcloud.com/index.html
Application
use file hashes to identify duplicate resources?
news archiving
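The file hash idea above could look something like this. It is a minimal sketch that assumes the resources are already downloaded as local files and that SHA-256 is an acceptable hash; the function names `file_sha256` and `find_duplicates` are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks
    so that large video files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths by content hash and keep only groups with
    more than one file, i.e. byte-identical duplicates."""
    by_hash = {}
    for p in paths:
        by_hash.setdefault(file_sha256(p), []).append(Path(p))
    return {h: ps for h, ps in by_hash.items() if len(ps) > 1}
```

This only catches byte-identical copies. A re-encoded video or a news page saved with different surrounding markup would hash differently, so such near-duplicates would need some other comparison.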