Overview
In the era of mass communication, the Internet provides a wider stage for netizens, and emerging media such as Facebook, Twitter, micro-blogs, and forums have become powerful tools for spreading public information. Compared with traditional media such as newspapers and TV, information spreads over the Internet at great speed. With the rapid development of Web 2.0 technologies, everyone can express an opinion on anything or anybody. In other words, everyone can be an author.
However, every coin has two sides. The instant propagation of public opinion on unexpected incidents can spark social unrest with destructive consequences. Government security departments need to grasp the situation and the trend of public opinion, and today they employ technical engineers to watch the Internet manually. Unfortunately, the Internet now has so many websites (the figure below shows worldwide website growth by year) that the amount of information is enormous; it is updated constantly and has a wide range of influence. Handling this volume of information by hand in the traditional way makes the task of monitoring public opinion arduous and unrealistic.
Therefore, it is necessary to implement an automatic monitoring system that can instantly discover negative information across websites worldwide and actively guide public opinion.
Solution Architecture
A secure method is applied for crawling and saving web content while users browse the web.
The collected data is saved in a structured format after ETL processing to facilitate further analysis and investigation. The distributed crawler supports crawling data for multiple, disparate queries simultaneously, and the whole solution's distributed architecture can easily be scaled up as the data volume grows over time.
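As an illustration of the ETL step described above, the sketch below normalizes a raw crawled page into a structured record; the record layout and field names are assumptions for illustration, not the product's actual schema:

```python
# Minimal ETL sketch: turn a raw crawled page into a structured record.
# The PageRecord fields are illustrative, not the product's schema.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

from bs4 import BeautifulSoup  # pip install beautifulsoup4


@dataclass
class PageRecord:
    url: str
    title: str
    text: str
    fetched_at: str


def extract(url: str, raw_html: str) -> PageRecord:
    soup = BeautifulSoup(raw_html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Drop script/style noise so only visible text is kept.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    return PageRecord(url, title, text,
                      datetime.now(timezone.utc).isoformat())


record = extract("https://example.com",
                 "<html><head><title>Demo</title></head>"
                 "<body><p>Hello world</p></body></html>")
print(asdict(record))
```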
The solution records query logs, including search terms, search parameters, query logic, query algorithms, etc. These logs can later be customized and shared as knowledge within the local user community or organization.
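A minimal sketch of what such a query-log entry could look like, assuming a JSON-lines log file; the field names mirror the items listed above, but the exact format is an assumption:

```python
# Sketch of a query-log entry. The layout is an assumption made for
# illustration; only the logged items come from the description above.
import json
from datetime import datetime, timezone


def log_query(path: str, terms: list, parameters: dict, logic: str) -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "terms": terms,            # search terms
        "parameters": parameters,  # search parameters, e.g. language, range
        "logic": logic,            # query logic, e.g. boolean expression
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


log_query("query_log.jsonl",
          ["protest"], {"lang": "en", "days": 7}, "protest AND city")
```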
The solution supplies a standard API interface for integration with external services; supported programming languages include C#, Java, JavaScript, .NET, Python, C++, PHP, etc.
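As a sketch of how such an API might be called from one of the listed languages (Python), assuming an HTTP endpoint; the URL, payload fields, and token handling below are hypothetical placeholders, since the document does not specify the API's actual shape:

```python
# Hypothetical client call against the solution's API. The endpoint,
# payload, and bearer token are placeholders, not a documented interface.
import requests

API_URL = "https://monitoring.example.com/api/v1/search"  # placeholder


def search(term: str, token: str) -> dict:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={"term": term, "sources": ["news", "forums"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```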
The crawler engine of the solution can collect virtually any Web data from any HTML-based source, including simple static HTML sites, rich dynamic Web pages, social network sites, portals, forums, blogs, and news. The deep Web includes dynamic pages that require domain knowledge and unlinked content that is difficult to reach (for example, data secured behind a user-password login or blocked by CAPTCHAs). Dark-website collection extracts data from web layers that are inaccessible from regular Web browsers and require a TOR proxy node to open.
The solution can access dark websites in a secure manner (via Tor), so the user's IP address and identifying information are not exposed; it applies automated proxy rotation, anonymizer scoring, and virtual-identity technology.
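A minimal sketch of routing a request through Tor, as the collection method above describes; it assumes a local Tor daemon listening on its default SOCKS port 9050 and requests installed with SOCKS support (pip install requests[socks]):

```python
# Fetch a page through a local Tor SOCKS proxy so the client IP is not
# exposed to the target site. Assumes Tor is running on 127.0.0.1:9050.
import requests

TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h: resolve DNS via Tor


def fetch_via_tor(url: str) -> str:
    resp = requests.get(
        url,
        proxies={"http": TOR_PROXY, "https": TOR_PROXY},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text


# Sanity check: confirm the exit IP differs from the local one.
print(fetch_via_tor("https://check.torproject.org/"))
```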
The solution can collect information in a timely manner from various online open sources (a minimal polling sketch follows the list):
Social networks (Facebook, Twitter, and YouTube)
Paltalk and forum websites
News websites
…
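Below is a minimal sketch of such periodic collection, assuming plain HTTP polling of a hypothetical source list; real collection from social networks would typically go through their official APIs rather than raw HTTP:

```python
# Sketch of polling a set of open sources on a fixed schedule. The
# source URLs and interval are illustrative placeholders.
import time

import requests

SOURCES = [
    "https://example-news-site.com",  # placeholder news site
    "https://example-forum.com",      # placeholder forum
]


def poll_once() -> None:
    for url in SOURCES:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            print(f"{url}: fetched {len(resp.text)} chars")
            # ...hand the raw HTML to the ETL step shown earlier...
        except requests.RequestException as exc:
            print(f"{url}: failed ({exc})")


while True:  # runs until interrupted
    poll_once()
    time.sleep(300)  # poll every 5 minutes
```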
Based on the collected data, the solution performs various kinds of analysis to extract further intelligence and let users work more efficiently. Furthermore, the security agency can actively guide public opinion based on the collected and extracted intelligence.
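As a toy illustration of one such analysis, the sketch below flags documents containing negative keywords; a production system would use trained sentiment models, and the keyword list here is purely a stand-in:

```python
# Toy negative-content flagging via keyword matching. The term list is
# a stand-in for the real analytics the document alludes to.
NEGATIVE_TERMS = {"riot", "attack", "unrest", "scandal"}


def flag_negative(text: str) -> bool:
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & NEGATIVE_TERMS)


docs = [
    "City marathon draws record crowds",
    "Unrest reported after stadium incident",
]
for doc in docs:
    print(flag_negative(doc), "-", doc)
```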