Sogou Labs Shares Useful Data
Sogou, the search engine of Sohu, recently launched its labs. The labs will showcase innovative products, product prototypes, data on search and Chinese characters, and research reports on search written by Sogou engineers.
Currently, the products and prototypes on the labs webpage include Sogou Chinese Characters Input software, Sogou Ranks, and Webpage Auto Categorization, a prototype that classifies any Chinese webpage into predefined categories.
However, the most important item, I think, is the search data shared by Sogou. The data shared so far include:
- User search logs: data on search keywords, the URLs users clicked, and more, which can be used, for example, to analyze the search behavior of Chinese internet users.
- Pre-categorized text data for research in automatic categorization.
- Most-used words database: about 150,000 of the most frequently used Chinese words, with their frequencies and parts of speech.
- Internet corpus database: a Chinese corpus drawn from around 40 million webpages.
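As a rough idea of what analyzing the search log could look like, here is a minimal Python sketch that counts the most frequent queries. The file layout is an assumption on my part, not something Sogou documents here: I assume one record per line, tab-separated, with the query in the first field and the clicked URL in the second. The function name `top_queries` and the sample records are made up for illustration.

```python
from collections import Counter


def top_queries(lines, n=3):
    """Return the n most frequent queries from tab-separated log lines.

    Assumed (hypothetical) layout: query <TAB> clicked URL.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines that do not match the assumed layout
        counts[fields[0]] += 1
    return counts.most_common(n)


# Made-up records, only to show the call; not real Sogou data.
sample = [
    "weather\thttp://example.com/a\n",
    "music\thttp://example.com/b\n",
    "weather\thttp://example.com/c\n",
]
print(top_queries(sample))
```

If the real files use a different separator or field order, only the `split` call and the field index need to change.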
The data can be used in any non-commercial project with credit to Sogou. I like Sogou's open attitude; it may help harness collective intelligence to advance research on Chinese search engines.
2 Responses to “Sogou Labs Shares Useful Data”
Interesting. I downloaded the data, but I don’t have a clue how to open them. Have you downloaded the data, and do you know how to open the files and which software to use?
Cheers,
G.
Do you mean the tar.gz compressed file? You can use WinRAR to decompress it.
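If you would rather not install extra software, a tar.gz file can also be unpacked with Python's standard `tarfile` module. This is only a sketch: the function name `extract_archive` and the file names in the usage line are placeholders, not the actual names of the Sogou downloads.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive and return its member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Newer Python versions offer a safer "data" extraction filter;
        # use it when available and fall back to the default otherwise.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest_dir, **kwargs)
    return names
```

For example, `extract_archive("data.tar.gz", "sogou_data")` would unpack a hypothetical `data.tar.gz` into a `sogou_data` folder and return the list of files it contained.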