Handling big data of online social networks on a small machine
© Jia et al.; licensee Springer. 2015
Received: 2 November 2014
Accepted: 24 February 2015
Published: 14 March 2015
Dealing with big data in computational social networks may require powerful machines, large storage, and high bandwidth, which may seem beyond the capacity of small labs. We demonstrate that researchers with limited resources may still be able to conduct big-data research by focusing on a specific type of data. In particular, we present a system called MPT (Microblog Processing Toolkit) that handles large volumes of microblog posts, on the order of tens of millions of posts a day, on commodity computers. MPT supports fast search on multiple keywords and returns statistical results. We describe in this paper the architecture of MPT for data collection and for phrase search that returns search results with statistical analysis. We then present different indexing mechanisms and compare them on the microblog posts we collected from popular online social network sites in mainland China.
Dealing with big data in computational social networks may require big machines with big storage. It may therefore seem that only large companies or organizations with generous budgets can afford big-data research. We show that, by focusing on a specific type of data, it is possible to carry out big-data research on online social networks (OSNs) using commodity computers in a small lab environment. In particular, we present a system for handling large volumes of microblog posts (MBPs). Our system is called MPT, which stands for Microblog Processing Toolkit.
MPT collects MBPs from popular OSN sites in mainland China, identifies interesting topics from MBPs, and carries out statistical analysis on each topic. The statistical analysis includes gender and location distributions, frequent words, and trends. We have collected on average approximately 4.5 million (sometimes over 10 million) MBPs a day since September 2012. We stored these MBPs in Mongo DB running on a commodity computer.
To make use of these data, MPT is required to support, among other things, phrase search that will quickly return the set of MBPs that contain a set of words entered by the user and the statistical results of these posts displayed in various graphs. For this purpose, we would need to create an appropriate indexing mechanism and update the indexing content regularly. In addition, we also want to retrieve MBPs in real time while we are collecting them, so that we may detect unexpected social events and perform other tasks.
In this paper, we present three indexing methods deployed on commodity computers. Without using clusters of computers, we were able to build a system suitable for implementing a fast search engine and carrying out topic modeling and statistical analysis on large volumes of MBPs.
The rest of the paper is organized as follows: In the ‘Data collection’ section, we will describe the data source and the API (Application Programming Interface) we used to collect MBPs. In the ‘Database’ section, we will introduce the database we use to store the data and describe some of the problems we encountered when storing MBPs. In the ‘Data retrieval’ section, we will describe a number of indexing mechanisms, including the default Mongo DB queries using regular expressions, our own implementation of the nextword indexing, and an indexing system we built based on Lucene. In particular, we will describe the structures of the systems for indexing, searching, and carrying out statistical analysis. In the ‘Experiments’ section, we will compare the speed of querying, the speed of performing statistical analysis, and the accuracy of each method on real data sets. We conclude the paper in the final section.
Under the restriction of the user privilege given to us, we collected on average about 4.5 million (sometimes over 10 million) MBPs a day from these OSN sites. These MBPs were semi-structured, JSON-style records.
To handle semi-structured MBPs in large quantity, we need a high-performance, stable, and flexible database system. Because the MBPs do not follow a fixed schema, we chose Mongo DB for this purpose, which is a common choice for storing such data.
MBPs from different sources use different data structures. With Mongo DB, we can store records with different structures in the same collection, which makes the data convenient to manage. Moreover, Mongo DB is a database system with high performance on both read and write operations, which meets our need for intensive writing and querying of MBPs.
Initially, we stored in one collection the MBPs posted on the same date and stored all the collections in the same database. After running the system for a few weeks, we experienced unexpected system crashes. The reason was that Mongo DB would write data in the same database into the same file and load all the files that are accessed frequently into the main memory. As more MBPs were stored in the same database, the file grew quickly, causing Mongo DB to consume almost all the RAM and crash the system. To solve this problem, we divided the collections of MBPs into different databases according to a fixed time interval of 1 week. Because writing to the database was the main operation of the system and the system only collected real-time data, Mongo DB would load only the file of the most recent week into RAM and thus consume much less RAM than before. The system has not crashed since we made this change.
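The weekly partitioning can be illustrated with a small naming helper. This is only a sketch under stated assumptions: the paper does not say how databases or collections were named or where a week begins, so the function names, the `mbp` and `posts_` prefixes, and the use of ISO week numbers below are all hypothetical.

```python
from datetime import date


def database_name(day: date, prefix: str = "mbp") -> str:
    """Map a posting date to a per-week database name (hypothetical naming).

    ISO week numbering is an assumption; the paper only states a fixed
    interval of 1 week.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}_{iso_year}_w{iso_week:02d}"


def collection_name(day: date) -> str:
    """Map a posting date to a per-day collection name (hypothetical naming)."""
    return day.strftime("posts_%Y%m%d")
```

With such a mapping, all writes for the current week go to a single database, so only that week's file needs to stay resident in RAM.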
We note that we may also consider using an SQL-based database for reliability.
Given a keyword, retrieve all MBPs that contain the keyword.
Given a set of keywords, retrieve all MBPs that contain at least one of these keywords (this is the logical OR operation).
Given a set of keywords, retrieve all MBPs that contain all of the keywords (this is the logical AND operation).
We approached these tasks using the following three methods.
Mongo DB regular expressions
Mongo DB provides a built-in regular expression search method. Given a regular expression, Mongo DB returns all the records whose text matches it. The system then goes over all the returned MBPs and counts them for each statistical feature we are interested in, such as gender, location, and sentiment. This search method, however, is inefficient and does not meet our needs. To speed up the search process, we developed an indexing system based on the nextword indexing scheme.
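For reference, the OR and AND keyword queries can be written as regular expressions. The sketch below uses Python's `re` module to build and check the patterns; the helper names are hypothetical, and the paper does not show the expressions it actually used. A pattern of this kind could be supplied to Mongo DB's `$regex` query operator, but that wiring is not shown here.

```python
import re


def or_pattern(keywords):
    """Pattern matching text that contains at least one keyword (alternation)."""
    return "|".join(re.escape(k) for k in keywords)


def and_pattern(keywords):
    """Pattern matching text that contains every keyword, in any order.

    Each keyword becomes a lookahead; [\\s\\S]* is used instead of .* so that
    line breaks inside a post do not block the match.
    """
    return "".join(rf"(?=[\s\S]*{re.escape(k)})" for k in keywords)


def matching_posts(pattern, posts):
    """Return the posts whose text matches the pattern anywhere."""
    compiled = re.compile(pattern)
    return [p for p in posts if compiled.search(p)]
```

Every post has to be scanned against the pattern, which is consistent with the inefficiency noted above.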
Based on the nextword indexing, we built a system consisting of two parts: (1) an index server and (2) a search server. The index server indexes the collected MBPs in real time. In particular, we store the word-pair list in the main memory and the information of each word pair in the database. This list contains the position of the word pair in the text and the information of the MBPs. For an MBP M and a word pair (w,s), where s is the word that immediately follows w, the position of (w,s) in M is the same as the position of w in M.
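A minimal in-memory sketch of a nextword index may help make the structure concrete. It assumes posts are already segmented into words and keeps everything in Python dictionaries; the actual system splits the data between RAM and the database, which is not modeled here. All function names are hypothetical.

```python
from collections import defaultdict


def build_nextword_index(posts):
    """Build {w: {s: [(post_id, position of w), ...]}} for adjacent pairs (w, s).

    posts: {post_id: list of already-segmented words}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for post_id, words in posts.items():
        for pos in range(len(words) - 1):
            index[words[pos]][words[pos + 1]].append((post_id, pos))
    return index


def phrase_search(index, phrase):
    """Return ids of posts that contain the words of `phrase` consecutively.

    `phrase` is a list of at least two words; shorter input returns no posts.
    """
    hits = None
    for offset, (w, s) in enumerate(zip(phrase, phrase[1:])):
        # Shift each pair back to the position where the phrase would start.
        starts = {(pid, pos - offset) for pid, pos in index.get(w, {}).get(s, [])}
        hits = starts if hits is None else hits & starts
    return {pid for pid, _ in hits or ()}
```

A three-word phrase is answered by intersecting the postings of its two overlapping word pairs, without scanning any post text.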
Our early version of the system stored each word pair in RAM. We observed that, for less frequent words w, if we only stored the word-pair position of w in RAM but not the word pairs (w,s), then the search speed would only be mildly affected, while the consumption of RAM would drop tremendously. Because retrieving a word pair (w,s) not stored in RAM will incur more time, to balance between the search speed and the RAM occupancy by the nextword indexing, we set a threshold value on word-pair counts, so that the system only stores in RAM the nextword list of words with frequencies over the threshold. This measure cuts down the RAM usage significantly.
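The thresholding step can be sketched as a split of the index into a RAM-resident part and a part left to the database. This assumes an index shaped as `{w: {s: postings}}` and measures the frequency of w as its total number of word-pair postings; both the shape and the counting rule are assumptions, as is the function name.

```python
def split_by_threshold(index, threshold):
    """Split a nextword index into (in_ram, spilled) by frequency of the first word.

    index: {w: {s: [postings]}}. Words whose total posting count exceeds
    `threshold` keep their nextword lists in RAM; the rest are returned in
    `spilled`, which stands in for the database-backed part.
    """
    in_ram, spilled = {}, {}
    for w, nextwords in index.items():
        count = sum(len(postings) for postings in nextwords.values())
        (in_ram if count > threshold else spilled)[w] = nextwords
    return in_ram, spilled
```

Raising the threshold moves more words out of RAM at the cost of slower lookups for those words.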
After several weeks of running the system, we encountered another problem, a data explosion: the number of MBPs provided through the APIs suddenly increased to more than twice what we normally collected in 1 day. The index grew accordingly and occupied about 5 GB of RAM. Since we deployed the system on a commodity computer, we could not afford such RAM usage. This motivated us to devise a solution with low RAM consumption, which we built using Apache Lucene.
For convenience, we will call a person who authors or retransmits an MBP a sender. For each MBP, we will include in one string its sender’s location and gender, the posting time, and the OSN on which the MBP was posted. This string is used as the indexing key of the MBP.
In addition to returning the MBPs, the Microblog Processing Toolkit also returns a number of statistical results of these MBPs, including the proportion of genders, the provinces of the senders, the sources of the MBPs, and the distribution of the posting time. Moreover, hot words (i.e., words with high frequencies) are generated while returning the MBPs.
To carry out statistical analysis on the MBPs we have collected, we group them according to the following three attributes: 1) posting time, 2) sender’s gender, and 3) sender’s location.
Using the built-in regular expressions of Mongo DB, we would only need to traverse all the MBPs the database returned and then carry out the statistical analysis. This process, however, is extremely time consuming.
Using the nextword indexing, we could complete the statistical analysis in just a few steps. Since the nextword index file contains the position of each word pair and the required statistical features, the system can carry out statistical analysis by traversing the position list for the given input phrase to be searched for without querying the entire database.
We need to customize our own statistical features. As mentioned earlier, for each MBP, we include in the index its sender’s location, gender, and posting time in one string. We set this string as the key of the MBP. This means that for two MBPs posted during the same hour, at the same location, and by the senders of the same gender, they will share the same key. When indexing the MBP, we include this key in the index. When searching for the data, we first carry out the group search by the key. After obtaining the groups with the same key, we traverse each group and perform the statistical analysis in the group (details of statistical analysis are not included in this paper).
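The key-based grouping can be sketched in plain Python, standing in for the grouping done by the index. The field names (`posted_at`, `location`, `gender`), the `|` separator, and the function names are hypothetical; only the idea of one key per hour, location, and gender comes from the text.

```python
from collections import Counter
from datetime import datetime


def group_key(post):
    """Composite key: posting hour, sender location, sender gender."""
    hour = post["posted_at"].strftime("%Y-%m-%d %H")
    return f"{hour}|{post['location']}|{post['gender']}"


def group_counts(posts):
    """Count how many posts fall under each composite key."""
    return Counter(group_key(p) for p in posts)


def gender_distribution(counts):
    """Aggregate per-key counts into a per-gender total."""
    dist = Counter()
    for key, n in counts.items():
        dist[key.rsplit("|", 1)[1]] += n
    return dist
```

Because many posts share a key, a distribution is computed by visiting each group once rather than each post.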
The trending graph, gender distribution, location distribution, and source OSNs can be easily calculated by a simple traversal of the MBPs in each group. However, the hot words and the top senders cannot be obtained this way since each MBP may have its unique top tf-idf words. Instead, MPT traverses all the MBPs contained in each group and adds all the keywords and senders to two different maps. It then analyzes these two maps to obtain the list of hot words and the list of top senders.
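The two-map approach can be sketched as follows. The post fields `keywords` and `sender` are hypothetical, and ranking by raw counts is an assumption; the paper does not state how the maps are analyzed.

```python
from collections import Counter


def hot_words_and_top_senders(groups, k=3):
    """Collect keywords and senders from every post in every group.

    groups: {group_key: [post, ...]}, where each post is a dict with a
    'keywords' list and a 'sender' id. Returns the k most frequent keywords
    and the k most frequent senders as (item, count) lists.
    """
    word_map, sender_map = Counter(), Counter()
    for posts in groups.values():
        for post in posts:
            word_map.update(post["keywords"])
            sender_map[post["sender"]] += 1
    return word_map.most_common(k), sender_map.most_common(k)
```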
We designed an experiment to test the efficiency and accuracy of the three methods for data retrieval and statistical analysis mentioned in the previous sections. For convenience, we will refer to these methods of Mongo DB’s built-in regular expressions, nextword indexing, and Lucene-based indexing as, respectively, Mongo, Nextword, and Lucene. We used the MBPs we collected in 1 day (the day was randomly chosen) from Sina and Tencent as the data set for comparing the three different methods. There were about 4.3 million MBPs in this 1-day collection. All experiments were executed on a commodity computer with a quad-core CPU and 16-GB RAM.
In practice, the system is designed to handle the posts of multiple days with volume much larger than the data used in the experiment. For a dataset of very large scale, our system will slice the dataset into multiple subsets. In particular, for a dataset that contains posts in multiple days, the system will slice the dataset into multiple subsets such that each subset contains the posts of the same day. Thus, our experiment does indicate the fundamental performance. We repeated the experiment on datasets from different dates with similar results.
The experiment consisted of two parts. In the first part, we examined how fast each method responded to user queries, as well as how fast the method carried out statistical analysis and returned the results. In the second part, we compared the accuracy of each method.
To run the experiment, we pre-counted all the keyword phrases in the data set and randomly selected a number of them as our testing phrases, such that each phrase was made up of two or three keywords and appeared in the test data set more than 100 times.
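The selection procedure can be sketched as counting all two- and three-word phrases and then sampling from those above the frequency cutoff. The paper does not say how phrases were counted or sampled, so counting every occurrence of consecutive words, the fixed random seed, and the function names are assumptions.

```python
import random
from collections import Counter


def count_phrases(segmented_posts, lengths=(2, 3)):
    """Count every run of 2 or 3 consecutive words across all posts."""
    counts = Counter()
    for words in segmented_posts:
        for n in lengths:
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def pick_test_phrases(counts, how_many, min_count=100, seed=0):
    """Randomly pick phrases that occur more than `min_count` times."""
    eligible = sorted(p for p, c in counts.items() if c > min_count)
    return random.Random(seed).sample(eligible, min(how_many, len(eligible)))
```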
First, we compared the response time of each method. In the resulting plot, the horizontal axis represents the actual frequency of the keywords, and the vertical axis represents the running time of returning the MBPs that contain the keywords.
Lucene offers the fastest response time, which increases as the phrase frequency increases.
Nextword is slower than Lucene but close to it, although its response time is not as steady as Lucene's.
Mongo is the slowest and does not meet the real-time search requirement.
We then compared the speed of carrying out statistical analysis for each method. From Figure 7, we can see that Lucene is still the fastest, Nextword is slower than Lucene, and Mongo is substantially slower. The time interval between responding to the search query and finishing the statistical analysis is quite short for Lucene and is much longer for the other two methods. We note that each method uses a different mechanism for carrying out the statistical analysis. With Mongo, we need to traverse all the MBPs returned by the query, which incurs significant computing time when the returned data set is large. The nextword indexing is a well-structured indexing mechanism in which no traversal of the posts is needed to perform statistical analysis, since the required features are stored with the word-pair positions. Lucene finishes statistical analysis by traversing all the groups instead of all the posts.
We can also see from Figure 7 that the running time of the Mongo search does not increase much as the frequency of the keywords increases, while the running time of the Nextword search and the Lucene search is clearly related to the frequency of the keywords.
From Figure 8, we can see that the methods differ in accuracy. Mongo is 100% accurate; in other words, its precision and recall rates are both equal to 1.
Nextword may miss some MBPs. The main cause is segmentation error on keywords in the Chinese language. Keyword segmentation in Chinese differs from that in English because standard Chinese writing contains no spaces between words. Thus, different segmentation tools may return different segmentation results. Even for the same keyword and the same segmentation tool, the results may differ depending on the surrounding text. For the keywords we queried in the experiments, the segmentation result in the MBPs could differ from that in the search query, and the MBPs with different segmentation results would be missed.
Lucene, on the other hand, seems to have the worst precision and recall rates, where the number of returned MBPs is usually larger than the actual number that contains the phrase. This is caused by the Lucene indexing structure, where all the MBPs that contain a subset of the search keywords are returned. For example, if phrase X is made up of words A and B, when querying X, Lucene will return all the posts that contain A or B. So, the returning result usually has a larger-than-actual count.
Experimental results on precision and recall rates
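Precision and recall can be computed from the set of returned MBPs and the set of MBPs that actually contain the phrase, as in the sketch below. The function name is hypothetical, and treating an empty set as a score of 1.0 is a convention chosen here, not something stated in the paper.

```python
def precision_recall(returned_ids, relevant_ids):
    """Return (precision, recall) for one query.

    returned_ids: ids of the MBPs a method returned.
    relevant_ids: ids of the MBPs that actually contain the phrase.
    """
    returned, relevant = set(returned_ids), set(relevant_ids)
    true_pos = len(returned & relevant)
    precision = true_pos / len(returned) if returned else 1.0
    recall = true_pos / len(relevant) if relevant else 1.0
    return precision, recall
```

A method that returns extra posts lowers its precision, while a method that misses posts lowers its recall.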
From our experiment, we can see that each method has its pros and cons. In particular, Mongo provides the regular expression search method with perfect accuracy, but it is too slow to meet the needs of real-time querying and statistical analysis. Nextword performs well in statistical analysis, and its response time can meet the need of real-time search, but its memory consumption is high, so it cannot keep up with the growing volume of data. Lucene, on the other hand, has the fastest response time and the fastest statistical analysis time, but its accuracy is low. This method is good for building a fast search engine but may not meet the requirement of high accuracy.
We note that the statistical result returned by the system is in a tree-structured JSON format, which contains the information we are interested in. Such information is straightforward to obtain with little extra time. For example, for MBPs of a particular topic, it is easy to obtain from the statistical result the number of postings by users of a particular gender throughout a particular period of time of the day.
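As an illustration of such a lookup, the sketch below builds a small nested structure and reads a per-gender count for a time window from it. The actual layout of the JSON returned by MPT is not given in the paper, so the gender-then-hour nesting and both function names are purely hypothetical.

```python
import json


def build_tree(group_counts):
    """Turn {(gender, hour): n} into a JSON-friendly {gender: {hour: n}} tree."""
    tree = {}
    for (gender, hour), n in group_counts.items():
        tree.setdefault(gender, {})[str(hour)] = n
    return tree


def posts_in_period(tree, gender, start_hour, end_hour):
    """Sum the posts by one gender for hours in [start_hour, end_hour)."""
    by_hour = tree.get(gender, {})
    return sum(by_hour.get(str(h), 0) for h in range(start_hour, end_hour))
```

Hours are stored as string keys so that the tree survives a round trip through `json.dumps` and `json.loads` unchanged.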
Mongo is good for storing data but not suitable for carrying out search on big data.
Nextword offers good performance on real-time search with fast response time and statistical analysis time. But it would take up too much RAM. Nextword would be a good choice for analyzing moderate-size data (e.g., less than 3 million MBPs) with high requirement on statistical analysis.
Lucene is a low RAM-consumption and stable system. But it has the worst performance on precision and recall. For those who need to analyze big volume of data but do not require exact statistical results, Lucene would be a better choice.
The authors were supported in part by the NSF under grants CNS-0953620, CNS-1018422, and CNS-1247875. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. The authors thank Peng Xia and Liqun Shao for participating in various discussions of the project.
- Williams, HE, Zobel, J, Anderson, P: What’s next? Index structures for efficient phrase querying. In: Australasian Database Conference, pp. 141–152, Auckland, New Zealand (18-21/1/1999).
- The Apache Software Foundation: Apache Lucene (2/1/2015). https://lucene.apache.org/.
- Sina corporation: Open Document for Sina Micro-blog API (2/1/2015). http://open.weibo.com/wiki/Statuses/public_timeline/en.
- MongoDB, Inc: MongoDB (2/1/2015). http://www.mongodb.org/.
- Bahle, D, Williams, H, Zobel, J: Compaction techniques for nextword indexes. In: International Symposium on String Processing and Information Retrieval, pp. 0033–0033, Laguna de San Rafael, Chile (13-15/11/2001).
- Wantology Corporation: Wocson (2/1/2015). http://www.wocson.net/.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.