Monday, 23 April 2012

A Year of Mining Twitter


A year ago a small project was put together to mine Twitter for vulnerability ID prefixes.  There were two big motivators at the time that sparked this mini-experiment:
  1. Recent project work integrating vulnerability intelligence feeds into a VM system made it clear that decent technical write-ups were rarely referenced in public feeds such as NVD
  2. Keeping up to date with day to day security news and research is difficult and time-consuming
Twitter seemed like a potential medium for tackling these two issues.  Twitter itself is rather fascinating - amid the vast amount of utter rubbish that flows through it every second, there is some genuinely excellent data.

Mining for vulnerability prefixes was chosen for several reasons, including:
  • They are relatively unique character sequences
  • The mining process still works with Unicode text
  • People referring to public vulnerabilities often refer to them by ID explicitly
  • People who refer to vulnerabilities usually have some security relevance
  • Hash tags are stupid
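
As a rough illustration of why ID prefixes are convenient to mine, here is a minimal Python sketch of pulling them out of tweet text. It is not the code behind Talkback: the function name `extract_vuln_ids` and the two patterns (CVE-YYYY-NNNN and Microsoft bulletin MSYY-NNN) are assumptions made for the example, and a real miner would likely track more prefixes. ASCII-only boundaries are used instead of `\b` so that an ID embedded directly in unspaced non-Latin text is still found.

```python
import re

# Assumed ID formats (illustrative only): CVE-YYYY-NNNN and MSYY-NNN.
# The lookbehind/lookahead use ASCII classes so that IDs glued to
# non-Latin characters (e.g. Japanese text) still match.
VULN_ID = re.compile(
    r"(?<![A-Za-z0-9])(CVE-[0-9]{4}-[0-9]{4,}|MS[0-9]{2}-[0-9]{3})(?![0-9])",
    re.IGNORECASE,
)


def extract_vuln_ids(text):
    """Return unique vulnerability IDs found in text, upper-cased,
    in order of first appearance."""
    found = []
    for match in VULN_ID.finditer(text):
        vuln_id = match.group(1).upper()
        if vuln_id not in found:
            found.append(vuln_id)
    return found
```

Calling `extract_vuln_ids` on a tweet body returns a list of normalised IDs, which can then be used as keys for grouping tweets and the links they contain.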

A year has passed, and this post shares some observations from that time. It accompanies an update to Talkback, which now has general statistics and individual vulnerability ID lookups.

Disclosure Trends

The combination of research teams, vendor bug bounties and vulnerability brokers appears to have made important changes to the way vulnerability information is publicised.  A stream of technical information that can be correlated to vulnerability IDs is often available from multiple sources.  This includes independent bug-hunters releasing write-ups once one of their vulnerabilities is published and research teams releasing technical advisories covering exploitability.

Although there will always be bug-finders who choose non-disclosure, these observed improvements to the quality of public technical information on vulnerabilities are invaluable.


As the demographic charts show, there is quite a lot of variation in the location and language codes of the users captured in the inventory. The dip in certain high-population countries is likely because Twitter is not as widely used there. Conversely, the relatively high peak in Japan, for example, shows that it is a popular online communication medium there.
Capturing foreign-language items and tying the Google language translation gadget into the Talkback UI pays off because technical write-ups and comments on vulnerabilities are often spread over many languages.  In 2012, language barriers should not be a factor for information such as this.

Vulnerability Hype

The most popular and most heavily weighted vulnerabilities are generally related to a combination of community hype and solid technical research.
Two examples of hyped vulnerabilities are MS11-083 and MS12-020, owing to the nature of both Windows issues; however, at the time of writing, reliable RCE for either issue is not public. The general conclusion on this point is that there are unfortunately a lot of vulnerability fanboys out there.

Funnily enough, the heaviest single tweet captured to date was a Chuck Norris joke regarding MS11-083.

Mining Gold

The biggest strength of the tool is its ability to sift through the noise and pluck out the excellent research coming from individuals and research teams. A few notable examples include j00ru's write-up on his Windows CSRSS privesc (CVE-2011-1281), VUPEN's write-up on a ProFTPd use-after-free (CVE-2011-4130), and Offensive Security's blog post on an Afd.sys privesc (MS11-080).

Certain vulnerability spikes relate directly to observed malware, incidents, etc. Two examples of vulnerabilities with good timelines are CVE-2012-0507 (JDK) and CVE-2011-3544 (JRE).  To my knowledge, current public vulnerability intelligence feeds don't dynamically capture such detailed timelines.
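
To make the idea of a per-vulnerability timeline concrete, the sketch below buckets mentions by day and flags days that stand out. It is only an illustration under stated assumptions, not how Talkback computes anything: the input shape (pairs of ID and date), the helper names `daily_counts` and `spike_days`, and the "at least `factor` times the mean" threshold are all hypothetical choices.

```python
from collections import Counter


def daily_counts(mentions):
    """Group mentions into per-ID timelines.

    mentions: iterable of (vuln_id, day) pairs, where day is a
    datetime.date (assumed input shape for this sketch).
    Returns {vuln_id: Counter({day: number_of_mentions})}.
    """
    timelines = {}
    for vuln_id, day in mentions:
        timelines.setdefault(vuln_id, Counter())[day] += 1
    return timelines


def spike_days(timeline, factor=3.0):
    """Return, in date order, the days whose mention count is at least
    `factor` times the mean count over days with any mentions.
    The threshold rule is an arbitrary illustrative choice."""
    if not timeline:
        return []
    mean = sum(timeline.values()) / len(timeline)
    return sorted(day for day, n in timeline.items() if n >= factor * mean)
```

A timeline built this way could be lined up against the dates of known malware campaigns or incidents to see whether a spike in chatter coincides with them.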

More work is required to help distinguish such items programmatically from the rest.  In the meantime, it is recommended to have a look at the Popular Items section of the Talkback Statistics and the Trending Items section to see other items that received a high rating by the tool.

User Inventory

Out of the large number of users in the inventory, it's important to note that not all have security backgrounds - there's the occasional sysadmin referring to patching systems and IT companies concerned about an issue, but this is still interesting data to capture.  On the whole, however, a large number of security-relevant users are in the inventory simply because, in this field, it's almost inevitable to either mention or at least see a vulnerability ID when on Twitter.
A fun exercise is to simply browse the user inventory using the different views and filters. There is the occasional up-and-coming or unknown researcher who mentions a vulnerability they discovered or researched but who has a tiny Twitter following. It is possible for this tool to be used for recruiting and the like, but more work would be required to make this facet truly powerful, and it sparks curiosity about what others with decent resources are doing in this space.

Next Steps

First on the table is to improve the algorithms for highlighting popular research items, grow the statistics and analytics, and potentially bring on more mediums to expand the geographic scope.

A goal for the near future is to enable viewing general trending items from security-relevant users across the globe.  This work is in progress and still being tuned, but in essence it will be like a view in Twitter that spans many languages, cuts out a lot of the line noise, and is presented with care so that catching up on daily news is efficient.

Suggestions, bugs, and feedback can be sent to