Building a Wordlist using Youtube
Thursday, September 13, 2012 | Author: Deep Flash
A good wordlist is essential for increasing the probability of cracking hashes. Its importance grows depending on the type of algorithm you are attacking.

For fast algorithms such as unsalted MD5 and SHA-1, a bruteforce attack on passwords of up to length 9 is feasible these days using a fairly recent GPU.
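
For instance, in hashcat-style mask syntax, exhausting all length-9 printable-ASCII candidates against a list of raw MD5 hashes would look something like this (the file name is illustrative):

hashcat -a 3 -m 0 md5_hashes.txt ?a?a?a?a?a?a?a?a?a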

A good ruleset brings out the best in your wordlist. Even if your wordlist is not very good, an efficient ruleset increases the probability of cracking hashes with it.
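
To make this concrete, here are a few typical mangling rules in hashcat-style syntax (an illustrative fragment, not a tuned ruleset); each rule is applied to every word in the list:

# capitalize the first letter: password -> Password
c
# append "123": password -> password123
$1$2$3
# substitute a with @: password -> p@ssword
sa@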

Here comes the important point: when you are attacking algorithms like SHA512 (crypt), bcrypt, or WPA/WPA2 handshakes, your reliance on the wordlist increases even more.

To date, these algorithms have not been accelerated (on either GPU or CPU) to an extent that would allow you to bruteforce them, or even to run hybrid mask attacks and word-mangling rule-based attacks.

Your chances of cracking a WPA/WPA2 handshake are only as good as your wordlist. If the passphrase is not in your wordlist, you may as well move on to capturing another WPA/WPA2 handshake.

Note: Elcomsoft has demonstrated that cracking of WPA/WPA2 handshakes can be accelerated using an FPGA or an array of FPGAs, but even this setup does not allow rule-based attacks in a reasonable time span.

This is why we need a better wordlist.

Not long ago, someone wrote a blog post about using Twitter to build a wordlist. The idea was based on the Search API that Twitter provides to developers, which lets you query for a specific keyword and retrieve the N most recent tweets.

While this method is indeed effective, your success also depends on the keywords you search for.

The query URL provided by the Twitter Search API looks like this, where q is the search keyword and rpp is the number of results per page:

http://search.twitter.com/search.json?q=<keyword>&rpp=<count>

It returns a JSON response, which can be parsed easily using the JSON libraries available for your scripting language.

In Perl, the JSON and JSON::XS modules make parsing the response easy.
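
A minimal sketch of the fetch-and-parse step (the keyword is illustrative; the "results" array and "from_user" field come from the Search API's JSON response):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::XS;

my $keyword = 'password';   # illustrative search term
my $rpp     = 100;          # results per page

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("http://search.twitter.com/search.json?q=$keyword&rpp=$rpp");
die $res->status_line unless $res->is_success;

# each entry in the "results" array carries the tweet and,
# in "from_user", the username of the poster
my $data = decode_json($res->decoded_content);
print $_->{from_user}, "\n" for @{ $data->{results} };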

If you have ever used Twitter's Search API to build a wordlist, you might have encountered a Rate Limit.

Rate limits are set by Twitter on the server side to prevent abuse of the Search API, so that developers do not write code that sends a huge number of requests to fetch tweets and ends up slowing down the server.

The advantage for us is that the Twitter Search API does not have strict rate limits. The exact number of requests you can make has not been made public, but it is higher than the limit set for other APIs such as the REST API.

Twitter also provides good documentation, which states that an HTTP header field will be set in the response once you reach the rate limit:

Retry-After: <number of seconds>

This is useful for developers because a script can check for this field in the response headers and wait before executing the next operation.

In Perl, the check might look like this (a minimal sketch; $res is assumed to be the HTTP::Response object returned by the request above):
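
# wait out the rate limit before sending the next request
if (defined(my $wait = $res->header('Retry-After'))) {
    sleep $wait;
}
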
Based on my experiments, the value of the Retry-After header depends on the number of tweets you request from the Search API.

A quick lookup of Twitter on Wikipedia tells us that Twitter has 500 million users (as of its last update). That is the raw material for a strong wordlist.

Another good point about using Twitter's Search API to build a wordlist is that it is dynamic: two requests for the same keyword at different points in time will give different results.

Unlike building a wordlist from Facebook by crawling the Names Directory, where the results are fixed by the directory you are crawling, Twitter is easier because it is more dynamic.

We get many more words with less work :)

If you feel that the Twitter API is not very developer-friendly due to the rate limits it imposes, then welcome to Youtube's GData API :)

I use Youtube to build my wordlist. Every Youtube user has the option to subscribe to other channels.

Each channel has a unique name.

While it is not possible to query a particular channel through the GData API and retrieve all of its subscribers, it is still possible to list all the subscriptions of a particular user.

Youtube's GData API changes often, so there is no telling how long this feature will remain available.

Youtube also imposes other restrictions that make building a wordlist with it more difficult:

1. The maximum number of subscriptions you can view in one request is 50.
2. You cannot view subscriptions beyond index 1001.
3. The JSON response from Youtube is much more complicated than the response from Twitter's Search API. However, once you have figured out its structure, parsing is easy.

The core of such a script is a pagination loop over the subscription feed. A minimal sketch (the v2 endpoint parameters and the yt$username/$t field names follow the GData JSON conventions; the seed username is hypothetical):
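
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::XS;

my $user = 'someuser';   # hypothetical seed username
my $ua   = LWP::UserAgent->new;

# walk the subscription feed 50 entries at a time;
# Youtube serves nothing beyond index 1001
for (my $start = 1; $start <= 1001; $start += 50) {
    my $url = "http://gdata.youtube.com/feeds/api/users/$user/subscriptions"
            . "?v=2&alt=json&max-results=50&start-index=$start";
    my $res = $ua->get($url);
    last unless $res->is_success;

    # the channel names sit under feed/entry/yt$username/$t
    my $entries = decode_json($res->decoded_content)->{feed}{entry}
        or last;    # no more subscriptions
    print $_->{'yt$username'}{'$t'}, "\n" for @$entries;
}
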
Due to the large user base of Youtube, this script would need to run for months.

I shall update the statistics regularly in this blog post.

That's all for now :)

Thanks to Youtube and Twitter for their APIs.

Update #1 (as of 19th September 2012)

Twitter - 7527213 (approx 7.52 Million)

Update #2 (as of 19th October 2012)

Twitter - 8898721 (approx 8.89 Million)
Youtube - 218902

Important Observations:

In the case of Twitter, a lot of duplicate usernames are collected while the script runs. The counts posted above have all duplicates filtered out.

Also, many usernames contain underscore characters in place of the spaces within the name.

To process the wordlist, I remove all underscores and convert all entries to lowercase. More combinations will be generated by rulesets at cracking time.

sed 's/_//g' twitter.txt | perl -pe '$_ = lc($_)'

Listening Now: Lee Jung Hyun - Bakkwo