Exploring My Twitter Social Graph Using the Twitter API
Who should I be following that I’m not?
There is a certain amount of social currency that gets passed with a follow on Twitter, much like the “link juice” that gets passed between websites through interlinkage. Between websites, this linkage is discovered by Google’s web spiders and is a factor that determines search ranking. Within Twitter, I think that a follow implies a type of endorsement from one account to another.
Twitter thinks so too. However, it seems like Twitter is basing this currency on who the people you follow (henceforth your “friends”, a term which always creeps me out to use) are following, and not necessarily on who your followers are following.
Thinking about my social graph in this manner led me to a few simple questions:
1. Who were my followers following the most?
2. Who were my friends following the most?
3. How much divergence in the two groups is there?
Using some Python (and the twitter package), I whipped up a script that pulls the user profile information of any account that is among the top 1,000 most-followed accounts among my followers and/or my friends.
The code outline is pretty simple:
1. Get a list of the Twitter user id’s of both my friends and my followers. Combine them into a set.
2. Iterate through each member of the set, pulling the Twitter user id’s that each of those accounts follows into a Counter object: one for friends of friends, one for friends of followers, or both if the account is both a friend and a follower.
3. Combine the top 1,000 user_id’s by count from each Counter object into another set.
4. Pull down the extended user information from the API.
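A minimal sketch of steps 1 through 3, not the original script. The function names are made up for illustration, and `get_friend_ids` is a hypothetical stand-in for whatever API call returns the ids an account follows; it is passed in as a parameter so the counting logic can be run without touching Twitter at all.

```python
from collections import Counter


def count_follows(my_friends, my_followers, get_friend_ids):
    """Tally who my friends follow and who my followers follow.

    get_friend_ids is a hypothetical callable: user id -> iterable of
    the user ids that account follows.
    """
    friends = set(my_friends)
    followers = set(my_followers)
    friends_of_friends = Counter()
    friends_of_followers = Counter()

    # Step 1 and 2: walk the combined set so each account is fetched once.
    for account_id in friends | followers:
        followed = set(get_friend_ids(account_id))
        if account_id in friends:
            friends_of_friends.update(followed)
        if account_id in followers:
            friends_of_followers.update(followed)
    return friends_of_friends, friends_of_followers


def top_ids(counts, n=1000):
    """Step 3 helper: the n most-followed user ids in a Counter, as a set."""
    return {user_id for user_id, _ in counts.most_common(n)}


def candidate_ids(friends_of_friends, friends_of_followers, n=1000):
    """Step 3: union of the top-n ids from each Counter."""
    return top_ids(friends_of_friends, n) | top_ids(friends_of_followers, n)
```

The set returned by `candidate_ids` is what would then be handed to the user-lookup call in step 4.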
I should note that I am not pulling down the relationships between accounts from users with protected timelines. Since this information is part of the protected dataset, I do not include it in my counts. This is a very small factor in the overall counts (<1%), but I feel like using that data, even in aggregate, violates the trust of the users that I’m following that protect their timeline.
Since Twitter rate-limits usage of their API, I had to write code to rate-limit my calls. This seriously slows down the process – the entire data pull takes a full day to run.
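One simple way to do this kind of rate limiting is to space calls out evenly. The sketch below is an assumption about the approach, not the code from the original script; the `Throttle` class and its parameters are hypothetical, and the numbers in the usage note are placeholders rather than Twitter's actual limits.

```python
import time


class Throttle:
    """Space out calls so at most max_calls happen per window_seconds."""

    def __init__(self, max_calls, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between two consecutive calls.
        self.min_interval = window_seconds / max_calls
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` immediately before each API request, with something like `Throttle(max_calls=15, window_seconds=900)`, keeps requests at least a minute apart. At that pace it is easy to see how a pull covering well over a thousand accounts stretches across a day.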
The ranked order of accounts is accessible on my site on my Twitter Social Graph page. The page defaults to the most-followed accounts of my followers (which naturally puts me at the top, since all of my followers follow me). You can also see the top accounts followed by my friends.
That interface adequately answers questions 1 and 2. I can use these pages to find potential friends that have the implicit endorsement of my friends or followers.
Question 3 is also easily answered. The two groups formed a set with 1,292 records (as of the first generation of the dataset, which was created on 12/20/14). Since each group contributes its top 1,000 records and every record in the set must belong to one or both groups, the overlap between the groups is 1,000 + 1,000 − 1,292 = 708 records. Not a lot of divergence.
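The overlap figure is just inclusion-exclusion on two sets. A quick check, assuming each group contributed exactly 1,000 records (the function name is made up for illustration):

```python
def overlap_size(size_a, size_b, union_size):
    """Records shared by two groups: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return size_a + size_b - union_size


print(overlap_size(1000, 1000, 1292))  # 708
```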
This is a pretty boring way to end a post, so let’s have some fun with R using this new dataset.
Code here, inspired by this post:
library(tm)            # Corpus, tm_map, stopwords and the text transformations
library(SnowballC)     # stemmer used by stemDocument
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

friend_descriptions <- friends
friend_corp <- Corpus(VectorSource(friend_descriptions))

# Normalize the text: lowercase, strip punctuation and stopwords, stem, tidy whitespace
friend_corp <- tm_map(friend_corp, tolower)
friend_corp <- tm_map(friend_corp, removePunctuation)
friend_corp <- tm_map(friend_corp, removeWords, stopwords("english"))
friend_corp <- tm_map(friend_corp, stemDocument)
friend_corp <- tm_map(friend_corp, stripWhitespace)
friend_corp <- tm_map(friend_corp, PlainTextDocument)

wordcloud(friend_corp, scale=c(6,0.5), max.words=250, random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(6, "Dark2"))
Since I have the data sets extracted, why don’t we look at the relationship between followers, friends, and verified status of each group? Sounds like a perfect situation to use an animated scatterplot across the datasets!
There is nothing I don’t love about this chart, from the logarithmic scale to the multi-dimensionality to the animation across datasets. The chart also unveils a few things about my social graph:
1. A lot of accounts bang right up against that friend:follower ratio limit that Twitter imposes (which looks like it is pegged right at 1).
2. The accounts that do are much less likely to be verified. Verified accounts typically have a much, much lower friend:follower ratio. I’d guess that they don’t NEED to follow a lot of accounts to be well-known.
3. My friends are following fewer high friend:follower accounts and fewer 100,000+ follower accounts than my followers do.
Here’s the code for this:
# Repeat this for each dataset, changing the plot name from plot1.png to plot2.png and plot3.png
library(ggplot2)

png("plot1.png")
ggplot(followers, aes(y=followers_count, x=friends_count)) +
  geom_smooth() +
  geom_point(shape=19, alpha=3/10, aes(color=factor(verified))) +
  scale_color_manual(values = c("orange", "purple"),
                     name = "Verified Status",
                     labels = c("Not Verified","Verified")) +
  scale_y_log10(limits=c(1,10000000),
                breaks=c(1,10,100,1000,10000,100000,1000000,10000000),
                labels=c(1,10,100,"1,000","10,000","100,000","1,000,000","10,000,000")) +
  scale_x_log10(limits=c(1,10000000),
                breaks=c(1,10,100,1000,10000,100000,1000000,10000000),
                labels=c(1,10,100,"1,000","10,000","100,000","1,000,000","10,000,000")) +
  labs(title = "Most-Followed Accounts of Followers",
       x = "Log10 Following Count",
       y = "Log10 Followers Count") +
  theme_bw(base_size = 12, base_family = "Helvetica-Narrow") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
dev.off()

# Stitch the PNG frames into an animated GIF with ImageMagick
system('C:/"Program Files"/ImageMagick-6.9.0-Q16/convert.exe -delay 200 *.png example_1.gif')
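The friend:follower ratio behind the chart observations can also be checked numerically. This is a rough Python sketch, not part of the original analysis: the field names `friends_count`, `followers_count`, and `verified` match the columns used in the plot, while the function names and the idea of comparing medians are assumptions made for illustration.

```python
from statistics import median


def friend_follower_ratio(account):
    """friends_count / followers_count, or None when there are no followers."""
    followers = account["followers_count"]
    if followers == 0:
        return None
    return account["friends_count"] / followers


def median_ratio_by_verified(accounts):
    """Median friend:follower ratio for verified and unverified accounts."""
    groups = {True: [], False: []}
    for account in accounts:
        ratio = friend_follower_ratio(account)
        if ratio is not None:
            groups[bool(account["verified"])].append(ratio)
    # A group with no usable accounts maps to None instead of raising.
    return {flag: (median(ratios) if ratios else None)
            for flag, ratios in groups.items()}
```

Running `median_ratio_by_verified` over the list of user records from each dataset would show whether verified accounts really do sit well below a ratio of 1.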
— Tom Miller (@tmllr) December 11, 2014