Adam Green
Twitter API Consultant
140dev@gmail.com
781-879-2960
@140dev

This week we got crushed by the State of the Union speech. We normally get about 30,000 to 50,000 tweets per day in the 2012twit.com database, and our largest server can handle that without showing any appreciable load. During the SOTU, tweet volume exploded: we got 500,000 tweets in about 4 hours. I was able to keep the server going by shutting down some processes that weren’t needed, but it was a challenge. This issue of bursts of tweets seems to be getting worse. In the case of Twitter and politics, people are getting used to talking back to the TV through Twitter. With 9 months left until the election, I needed to find some solutions.

I spent a lot of time over the last 2 days trying to find the problem, and discovered that it was not parsing the tweets that was killing us, but inserting the raw tweet data into the json_cache table. I use a two-phase processing system: the raw tweet delivered by the streaming API is inserted as fast as possible into a cache table, and then a separate parsing phase breaks it out into a normalized schema. You can get the basic code for this as open source.

It looks like Twitter has been steadily increasing the size of the basic payload that it sends for each tweet in the streaming API. That makes sense, since people are demanding more data. Yesterday they announced some insane scheme where every tweet will include data about countries that don’t want tweets with specific words to be displayed. This will only get worse.

I realized that I have never actually needed to go back and reparse the contents of json_cache, and I had long ago added purging code to my 2012twit system to delete anything in that table older than 7 days. I tried clearing out the json_cache table on my server and modifying the code to delete each tweet as soon as it was parsed. This cut the size from several hundred thousand rows on average in this table to about 50. The load on that server dropped right away and during the GOP debate last night, the load stayed very low.

One of the account building services we perform for clients is building lists of suggested followers. Unless you are a celebrity, the only way to build a large follower list is to follow others. Twitter offers suggestions for follows, but you aren’t given any control over the criteria for selection. I recently built a list of thousands of Mommy tweeters for a client who wants to engage with this group, but the procedure I used can be applied to any demographic or interest group.

To identify the best people to follow, I start with a good list of accounts with a very targeted interest, and then find out who follows the most people on this list. To find people interested in parenting issues, I started with this list of 52 mommy tweeters. I read all the screen names in the list into a MySQL database with the /lists/members API call. These accounts would be the “seeds” of the rest of my lead generation. Before working with this list I removed @JennyMcCarthy, since her celebrity following of 500,000 people distorted the overall numbers.

I then ran a script that collected the user id of every follower of everyone on this list with the /followers/ids API call. This API returns 5,000 followers at a time, but it still took a couple of hours to run. For each follower found I inserted a record into a table that held the screen name of the person they followed and their user id. If someone followed 10 of the mommies in this list, they were inserted 10 times. Once this was done, I had a list of over 950,000 follow relationships. Asking MySQL for the distinct number of user ids told me that there were over 650,000 unique accounts I could draw on as leads. It looks like mommy tweeting is a very popular subject on Twitter.
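The collection step might look something like the sketch below. It is only an outline under assumptions: fetch_page is a hypothetical stand-in for whatever code actually calls /followers/ids, assumed to return one page of ids plus a cursor for the next page, with a cursor of 0 meaning there are no more pages.

```python
from collections import Counter

def collect_follower_ids(seed, fetch_page):
    """Gather every follower id for one seed account, one page at a time.

    fetch_page(seed, cursor) is a hypothetical callable returning
    (ids, next_cursor); a next_cursor of 0 is assumed to end the paging.
    """
    ids = []
    cursor = -1  # assumed starting cursor for the first page
    while cursor != 0:
        page, cursor = fetch_page(seed, cursor)
        ids.extend(page)
    return ids

def build_relationships(seeds, fetch_page):
    """One (seed, follower_id) record per follow, so a user who follows
    10 seeds appears 10 times."""
    relationships = []
    for seed in seeds:
        for user_id in collect_follower_ids(seed, fetch_page):
            relationships.append((seed, user_id))
    return relationships

def seeds_followed_per_user(relationships):
    """Count how many seed accounts each follower id follows."""
    return Counter(user_id for _seed, user_id in relationships)
```

The length of the relationship list corresponds to the total number of follow relationships, and the number of keys in the counter corresponds to the distinct user ids.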

What I then needed to do was filter out the accounts that weren’t worth following, such as users who hardly ever tweet, or who use the default avatar image. Unfortunately the REST API only gives you the user id of account followers, and retrieving the rest of the user info for an account is greatly restricted by rate limits. There was no way I could get the info on 650,000 accounts. So I did a pre-screen by selecting the 20,000 users who followed the most members of the mommy list with this SQL query:
SELECT COUNT( * ) AS cnt, user_id
FROM mommy_list_followers
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20000

The result of this query was inserted into a separate table that could be used to gather a complete profile on these 20,000 users. I then ran a script that used the /users/lookup API call to get account info on 100 users at a time. Now I was ready to select the best potential follows. I did another SQL query that deleted users who:

  • Did not have an account description
  • Used the default avatar image
  • Had fewer than 500 friends or followers
  • Had only tweeted 50 times or fewer
  • Had a ratio of friends/followers that was less than 50%
  • Had a protected account
  • Were already a friend or follower of my client’s account
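The screening rules above can also be expressed as a single predicate. This is a sketch rather than the SQL that was actually run: the field names are assumed to match the user objects returned by /users/lookup, client_connections is a hypothetical set of ids already connected to the client, and the "500 friends or followers" rule is read here as requiring at least 500 of each.

```python
def is_good_lead(user, client_connections):
    """Return True if a user profile passes every screening rule.

    user is a dict of profile fields (names assumed); client_connections
    is a set of user ids that already follow or are followed by the client.
    """
    if not (user.get("description") or "").strip():
        return False  # no account description
    if user.get("default_profile_image"):
        return False  # still using the default avatar
    friends = user.get("friends_count", 0)
    followers = user.get("followers_count", 0)
    if friends < 500 or followers < 500:
        return False  # too few friends or followers (assumed reading)
    if user.get("statuses_count", 0) <= 50:
        return False  # tweeted 50 times or fewer
    if friends / followers < 0.5:
        return False  # friends/followers ratio under 50%
    if user.get("protected"):
        return False  # protected account
    if user["id"] in client_connections:
        return False  # already connected to the client
    return True
```

The ratio check is safe from division by zero because any profile with fewer than 500 followers has already been rejected by that point.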

The result was a very clean list of 16,500 accounts who each followed at least 6 of the original mommy list members. I made this available to the client as a web page report that made it possible to follow any of these leads with a single click. This is a small sample.

One of the coolest clients we have now is the Buddy Roemer for President campaign. Buddy may not be the best known candidate, but he does have some unique campaign issues. He’s fighting money in politics, and as part of that mission he is limiting himself to a max of $100 in donations per person. That means that he has to be creative about getting free marketing. We’ve been building a complete Twitter campaigning system for Buddy over the last three months, which includes a detailed reporting system on his and his supporters’ tweets, and a collection of automation tools. The most successful tool we’ve created has been a Donate a Tweet system.

The donate a tweet system is pretty straightforward technically. Here is how it works:

  • A “Donate Tweets to Buddy” Twitter app was created with read and write privileges.
  • A Twitter user visits the donate a tweet page on BuddyRoemer.com and clicks the link to become a donor.
  • They are directed to the Twitter OAuth signup page where they authorize the app.
  • They are directed back to a Thank You page at BuddyRoemer.com and their OAuth tokens are saved in a MySQL database.
  • Once per day the Roemer campaign uses the admin system we created to enter a tweet for that day.
  • A cronjob runs once every 3 minutes and posts the tweet to each of the donor accounts.

One thing we have to watch for is users who have revoked the app. Twitter doesn’t notify us directly when that happens, but there is an easy trick. If you try performing a REST API function for a user that has revoked an app, the API returns an error code of 401. My code looks for that error every time a tweet is posted to an account, and flags the user as revoked, so that account won’t be used again. If the user comes back later and authorizes the app again, the revoked flag is removed in the database.
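Here is a rough sketch of that revocation check. It is not the production code: sqlite3 stands in for MySQL, the donors table and its columns are assumed, and post_tweet is a hypothetical callable that posts on a donor's behalf and returns the HTTP status code.

```python
import sqlite3

HTTP_OK = 200
HTTP_UNAUTHORIZED = 401

def create_donor_table(conn):
    # Table and column names are assumptions made for this example.
    conn.execute("CREATE TABLE donors ("
                 " user_id INTEGER PRIMARY KEY,"
                 " oauth_token TEXT NOT NULL,"
                 " oauth_secret TEXT NOT NULL,"
                 " revoked INTEGER NOT NULL DEFAULT 0)")

def save_authorization(conn, user_id, token, secret):
    """Store tokens after the OAuth step; re-authorizing clears the revoked flag."""
    conn.execute(
        "INSERT OR REPLACE INTO donors"
        " (user_id, oauth_token, oauth_secret, revoked) VALUES (?, ?, ?, 0)",
        (user_id, token, secret),
    )
    conn.commit()

def post_to_donors(conn, tweet_text, post_tweet):
    """Post the tweet to every active donor, flagging any 401 as revoked.

    post_tweet(token, secret, text) is a hypothetical callable that
    returns the HTTP status code of the API response.
    """
    donors = conn.execute(
        "SELECT user_id, oauth_token, oauth_secret FROM donors WHERE revoked = 0"
    ).fetchall()
    posted = 0
    for user_id, token, secret in donors:
        status = post_tweet(token, secret, tweet_text)
        if status == HTTP_UNAUTHORIZED:
            # The user has revoked the app, so skip this account from now on.
            conn.execute("UPDATE donors SET revoked = 1 WHERE user_id = ?", (user_id,))
        elif status == HTTP_OK:
            posted += 1
    conn.commit()
    return posted
```

Other error codes are deliberately left alone in this sketch, so a temporary API failure does not flag an account as revoked.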

A really nice viral aspect of donated tweets is the control Twitter gives you over the source or “via” field of the tweets you post through the API. You can enter any text for this, and add a URL that is linked to that text. This makes each donated tweet a mini advertisement. In our case we point this URL back to the donate page, so more people can sign up.

So how successful has this tweet donation system been? There are several ways to measure this:

  • Since it went online on New Year’s Eve, 227 people have signed up, and of these only 24 have revoked the app.
  • These accounts have a total reach of over 97,000 followers.
  • In the 3 weeks of running the donated tweets system the number of new followers and mentions of Buddy Roemer have increased 3x over the 3 weeks before it was in place. The number of retweets of @BuddyRoemer has increased by 2.5x.

This is a huge multiplier effect. Sending donated tweets to 200 accounts is generating 2,000 to 3,000 mentions of the candidate each day. It is clear that we are not just inflating the numbers through our own tweeting. Each donated tweet is spawning new discussions among users who hadn’t heard of Buddy before.

We’ve been working on a college football recruiting site called DirectSnap.com for a couple of months, and the most interesting aspect of the technology behind this site is the quality control algorithm I had to develop. Most of the tweet streams we work on, such as 2012twit.com, are based on collecting tweets for either a set of screen names or real names that are distinctive, such as politicians. When you find a match for Newt Gingrich or Mitt Romney, you can be fairly sure you have the right person.

In the case of DirectSnap, the tweet collection is based on the first and last name of 250 high school football players. Right away I knew I would have a problem when I found Michael Moore in the list of potential recruits. Randy Johnson was going to be even trickier, since the baseball player with this name was likely to be tweeted about by the same sports fans as the football recruit we were tracking. Identifying college teams is also tricky. For example, the word ‘Florida’ in a tweet with a player’s name could refer to the University of Florida or Florida State University.

The solution I came up with was creating a list of exclusion keywords for each player and team. If a tweet contains ‘Michael Moore’, but it also has words like fat, hypocrite, film, or liberal, it probably is not about the football player. A tweet with a player’s name is assigned to the University of Florida if it contains ‘Florida’, but not ‘Florida State’. This first level of screening did a good job of filtering out false positives, such as the wrong Michael Moore, but we wanted to curate the tweets automatically to select the highest quality. The goal was to end up with a tweet stream that was much more interesting than what you could get with Twitter’s search.
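The first level of screening could be sketched like this. The function names and the shape of the team rules are assumptions made for illustration, and real matching would likely need more care with punctuation and nicknames.

```python
import re

def words_in(text):
    """Lowercased set of the words in a tweet."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def mentions_player(tweet_text, player_name, exclusion_words):
    """True if the tweet names the player and contains no exclusion keyword."""
    if player_name.lower() not in tweet_text.lower():
        return False
    excluded = {word.lower() for word in exclusion_words}
    return not (words_in(tweet_text) & excluded)

def assign_team(tweet_text, team_rules):
    """Return the first team whose required phrase appears in the tweet
    without any of its excluding phrases, or None if no team matches.

    team_rules maps a team name to (required_phrase, [excluding_phrases]).
    """
    text = tweet_text.lower()
    for team, (required, excluding) in team_rules.items():
        if required.lower() in text and not any(p.lower() in text for p in excluding):
            return team
    return None
```

With rules such as "University of Florida" requiring "Florida" but excluding "Florida State", a tweet about a Florida State visit is skipped by the first rule and can be picked up by a separate Florida State University rule.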

To do this we added a set of high quality words to the quality screen, like the team position or hometown name of each player. We found that tweets with this extra information were generally from users who were serious about reporting details, not just random fans chanting a player’s name repeatedly. We used these quality words in two ways. Each time a quality word was found in a tweet, 1 point was added to a quality score for the tweet and for the user who sent the tweet. This allows us to select tweets for display that have a minimum quality score, and that are from a user with a minimum quality score.
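A minimal sketch of this scoring scheme follows. The class name and the threshold values are assumptions chosen for the example, not the settings used on DirectSnap.

```python
import re
from collections import defaultdict

class QualityScreen:
    """Tracks a quality score per tweet and a running score per user."""

    def __init__(self, quality_words, min_tweet_score=1, min_user_score=2):
        # Thresholds are illustrative assumptions.
        self.patterns = [
            re.compile(r"\b" + re.escape(word.lower()) + r"\b")
            for word in quality_words
        ]
        self.min_tweet_score = min_tweet_score
        self.min_user_score = min_user_score
        self.user_scores = defaultdict(int)

    def record(self, user, tweet_text):
        """Score one tweet: 1 point per quality word found, added to both
        the tweet's score and the sending user's running total."""
        text = tweet_text.lower()
        score = sum(1 for pattern in self.patterns if pattern.search(text))
        self.user_scores[user] += score
        return score

    def should_display(self, user, tweet_score):
        """Show a tweet only if both the tweet and its author clear the minimums."""
        return (tweet_score >= self.min_tweet_score
                and self.user_scores[user] >= self.min_user_score)
```

A user who only repeats a player's name never accumulates points, so their tweets stay below the display threshold even when one of them happens to contain a single quality word.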

To see how well this system works, try comparing the DirectSnap page for Michael Moore, and Twitter search for the same words. My experience is that users find false positives very upsetting. They think computers actually understand what they are searching for, and when they see a false positive, the reaction is always that the website is “stupid”. My favorite example of this is when people complain about Google Alerts for their own name returning blog posts or tweets they have written. The reaction is usually “How stupid can Google be? Doesn’t it know that I don’t want to be alerted about my own writing?” On the other hand, they never seem to be upset about missing results. So ending up with a subset of all possible matches, but with no visible false positives is always the best goal.

Exceeding the search API rate limit

by Adam Green on December 10, 2011


We recently built a cool site called This R That for a client.

Besides having a great UI that my son, Zach, built, it also has a neat architecture for a Twitter search site. The major weakness of the Twitter search API is that rate limiting is based on the IP making the request. While Twitter won’t reveal the actual limit, it is believed to be about 200 an hour. If you build a web page that takes the search request and sends it to a server to do the work, that server’s IP will be capped at the rate limit across all users. A popular site would reach that limit fast.

The solution we used was to do the search with JavaScript from within the user’s browser. Then we used JavaScript to parse the JSON results and display them as tweet streams. With this model, the IP of the user’s computer is applied to the rate limit. So each user can do up to 200 search requests every hour, or more if Twitter is feeling generous. Any number of users can be running the same web page simultaneously.

I got a call the other day from a developer who was receiving various 500 series errors when trying to gather large amounts of Twitter user data. The API has a number of errors in the 500 range, all of which generally mean that the Twitter servers are overloaded. The API is built on the principle of staying alive while handling as many requests as possible. If the load gets too high or a request takes too long to process, the request is dumped and one of the 500 errors is returned.

The specific requirement for this developer was getting information on all the followers of his app’s users. He was doing this in a brute force fashion every 24 hours: first looking up all the followers 5,000 at a time with the /followers/ids call, then getting the profile data for each of these followers 100 at a time with /users/lookup. This is a very intensive use of the API, and it is exactly what Twitter doesn’t want you to do. Look at the hint they are offering by returning 5,000 follower ids in a single call, but doling out profile data on only 100 users. They are telling us not to request too much user data.

Whenever possible you should be caching data you get from the API. User profiles are a perfect example. Instead of requesting data on every user every 24 hours, it is much better to store user profiles in a database, and request this data less often. Cutting back to once every 7 days reduces the number of API calls by 86%. I recommended that he adopt this type of caching and then check the user ids he receives from /followers/ids against the user database table. If the user is new or hasn’t been updated recently, then request the profile with /users/lookup.
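The 86% figure follows from refreshing each profile once every 7 days instead of daily: 1 - 1/7 is roughly 0.857. The check against the user table could be sketched as below, where cached_at is a hypothetical mapping from user id to the time the profile was last stored, standing in for a query against the database.

```python
from datetime import datetime, timedelta

def ids_needing_lookup(follower_ids, cached_at, now, max_age=timedelta(days=7)):
    """Return the ids that are new or whose cached profile is older than max_age.

    cached_at maps user id -> datetime of the last profile update (assumed
    to come from the user table); ids missing from it have never been cached.
    """
    stale = []
    for user_id in follower_ids:
        last_update = cached_at.get(user_id)
        if last_update is None or now - last_update > max_age:
            stale.append(user_id)
    return stale

def batches(ids, size=100):
    """Split ids into groups no larger than the per-call limit."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```

Only the ids returned by ids_needing_lookup would then be sent to /users/lookup, in batches of up to 100.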

It also helps to be opportunistic about caching. Many of the API calls return a user’s profile in the payload. If you get this data anywhere in your code, take advantage of this opportunity to cache it.

The other solution to 500 errors is to request less data each time. As I said, a 500 error is often a time out. While the /users/lookup call allows you to request 100 users at a time, try backing off to just 50 at a time. It will take more API calls, but you’ll have a better chance of getting results without an error. This type of logic should be built into your code. If a request triggers a 500 error, scale back the quantity requested and repeat the call.
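One way to build that logic in is sketched below. The lookup callable is hypothetical: it is assumed to take a list of user ids and return the HTTP status code along with the profiles. Halving the batch size and the floor of 10 are arbitrary choices for the example.

```python
def lookup_with_backoff(user_ids, lookup, start_size=100, min_size=10):
    """Fetch profiles for all user_ids, shrinking the batch size on 500 errors.

    lookup(batch) is a hypothetical callable returning (status_code, profiles).
    """
    profiles = []
    size = start_size
    position = 0
    while position < len(user_ids):
        batch = user_ids[position:position + size]
        status, result = lookup(batch)
        if status == 200:
            profiles.extend(result)
            position += len(batch)
        elif 500 <= status < 600:
            if size <= min_size:
                raise RuntimeError("still getting %d at the minimum batch size" % status)
            # Scale back the quantity requested and repeat the same call.
            # Real code would probably also pause before retrying.
            size = max(min_size, size // 2)
        else:
            raise RuntimeError("unexpected status %d" % status)
    return profiles
```

The smaller batch size is kept for the rest of the run in this sketch, on the assumption that the servers stay busy for a while once they start timing out.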

New iPhone app based on the 2012twit database

October 26, 2011

We recently released our first iPhone app based on the data we collect for the 2012twit.com site. I love the idea of reusing the same database as the back-end for multiple applications. We now use this database for the website, the iPhone app, and it provides data for our election analysis on the 140elect.com blog. This [...]

Collecting #OWS tweets with the 140dev framework

October 24, 2011

Our work with Twitter and politics has now moved beyond the 2012 election. We just set up a tweet collection database to track the Occupy Wall Street movement. It uses the 140dev framework to collect all tweets containing #ows, #occupy, and #occupywallstreet. This will be used to document our tools and methodology to automate a [...]

Primer for Automated Twitter Engagement

October 18, 2011

A lot of my Twitter consulting for clients has involved automating the process of Twitter engagement. Now I’m finally documenting this as part of the Twitter politics work I’m doing with my son. We are building up a political account called @4more, and writing up the whole process on our 140elect.com blog.

Download the latest version of Phirehose

September 30, 2011

Twitter has finally converted the Streaming API over to SSL connections only. This means that your copy of the 140dev Framework will stop connecting until you upgrade to the latest version of the Phirehose library. You can do this in two ways. You can either download the entire 140dev Framework file for the DB Server [...]

Join me at the Boston PHP Meetup

September 13, 2011

I’ll be giving a presentation on ‘Learning Twitter API Programming’ this Wednesday night (September 14th) at 7:00pm in Cambridge, Mass. My goal is to show the kind of sites that can be built, and the code architecture and database model needed. If you can’t come to the meeting, I’ve posted the notes for download.

SocStudies.com: A new approach to college study with Twitter

September 1, 2011

The thing I find so great about doing Twitter consulting is the wide range of vertical applications that are completely open for new development. One of them is higher level education. We’ve just launched a new site called Social Studies that applies the techniques of tweet aggregation to help students study collectively. This work was [...]
