The Effects of API-Rate-Limits and Crawling Detection on Crawlability of Popular Social Media Networks
Obtaining good datasets for social media analytics research is hard. Service providers rarely release data anymore, and the few public datasets that exist are outdated. Social media also changes rapidly, which makes synthetic data useless.
Crawling your own data solves this problem. However, given rate limiting and possible crawling detection, it is not known what the most time-efficient way to collect such datasets is.
- Find out what the official API documentation and terms of service of Twitter, Facebook and Google+ say about rate limits and crawling.
- Investigate related work.
- Build a crawler for each of Twitter, Facebook and Google+ that uses the standard REST APIs and crawls comparable datasets (Twitter: people someone follows; Facebook: friends; Google+: list of people someone has added to one or more circles). The goal is to crawl as much data as possible in a fixed time frame, using various methods, in order to answer the questions of interest.
- Do the same using a screen scraper instead of the APIs.
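One of the "various methods" a crawler can vary is how it distributes its requests inside a rate-limit window. The sketch below only plans request times and does not contact any service. It assumes a hypothetical limit of 15 requests per 900-second window; the function names and numbers are illustrative and not taken from any of the three APIs.

```python
def plan_requests(limit, window_s, windows, burst=False):
    """Return planned send times (in seconds) for `windows` consecutive windows.

    burst=False: spread `limit` requests evenly across each window.
    burst=True:  send all `limit` requests at the start of each window.
    """
    times = []
    for w in range(windows):
        start = w * window_s
        for i in range(limit):
            offset = 0.0 if burst else i * window_s / limit
            times.append(start + offset)
    return times


def max_per_window(times, window_s):
    """Largest number of planned requests that fall into one fixed window."""
    counts = {}
    for t in times:
        index = int(t // window_s)
        counts[index] = counts.get(index, 0) + 1
    return max(counts.values())


# Hypothetical limit: 15 requests per 15-minute window, planned for 2 windows.
LIMIT, WINDOW_S = 15, 900
even = plan_requests(LIMIT, WINDOW_S, windows=2)
burst = plan_requests(LIMIT, WINDOW_S, windows=2, burst=True)
```

Both plans send the same number of requests and stay within the assumed limit per window; they differ only in timing, which is the variable such an experiment would compare.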
In particular, the following questions are of interest:
- Are there crawling countermeasures at all? Can you get blocked before reaching the rate limit of the API?
- Does the distribution of the requests matter for this?
- Do certain user agents get blocked more easily than others?
- Can you get more data by modifying your source identification?
- When using IPv6, does it matter whether you change only the host identifier or the whole address, including the prefix?
- If you had to crawl the largest possible amount of data in a given time frame, which methods would you use, and which of the three popular social media networks (Facebook, Twitter, Google+) would you crawl?
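The IPv6 question distinguishes two ways of changing a source address. A minimal sketch using Python's standard `ipaddress` module can make the difference concrete. It assumes the common split into a 64-bit prefix and a 64-bit host (interface) identifier and uses addresses from the documentation range `2001:db8::/32`. The helper names are hypothetical.

```python
import ipaddress

# Assumption: a /64 subnet, so the lower 64 bits are the host identifier.
HOST_BITS = 64
HOST_MASK = (1 << HOST_BITS) - 1


def split(addr):
    """Return (prefix, host_id) as integers for an IPv6 address string."""
    value = int(ipaddress.IPv6Address(addr))
    return value >> HOST_BITS, value & HOST_MASK


def with_host_id(addr, host_id):
    """Keep the prefix of `addr`, replace only the host identifier."""
    prefix, _ = split(addr)
    return ipaddress.IPv6Address((prefix << HOST_BITS) | (host_id & HOST_MASK))


def with_prefix(addr, prefix_addr):
    """Keep the host identifier of `addr`, take the prefix from `prefix_addr`."""
    new_prefix, _ = split(prefix_addr)
    _, host_id = split(addr)
    return ipaddress.IPv6Address((new_prefix << HOST_BITS) | host_id)


base = "2001:db8:0:1::10"
same_prefix = with_host_id(base, 0x20)             # only the host part changes
other_prefix = with_prefix(base, "2001:db8:0:2::")  # only the prefix changes
```

If a service tracks clients per /64 prefix, `same_prefix` would still look like the same source, while `other_prefix` would not. Whether any of the three networks behaves this way is exactly what the question asks.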
- Jörg Daubert
Research areas: privacy-trust, Telecooperation, SPIN: Smart Protection in Infrastructures and Networks