• (Dis)belief dataset used in [ ICWSM’20 ].
    The dataset contains 6,000+ tweets annotated with belief and disbelief labels.

  • Social Media Posts dataset used in [ CSCW’18a ].
    The dataset contains 5,000+ social media posts and their veracity judged by Snopes or PolitiFact.

  • User Comments dataset used in [ CSCW’18a ] and [ ICWSM’20 ].
    The dataset contains 2,600,000+ social media comments in reply to the above posts.
    Facebook, YouTube, Twitter

  • ComLex lexicon used in [ CSCW’18a ].
    The lexicon contains 300 categories, but only the top 56 named categories are human-validated.

Content Moderation

  • YouTube Comments dataset used in [ ICWSM’19 ] and [ AAAI’20 ].
    The dataset contains 84,000+ YouTube comments, annotated as described in our papers.


  • Fact-Checks dataset used in [ WWW’20 ].
    The dataset contains 6,000+ URLs of fact-checks and their claims, claimants and verdicts.


  • Drivers’ Trajectories dataset used in [ WWW’18 ].
    Unfortunately, due to Uber’s and Lyft’s Terms of Service, the dataset is not available to the public. A visualization of Uber and Lyft drivers based on this dataset has been made public by the San Francisco County Transportation Authority.