Twitter API Web Scraping & Data Analysis
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import tweepy
import json
# load given twitter archive
archive = pd.read_csv('twitter-archive-enhanced.csv')
# count number of tweets from the archive (number of tweets with ratings only)
archive.shape[0]
2356
# sort and set df showing latest tweets first
archive.sort_values('timestamp',ascending=False,inplace=True)
archive.head() #inspect the data
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
# download image predictions from Udacity's server
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)
# load the TSV file into a df called images, using tab as the separator
images = pd.read_csv('image-predictions.tsv', sep='\t')
# check to see if it was imported correctly
images.head()
| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
# count tweets from image-predictions.tsv (number of tweets with images)
images.shape[0]
2075
# set keys to establish connection (*removed for submission)
consumer_key = 'XXXXX'
consumer_secret = 'XXXXX'
access_token = 'XXXXX'
access_token_secret = 'XXXX'
# configure connection (http://docs.tweepy.org/en/v3.5.0/auth_tutorial.html)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# create an API object
api = tweepy.API(auth,
                 parser=tweepy.parsers.JSONParser(),
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
# set 2 lists and download Tweepy Status objects (items) via API
# list contains tweets w/ tweet ids
tweets_list = []
# list contains ids w/o tweets
no_tweets_list = []
count = 0
for tweet_id in archive['tweet_id']:
    count += 1
    try:
        tweets_list.append(api.get_status(tweet_id))
    except Exception as e:
        no_tweets_list.append(tweet_id)
    # keep track of items processed (should say 2350 @ the end)
    if count % 50 == 0:
        print(str(count) + " items processed ")
50 items processed
100 items processed
150 items processed
200 items processed
250 items processed
300 items processed
350 items processed
400 items processed
450 items processed
500 items processed
550 items processed
600 items processed
650 items processed
700 items processed
750 items processed
800 items processed
850 items processed
900 items processed
Rate limit reached. Sleeping for: 732
950 items processed
1000 items processed
1050 items processed
1100 items processed
1150 items processed
1200 items processed
1250 items processed
1300 items processed
1350 items processed
1400 items processed
1450 items processed
1500 items processed
1550 items processed
1600 items processed
1650 items processed
1700 items processed
1750 items processed
1800 items processed
Rate limit reached. Sleeping for: 734
1850 items processed
1900 items processed
1950 items processed
2000 items processed
2050 items processed
2100 items processed
2150 items processed
2200 items processed
2250 items processed
2300 items processed
2350 items processed
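The loop above records which ids failed but discards the exception, so the reason for each failure (a deleted tweet, a protected account, and so on) is lost. The following is a minimal sketch of one way to keep that information. `fetch_all` and `fake_fetch` are hypothetical names introduced for illustration; in this notebook the callable passed in would be `api.get_status`.

```python
def fetch_all(tweet_ids, fetch):
    """Call fetch(tweet_id) for each id and return (results, failures).

    failures maps each id that raised an exception to the exception text.
    """
    results = []
    failures = {}
    for tweet_id in tweet_ids:
        try:
            results.append(fetch(tweet_id))
        except Exception as error:  # broad on purpose, mirroring the loop above
            failures[tweet_id] = str(error)
    return results, failures


def fake_fetch(tweet_id):
    """Hypothetical stand-in for api.get_status so the sketch runs offline."""
    if tweet_id == 2:
        raise ValueError("No status found with that ID.")
    return {"id": tweet_id}


results, failures = fetch_all([1, 2, 3], fake_fetch)
print(len(results), "fetched;", len(failures), "failed:", failures)
```

Keeping the failures in a dictionary makes it possible to check later whether the missing tweets share a common cause.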
# https://stackabuse.com/writing-files-using-python/
# create tweet_json.txt
with open('tweet_json.txt', 'w') as outfile: # open the output file for writing
    for tweet_json in tweets_list:
        json.dump(tweet_json, outfile) # write one tweet's JSON data per line
        outfile.write('\n') # newline separates the records in tweet_json.txt
#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object (for attributes)
# create a list of attributes
attr_list = []
# open the text file and read as JSON file
with open('tweet_json.txt', 'r') as json_file:
    # read the first line for the loop to start below
    line = json_file.readline()
    # loop over the lines, flagging retweets and extracting each tweet's link
    while line:
        json_data = json.loads(line)
        retweeted_status = json_data['retweeted_status'] = json_data.get('retweeted_status', 'original')
        if retweeted_status == 'original':
            # note: find() returns -1 when the text has no 'https', so the slice then keeps only the last character
            url = json_data['text'][json_data['text'].find('https'):]
        else:
            retweeted_status = 'retweet'
            url = 'retweet'
        # create a dictionary for the selected attributes
        attributes = {'tweet_id': str(json_data['id']),
                      'favorite_count': int(json_data['favorite_count']),
                      'retweet_count': int(json_data['retweet_count']),
                      'followers_count': int(json_data['user']['followers_count']),
                      'friends_count': int(json_data['user']['friends_count']),
                      'url': url,
                      'retweeted_status': str(retweeted_status)}
        # add above to our list
        attr_list.append(attributes)
        # read the next line of JSON data
        line = json_file.readline()
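A file object can also be iterated directly, one line at a time, which avoids the paired `readline()` calls used above. The sketch below shows that pattern on a throwaway file. `read_json_lines` is a hypothetical helper name, and the sample records are made up rather than real tweet data.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Return one parsed object per non-empty line of a JSON-lines file."""
    records = []
    with open(path, "r") as json_file:
        for line in json_file:  # iterating the file yields one line at a time
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo on a temporary file with made-up records (not real tweet data).
sample = [{"id": 1, "favorite_count": 10}, {"id": 2, "favorite_count": 20}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_json.txt")
    with open(demo_path, "w") as outfile:
        for record in sample:
            json.dump(record, outfile)
            outfile.write("\n")
    loaded = read_json_lines(demo_path)

print(loaded == sample)
```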
# convert the tweet JSON data dictionary list to a pandas DataFrame
api_data = pd.DataFrame(attr_list,
                        columns=['tweet_id',
                                 'favorite_count',
                                 'retweet_count',
                                 'followers_count',
                                 'friends_count',
                                 'url',
                                 'retweeted_status'])
#inspect
api_data.head()
| | tweet_id | favorite_count | retweet_count | followers_count | friends_count | url | retweeted_status |
---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 36357 | 7738 | 8767944 | 15 | https://t.co/MgUWQ76dJU | original |
1 | 892177421306343426 | 31325 | 5728 | 8767944 | 15 | https://t.co/aQFSeaCu9L | original |
2 | 891815181378084864 | 23596 | 3789 | 8767944 | 15 | https://t.co/r0YlrsGCgy | original |
3 | 891689557279858688 | 39665 | 7901 | 8767944 | 15 | https://t.co/tD36da7qLQ | original |
4 | 891327558926688256 | 37866 | 8523 | 8767944 | 15 | https://t.co/0g0KMIVXZ3 | original |
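The `url` column is built by slicing the tweet text from the position returned by `str.find('https')`. When the text contains no link, `find` returns -1 and the slice keeps only the last character of the text, which would produce a one-character `url` value. A minimal sketch of an alternative using a regular expression follows. `extract_url` is a hypothetical helper, and unlike the slice it returns only the first URL rather than everything up to the end of the text.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_url(text):
    """Return the first URL found in text, or None when there is none."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


with_link = "Can take selfies 11/10 https://t.co/ws2AMaNwPW"
without_link = "made-up text with no link"  # illustrative, not from the data

# The slicing approach: find() returns -1 here, so the slice keeps one character.
sliced = without_link[without_link.find("https"):]

print(repr(sliced))
print(extract_url(with_link))
print(extract_url(without_link))
```

Returning `None` for tweets without a link keeps missing values explicit, so they can be detected with the usual null checks later on.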
Now that I have gathered all three pieces of data, I will assess each of them visually and programmatically for quality and tidiness issues.
archive
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
5 | 891087950875897856 | NaN | NaN | 2017-07-29 00:08:17 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a majestic great white breaching ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891087950... | 13 | 10 | None | None | None | None | None |
6 | 890971913173991426 | NaN | NaN | 2017-07-28 16:27:12 +0000 | <a href="http://twitter.com/download/iphone" r... | Meet Jax. He enjoys ice cream so much he gets ... | NaN | NaN | NaN | https://gofundme.com/ydvmve-surgery-for-jax,ht... | 13 | 10 | Jax | None | None | None | None |
7 | 890729181411237888 | NaN | NaN | 2017-07-28 00:22:40 +0000 | <a href="http://twitter.com/download/iphone" r... | When you watch your owner call another dog a g... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/890729181... | 13 | 10 | None | None | None | None | None |
8 | 890609185150312448 | NaN | NaN | 2017-07-27 16:25:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Zoey. She doesn't want to be one of th... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/890609185... | 13 | 10 | Zoey | None | None | None | None |
9 | 890240255349198849 | NaN | NaN | 2017-07-26 15:59:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Cassie. She is a college pup. Studying... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/890240255... | 14 | 10 | Cassie | doggo | None | None | None |
10 | 890006608113172480 | NaN | NaN | 2017-07-26 00:31:25 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Koda. He is a South Australian decksha... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/890006608... | 13 | 10 | Koda | None | None | None | None |
11 | 889880896479866881 | NaN | NaN | 2017-07-25 16:11:53 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Bruno. He is a service shark. Only get... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/889880896... | 13 | 10 | Bruno | None | None | None | None |
12 | 889665388333682689 | NaN | NaN | 2017-07-25 01:55:32 +0000 | <a href="http://twitter.com/download/iphone" r... | Here's a puppo that seems to be on the fence a... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/889665388... | 13 | 10 | None | None | None | None | puppo |
13 | 889638837579907072 | NaN | NaN | 2017-07-25 00:10:02 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Ted. He does his best. Sometimes that'... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/889638837... | 12 | 10 | Ted | None | None | None | None |
14 | 889531135344209921 | NaN | NaN | 2017-07-24 17:02:04 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Stuart. He's sporting his favorite fan... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/889531135... | 13 | 10 | Stuart | None | None | None | puppo |
15 | 889278841981685760 | NaN | NaN | 2017-07-24 00:19:32 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Oliver. You're witnessing one of his m... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/889278841... | 13 | 10 | Oliver | None | None | None | None |
16 | 888917238123831296 | NaN | NaN | 2017-07-23 00:22:39 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Jim. He found a fren. Taught him how t... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/888917238... | 12 | 10 | Jim | None | None | None | None |
17 | 888804989199671297 | NaN | NaN | 2017-07-22 16:56:37 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Zeke. He has a new stick. Very proud o... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/888804989... | 13 | 10 | Zeke | None | None | None | None |
18 | 888554962724278272 | NaN | NaN | 2017-07-22 00:23:06 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Ralphus. He's powering up. Attempting ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/888554962... | 13 | 10 | Ralphus | None | None | None | None |
19 | 888202515573088257 | NaN | NaN | 2017-07-21 01:02:36 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: This is Canela. She attempted s... | 8.874740e+17 | 4.196984e+09 | 2017-07-19 00:47:34 +0000 | https://twitter.com/dog_rates/status/887473957... | 13 | 10 | Canela | None | None | None | None |
20 | 888078434458587136 | NaN | NaN | 2017-07-20 16:49:33 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Gerald. He was just told he didn't get... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/888078434... | 12 | 10 | Gerald | None | None | None | None |
21 | 887705289381826560 | NaN | NaN | 2017-07-19 16:06:48 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Jeffrey. He has a monopoly on the pool... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/887705289... | 13 | 10 | Jeffrey | None | None | None | None |
22 | 887517139158093824 | NaN | NaN | 2017-07-19 03:39:09 +0000 | <a href="http://twitter.com/download/iphone" r... | I've yet to rate a Venezuelan Hover Wiener. Th... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/887517139... | 14 | 10 | such | None | None | None | None |
23 | 887473957103951883 | NaN | NaN | 2017-07-19 00:47:34 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Canela. She attempted some fancy porch... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/887473957... | 13 | 10 | Canela | None | None | None | None |
24 | 887343217045368832 | NaN | NaN | 2017-07-18 16:08:03 +0000 | <a href="http://twitter.com/download/iphone" r... | You may not have known you needed to see this ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/887343217... | 13 | 10 | None | None | None | None | None |
25 | 887101392804085760 | NaN | NaN | 2017-07-18 00:07:08 +0000 | <a href="http://twitter.com/download/iphone" r... | This... is a Jubilant Antarctic House Bear. We... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/887101392... | 12 | 10 | None | None | None | None | None |
26 | 886983233522544640 | NaN | NaN | 2017-07-17 16:17:36 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Maya. She's very shy. Rarely leaves he... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/886983233... | 13 | 10 | Maya | None | None | None | None |
27 | 886736880519319552 | NaN | NaN | 2017-07-16 23:58:41 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Mingus. He's a wonderful father to his... | NaN | NaN | NaN | https://www.gofundme.com/mingusneedsus,https:/... | 13 | 10 | Mingus | None | None | None | None |
28 | 886680336477933568 | NaN | NaN | 2017-07-16 20:14:00 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Derek. He's late for a dog meeting. 13... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/886680336... | 13 | 10 | Derek | None | None | None | None |
29 | 886366144734445568 | NaN | NaN | 2017-07-15 23:25:31 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Roscoe. Another pupper fallen victim t... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/886366144... | 12 | 10 | Roscoe | None | None | pupper | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2326 | 666411507551481857 | NaN | NaN | 2015-11-17 00:24:19 +0000 | <a href="http://twitter.com/download/iphone" r... | This is quite the dog. Gets really excited whe... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666411507... | 2 | 10 | quite | None | None | None | None |
2327 | 666407126856765440 | NaN | NaN | 2015-11-17 00:06:54 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a southern Vesuvius bumblegruff. Can d... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666407126... | 7 | 10 | a | None | None | None | None |
2328 | 666396247373291520 | NaN | NaN | 2015-11-16 23:23:41 +0000 | <a href="http://twitter.com/download/iphone" r... | Oh goodness. A super rare northeast Qdoba kang... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666396247... | 9 | 10 | None | None | None | None | None |
2329 | 666373753744588802 | NaN | NaN | 2015-11-16 21:54:18 +0000 | <a href="http://twitter.com/download/iphone" r... | Those are sunglasses and a jean jacket. 11/10 ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666373753... | 11 | 10 | None | None | None | None | None |
2330 | 666362758909284353 | NaN | NaN | 2015-11-16 21:10:36 +0000 | <a href="http://twitter.com/download/iphone" r... | Unique dog here. Very small. Lives in containe... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666362758... | 6 | 10 | None | None | None | None | None |
2331 | 666353288456101888 | NaN | NaN | 2015-11-16 20:32:58 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a mixed Asiago from the Galápagos... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666353288... | 8 | 10 | None | None | None | None | None |
2332 | 666345417576210432 | NaN | NaN | 2015-11-16 20:01:42 +0000 | <a href="http://twitter.com/download/iphone" r... | Look at this jokester thinking seat belt laws ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666345417... | 10 | 10 | None | None | None | None | None |
2333 | 666337882303524864 | NaN | NaN | 2015-11-16 19:31:45 +0000 | <a href="http://twitter.com/download/iphone" r... | This is an extremely rare horned Parthenon. No... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666337882... | 9 | 10 | an | None | None | None | None |
2334 | 666293911632134144 | NaN | NaN | 2015-11-16 16:37:02 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a funny dog. Weird toes. Won't come do... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666293911... | 3 | 10 | a | None | None | None | None |
2335 | 666287406224695296 | NaN | NaN | 2015-11-16 16:11:11 +0000 | <a href="http://twitter.com/download/iphone" r... | This is an Albanian 3 1/2 legged Episcopalian... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666287406... | 1 | 2 | an | None | None | None | None |
2336 | 666273097616637952 | NaN | NaN | 2015-11-16 15:14:19 +0000 | <a href="http://twitter.com/download/iphone" r... | Can take selfies 11/10 https://t.co/ws2AMaNwPW | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666273097... | 11 | 10 | None | None | None | None | None |
2337 | 666268910803644416 | NaN | NaN | 2015-11-16 14:57:41 +0000 | <a href="http://twitter.com/download/iphone" r... | Very concerned about fellow dog trapped in com... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666268910... | 10 | 10 | None | None | None | None | None |
2338 | 666104133288665088 | NaN | NaN | 2015-11-16 04:02:55 +0000 | <a href="http://twitter.com/download/iphone" r... | Not familiar with this breed. No tail (weird).... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666104133... | 1 | 10 | None | None | None | None | None |
2339 | 666102155909144576 | NaN | NaN | 2015-11-16 03:55:04 +0000 | <a href="http://twitter.com/download/iphone" r... | Oh my. Here you are seeing an Adobe Setter giv... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666102155... | 11 | 10 | None | None | None | None | None |
2340 | 666099513787052032 | NaN | NaN | 2015-11-16 03:44:34 +0000 | <a href="http://twitter.com/download/iphone" r... | Can stand on stump for what seems like a while... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666099513... | 8 | 10 | None | None | None | None | None |
2341 | 666094000022159362 | NaN | NaN | 2015-11-16 03:22:39 +0000 | <a href="http://twitter.com/download/iphone" r... | This appears to be a Mongolian Presbyterian mi... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666094000... | 9 | 10 | None | None | None | None | None |
2342 | 666082916733198337 | NaN | NaN | 2015-11-16 02:38:37 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a well-established sunblockerspan... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666082916... | 6 | 10 | None | None | None | None | None |
2343 | 666073100786774016 | NaN | NaN | 2015-11-16 01:59:36 +0000 | <a href="http://twitter.com/download/iphone" r... | Let's hope this flight isn't Malaysian (lol). ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666073100... | 10 | 10 | None | None | None | None | None |
2344 | 666071193221509120 | NaN | NaN | 2015-11-16 01:52:02 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a northern speckled Rhododendron.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666071193... | 9 | 10 | None | None | None | None | None |
2345 | 666063827256086533 | NaN | NaN | 2015-11-16 01:22:45 +0000 | <a href="http://twitter.com/download/iphone" r... | This is the happiest dog you will ever see. Ve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666063827... | 10 | 10 | the | None | None | None | None |
2346 | 666058600524156928 | NaN | NaN | 2015-11-16 01:01:59 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is the Rand Paul of retrievers folks! He'... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666058600... | 8 | 10 | the | None | None | None | None |
2347 | 666057090499244032 | NaN | NaN | 2015-11-16 00:55:59 +0000 | <a href="http://twitter.com/download/iphone" r... | My oh my. This is a rare blond Canadian terrie... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666057090... | 9 | 10 | a | None | None | None | None |
2348 | 666055525042405380 | NaN | NaN | 2015-11-16 00:49:46 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a Siberian heavily armored polar bear ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666055525... | 10 | 10 | a | None | None | None | None |
2349 | 666051853826850816 | NaN | NaN | 2015-11-16 00:35:11 +0000 | <a href="http://twitter.com/download/iphone" r... | This is an odd dog. Hard on the outside but lo... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666051853... | 2 | 10 | an | None | None | None | None |
2350 | 666050758794694657 | NaN | NaN | 2015-11-16 00:30:50 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a truly beautiful English Wilson Staff... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666050758... | 10 | 10 | a | None | None | None | None |
2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | 5 | 10 | None | None | None | None | None |
2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | 6 | 10 | a | None | None | None | None |
2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | 9 | 10 | a | None | None | None | None |
2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | 7 | 10 | a | None | None | None | None |
2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | 8 | 10 | None | None | None | None | None |
2356 rows × 17 columns
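Several candidate issues are visible in the listing above: names such as `a`, `an` and `the`, a rating with a denominator of 2, and a row with a non-null `retweeted_status_id`. The sketch below shows how checks like these might be expressed programmatically. It runs on a toy frame whose values are copied from four of the displayed rows, not on the full archive, and the variable names are chosen for illustration only.

```python
import pandas as pd

# Toy frame with values copied from rows displayed above (not the full archive).
sample = pd.DataFrame({
    "tweet_id": [892420643555336193, 888202515573088257,
                 666407126856765440, 666287406224695296],
    "retweeted_status_id": [float("nan"), 8.874740e+17,
                            float("nan"), float("nan")],
    "rating_numerator": [13, 13, 7, 1],
    "rating_denominator": [10, 10, 10, 2],
    "name": ["Phineas", "Canela", "a", "an"],
})

# All-lowercase names are probably ordinary words rather than dog names.
suspect_names = sample.loc[sample["name"].str.islower(), "name"].tolist()

# Ratings are normally out of 10; anything else deserves a closer look.
odd_denominators = sample.loc[
    sample["rating_denominator"] != 10, "tweet_id"
].tolist()

# A non-null retweeted_status_id marks a retweet rather than an original tweet.
retweet_count = int(sample["retweeted_status_id"].notna().sum())

# tweet_id should identify each row uniquely.
duplicate_ids = int(sample["tweet_id"].duplicated().sum())

print(suspect_names, odd_denominators, retweet_count, duplicate_ids)
```

The same expressions should apply to `archive` itself, assuming its columns have the names shown in the table.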
api_data
| | tweet_id | favorite_count | retweet_count | followers_count | friends_count | url | retweeted_status |
---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 36357 | 7738 | 8767944 | 15 | https://t.co/MgUWQ76dJU | original |
1 | 892177421306343426 | 31325 | 5728 | 8767944 | 15 | https://t.co/aQFSeaCu9L | original |
2 | 891815181378084864 | 23596 | 3789 | 8767944 | 15 | https://t.co/r0YlrsGCgy | original |
3 | 891689557279858688 | 39665 | 7901 | 8767944 | 15 | https://t.co/tD36da7qLQ | original |
4 | 891327558926688256 | 37866 | 8523 | 8767944 | 15 | https://t.co/0g0KMIVXZ3 | original |
5 | 891087950875897856 | 19070 | 2852 | 8767944 | 15 | https://t.co/xx5cilW0Dd | original |
6 | 890971913173991426 | 11108 | 1858 | 8767944 | 15 | https://t.co/MV01Q820LT | original |
7 | 890729181411237888 | 61264 | 17266 | 8767944 | 15 | https://t.co/hrcFOGi12V | original |
8 | 890609185150312448 | 26248 | 3933 | 8767944 | 15 | https://t.co/UkrdQyoYxV | original |
9 | 890240255349198849 | 29985 | 6728 | 8767944 | 15 | https://t.co/l3TSS3o2M0 | original |
10 | 890006608113172480 | 28878 | 6719 | 8767944 | 15 | https://t.co/jy2alqaYPH | original |
11 | 889880896479866881 | 26239 | 4569 | 8767944 | 15 | https://t.co/u1XPQMl29g | original |
12 | 889665388333682689 | 45211 | 9164 | 8767944 | 15 | https://t.co/BxvuXk0UCm | original |
13 | 889638837579907072 | 25448 | 4114 | 8767944 | 15 | https://t.co/f8dEDcrKSR | original |
14 | 889531135344209921 | 14233 | 2063 | 8767944 | 15 | https://t.co/Ya3bdfYuCx | original |
15 | 889278841981685760 | 23712 | 4883 | 8767944 | 15 | https://t.co/qw5xRwBaIv | original |
16 | 888917238123831296 | 27406 | 4123 | 8767944 | 15 | https://t.co/chxruIOUJN | original |
17 | 888804989199671297 | 24076 | 3888 | 8767944 | 15 | https://t.co/L5cCoR2huQ | original |
18 | 888554962724278272 | 18597 | 3190 | 8767944 | 15 | https://t.co/YnYAFCTTiK | original |
19 | 888078434458587136 | 20501 | 3170 | 8767944 | 15 | https://t.co/UibDBnVpvR | original |
20 | 887705289381826560 | 28459 | 4932 | 8767944 | 15 | https://t.co/m7BR4JgDrZ | original |
21 | 887517139158093824 | 43667 | 10784 | 8767944 | 15 | https://t.co/20VrLAA8ba | original |
22 | 887473957103951883 | 64740 | 16494 | 8767944 | 15 | https://t.co/cLyzpcUcMX | original |
23 | 887343217045368832 | 31676 | 9585 | 8767944 | 15 | https://t.co/WZqNqygEyV | original |
24 | 887101392804085760 | 28825 | 5456 | 8767944 | 15 | https://t.co/UgW2xbO7Rg | original |
25 | 886983233522544640 | 32909 | 7025 | 8767944 | 15 | https://t.co/I6oNy0CgiT | original |
26 | 886736880519319552 | 11293 | 2945 | 8767944 | 15 | https://t.co/RipKYZ0T1u | original |
27 | 886680336477933568 | 21159 | 4090 | 8767944 | 15 | https://t.co/BCoWue0abA | original |
28 | 886366144734445568 | 19915 | 2900 | 8767944 | 15 | https://t.co/iIu3AF8IWH | original |
29 | 886267009285017600 | 113 | 4 | 8767944 | 15 | n | original |
... | ... | ... | ... | ... | ... | ... | ... |
2301 | 666411507551481857 | 414 | 302 | 8767948 | 15 | https://t.co/aMCTNWO94t | original |
2302 | 666407126856765440 | 100 | 32 | 8767948 | 15 | https://t.co/LopTBkKa8h | original |
2303 | 666396247373291520 | 159 | 76 | 8767948 | 15 | https://t.co/Dc7b0E8qFE | original |
2304 | 666373753744588802 | 174 | 85 | 8767948 | 15 | https://t.co/uHXrPkUEyl | original |
2305 | 666362758909284353 | 729 | 522 | 8767948 | 15 | https://t.co/XMD9CwjEnM | original |
2306 | 666353288456101888 | 200 | 66 | 8767948 | 15 | https://t.co/tltQ5w9aUO | original |
2307 | 666345417576210432 | 275 | 128 | 8767948 | 15 | https://t.co/VFKG1vxGjB | original |
2308 | 666337882303524864 | 182 | 84 | 8767948 | 15 | https://t.co/QpRjllzWAL | original |
2309 | 666293911632134144 | 469 | 322 | 8767948 | 15 | https://t.co/IIXis0zta0 | original |
2310 | 666287406224695296 | 138 | 62 | 8767948 | 15 | https://t.co/d9NcXFKwLv | original |
2311 | 666273097616637952 | 161 | 73 | 8767948 | 15 | https://t.co/ws2AMaNwPW | original |
2312 | 666268910803644416 | 96 | 32 | 8767948 | 15 | https://t.co/0yxApIikpk | original |
2313 | 666104133288665088 | 13711 | 6011 | 8767948 | 15 | https://t.co/Asgdc6kuLX | original |
2314 | 666102155909144576 | 72 | 11 | 8767948 | 15 | https://t.co/11LvqN4WLq | original |
2315 | 666099513787052032 | 143 | 61 | 8767948 | 15 | https://t.co/Ri4nMTLq5C | original |
2316 | 666094000022159362 | 154 | 68 | 8767948 | 15 | https://t.co/mnioXo3IfP | original |
2317 | 666082916733198337 | 103 | 42 | 8767948 | 15 | https://t.co/3RU6x0vHB7 | original |
2318 | 666073100786774016 | 298 | 147 | 8767948 | 15 | https://t.co/Yk6GHE9tOY | original |
2319 | 666071193221509120 | 136 | 54 | 8767948 | 15 | https://t.co/ZoL8kq2XFx | original |
2320 | 666063827256086533 | 450 | 199 | 8767948 | 15 | https://t.co/RhUEAloehK | original |
2321 | 666058600524156928 | 105 | 54 | 8767948 | 15 | https://t.co/pYAJkAe76p | original |
2322 | 666057090499244032 | 270 | 128 | 8767948 | 15 | https://t.co/yWBqbrzy8O | original |
2323 | 666055525042405380 | 411 | 222 | 8767948 | 15 | https://t.co/rdivxLiqEt | original |
2324 | 666051853826850816 | 1129 | 781 | 8767948 | 15 | https://t.co/v5A4vzSDdc | original |
2325 | 666050758794694657 | 124 | 51 | 8767948 | 15 | https://t.co/fvIbQfHjIe | original |
2326 | 666049248165822465 | 96 | 41 | 8767948 | 15 | https://t.co/4B7cOc1EDq | original |
2327 | 666044226329800704 | 273 | 132 | 8767948 | 15 | https://t.co/DWnyCjf2mx | original |
2328 | 666033412701032449 | 112 | 41 | 8767948 | 15 | https://t.co/y671yMhoiR | original |
2329 | 666029285002620928 | 121 | 42 | 8767948 | 15 | https://t.co/r7mOb2m0UI | original |
2330 | 666020888022790149 | 2424 | 465 | 8767948 | 15 | https://t.co/BLDqew2Ijj | original |
2331 rows × 7 columns
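`api_data` has 2331 rows while `archive` has 2356, so 25 archived tweets have no API data. The two tables also store `tweet_id` differently: `api_data` holds it as a string (it is built with `str(json_data['id'])`), while `pd.read_csv` most likely parsed the archive ids as integers. The sketch below uses made-up ids and hypothetical frame names to show one way to align the key type and list the ids missing from the API data.

```python
import pandas as pd

# Toy stand-ins for the two tables; the ids are made up for illustration.
archive_sample = pd.DataFrame({"tweet_id": [101, 102, 103]})  # integer ids
api_sample = pd.DataFrame({"tweet_id": ["101", "103"],        # string ids
                           "favorite_count": [5, 7]})

# Align the key type on a copy so the original frame is left untouched.
left = archive_sample.assign(tweet_id=archive_sample["tweet_id"].astype(str))

# indicator=True adds a _merge column saying where each row was found.
merged = left.merge(api_sample, on="tweet_id", how="left", indicator=True)
missing_ids = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()

print(missing_ids)
```

A left merge keeps every archive row, so the rows marked `left_only` are exactly the tweets that could not be retrieved.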
images
| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
5 | 666050758794694657 | https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg | 1 | Bernese_mountain_dog | 0.651137 | True | English_springer | 0.263788 | True | Greater_Swiss_Mountain_dog | 0.016199 | True |
6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 0.045885 | False | terrapin | 0.017885 | False |
7 | 666055525042405380 | https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg | 1 | chow | 0.692517 | True | Tibetan_mastiff | 0.058279 | True | fur_coat | 0.054449 | False |
8 | 666057090499244032 | https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg | 1 | shopping_cart | 0.962465 | False | shopping_basket | 0.014594 | False | golden_retriever | 0.007959 | True |
9 | 666058600524156928 | https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg | 1 | miniature_poodle | 0.201493 | True | komondor | 0.192305 | True | soft-coated_wheaten_terrier | 0.082086 | True |
10 | 666063827256086533 | https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg | 1 | golden_retriever | 0.775930 | True | Tibetan_mastiff | 0.093718 | True | Labrador_retriever | 0.072427 | True |
11 | 666071193221509120 | https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg | 1 | Gordon_setter | 0.503672 | True | Yorkshire_terrier | 0.174201 | True | Pekinese | 0.109454 | True |
12 | 666073100786774016 | https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg | 1 | Walker_hound | 0.260857 | True | English_foxhound | 0.175382 | True | Ibizan_hound | 0.097471 | True |
13 | 666082916733198337 | https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg | 1 | pug | 0.489814 | True | bull_mastiff | 0.404722 | True | French_bulldog | 0.048960 | True |
14 | 666094000022159362 | https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg | 1 | bloodhound | 0.195217 | True | German_shepherd | 0.078260 | True | malinois | 0.075628 | True |
15 | 666099513787052032 | https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg | 1 | Lhasa | 0.582330 | True | Shih-Tzu | 0.166192 | True | Dandie_Dinmont | 0.089688 | True |
16 | 666102155909144576 | https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg | 1 | English_setter | 0.298617 | True | Newfoundland | 0.149842 | True | borzoi | 0.133649 | True |
17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 0.033919 | False | partridge | 0.000052 | False |
18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 0.085547 | False | bookcase | 0.079480 | False |
19 | 666273097616637952 | https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg | 1 | Italian_greyhound | 0.176053 | True | toy_terrier | 0.111884 | True | basenji | 0.111152 | True |
20 | 666287406224695296 | https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg | 1 | Maltese_dog | 0.857531 | True | toy_poodle | 0.063064 | True | miniature_poodle | 0.025581 | True |
21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 0.015250 | False | great_grey_owl | 0.013207 | False |
22 | 666337882303524864 | https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg | 1 | ox | 0.416669 | False | Newfoundland | 0.278407 | True | groenendael | 0.102643 | True |
23 | 666345417576210432 | https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg | 1 | golden_retriever | 0.858744 | True | Chesapeake_Bay_retriever | 0.054787 | True | Labrador_retriever | 0.014241 | True |
24 | 666353288456101888 | https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg | 1 | malamute | 0.336874 | True | Siberian_husky | 0.147655 | True | Eskimo_dog | 0.093412 | True |
25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 0.002402 | False | hamster | 0.000461 | False |
26 | 666373753744588802 | https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg | 1 | soft-coated_wheaten_terrier | 0.326467 | True | Afghan_hound | 0.259551 | True | briard | 0.206803 | True |
27 | 666396247373291520 | https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg | 1 | Chihuahua | 0.978108 | True | toy_terrier | 0.009397 | True | papillon | 0.004577 | True |
28 | 666407126856765440 | https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg | 1 | black-and-tan_coonhound | 0.529139 | True | bloodhound | 0.244220 | True | flat-coated_retriever | 0.173810 | True |
29 | 666411507551481857 | https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg | 1 | coho | 0.404640 | False | barracouta | 0.271485 | False | gar | 0.189945 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2045 | 886366144734445568 | https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg | 1 | French_bulldog | 0.999201 | True | Chihuahua | 0.000361 | True | Boston_bull | 0.000076 | True |
2046 | 886680336477933568 | https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg | 1 | convertible | 0.738995 | False | sports_car | 0.139952 | False | car_wheel | 0.044173 | False |
2047 | 886736880519319552 | https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg | 1 | kuvasz | 0.309706 | True | Great_Pyrenees | 0.186136 | True | Dandie_Dinmont | 0.086346 | True |
2048 | 886983233522544640 | https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg | 2 | Chihuahua | 0.793469 | True | toy_terrier | 0.143528 | True | can_opener | 0.032253 | False |
2049 | 887101392804085760 | https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg | 1 | Samoyed | 0.733942 | True | Eskimo_dog | 0.035029 | True | Staffordshire_bullterrier | 0.029705 | True |
2050 | 887343217045368832 | https://pbs.twimg.com/ext_tw_video_thumb/88734... | 1 | Mexican_hairless | 0.330741 | True | sea_lion | 0.275645 | False | Weimaraner | 0.134203 | True |
2051 | 887473957103951883 | https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg | 2 | Pembroke | 0.809197 | True | Rhodesian_ridgeback | 0.054950 | True | beagle | 0.038915 | True |
2052 | 887517139158093824 | https://pbs.twimg.com/ext_tw_video_thumb/88751... | 1 | limousine | 0.130432 | False | tow_truck | 0.029175 | False | shopping_cart | 0.026321 | False |
2053 | 887705289381826560 | https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg | 1 | basset | 0.821664 | True | redbone | 0.087582 | True | Weimaraner | 0.026236 | True |
2054 | 888078434458587136 | https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg | 1 | French_bulldog | 0.995026 | True | pug | 0.000932 | True | bull_mastiff | 0.000903 | True |
2055 | 888202515573088257 | https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg | 2 | Pembroke | 0.809197 | True | Rhodesian_ridgeback | 0.054950 | True | beagle | 0.038915 | True |
2056 | 888554962724278272 | https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg | 3 | Siberian_husky | 0.700377 | True | Eskimo_dog | 0.166511 | True | malamute | 0.111411 | True |
2057 | 888804989199671297 | https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg | 1 | golden_retriever | 0.469760 | True | Labrador_retriever | 0.184172 | True | English_setter | 0.073482 | True |
2058 | 888917238123831296 | https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg | 1 | golden_retriever | 0.714719 | True | Tibetan_mastiff | 0.120184 | True | Labrador_retriever | 0.105506 | True |
2059 | 889278841981685760 | https://pbs.twimg.com/ext_tw_video_thumb/88927... | 1 | whippet | 0.626152 | True | borzoi | 0.194742 | True | Saluki | 0.027351 | True |
2060 | 889531135344209921 | https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg | 1 | golden_retriever | 0.953442 | True | Labrador_retriever | 0.013834 | True | redbone | 0.007958 | True |
2061 | 889638837579907072 | https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg | 1 | French_bulldog | 0.991650 | True | boxer | 0.002129 | True | Staffordshire_bullterrier | 0.001498 | True |
2062 | 889665388333682689 | https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg | 1 | Pembroke | 0.966327 | True | Cardigan | 0.027356 | True | basenji | 0.004633 | True |
2063 | 889880896479866881 | https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg | 1 | French_bulldog | 0.377417 | True | Labrador_retriever | 0.151317 | True | muzzle | 0.082981 | False |
2064 | 890006608113172480 | https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg | 1 | Samoyed | 0.957979 | True | Pomeranian | 0.013884 | True | chow | 0.008167 | True |
2065 | 890240255349198849 | https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg | 1 | Pembroke | 0.511319 | True | Cardigan | 0.451038 | True | Chihuahua | 0.029248 | True |
2066 | 890609185150312448 | https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg | 1 | Irish_terrier | 0.487574 | True | Irish_setter | 0.193054 | True | Chesapeake_Bay_retriever | 0.118184 | True |
2067 | 890729181411237888 | https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg | 2 | Pomeranian | 0.566142 | True | Eskimo_dog | 0.178406 | True | Pembroke | 0.076507 | True |
2068 | 890971913173991426 | https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg | 1 | Appenzeller | 0.341703 | True | Border_collie | 0.199287 | True | ice_lolly | 0.193548 | False |
2069 | 891087950875897856 | https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg | 1 | Chesapeake_Bay_retriever | 0.425595 | True | Irish_terrier | 0.116317 | True | Indian_elephant | 0.076902 | False |
2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
2075 rows × 12 columns
archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2356 entries, 0 to 2355 Data columns (total 17 columns): tweet_id 2356 non-null int64 in_reply_to_status_id 78 non-null float64 in_reply_to_user_id 78 non-null float64 timestamp 2356 non-null object source 2356 non-null object text 2356 non-null object retweeted_status_id 181 non-null float64 retweeted_status_user_id 181 non-null float64 retweeted_status_timestamp 181 non-null object expanded_urls 2297 non-null object rating_numerator 2356 non-null int64 rating_denominator 2356 non-null int64 name 2356 non-null object doggo 2356 non-null object floofer 2356 non-null object pupper 2356 non-null object puppo 2356 non-null object dtypes: float64(4), int64(3), object(10) memory usage: 411.3+ KB
sum(archive['tweet_id'].duplicated())
0
# check timestamp column type
type(archive.timestamp[0])
str
Issues
# check the ratings
archive.rating_numerator.value_counts()
12 558 11 464 10 461 13 351 9 158 8 102 7 55 14 54 5 37 6 32 3 19 4 17 1 9 2 9 420 2 0 2 15 2 75 2 80 1 20 1 24 1 26 1 44 1 50 1 60 1 165 1 84 1 88 1 144 1 182 1 143 1 666 1 960 1 1776 1 17 1 27 1 45 1 99 1 121 1 204 1 Name: rating_numerator, dtype: int64
The counts above show a long tail of unusual numerators: 1776 is the highest, and values such as 960, 666, and 420 also look suspicious.
#inspect highest rating
archive.query("rating_numerator == 1776")
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
979 | 749981277374128128 | NaN | NaN | 2016-07-04 15:00:45 +0000 | <a href="https://about.twitter.com/products/tw... | This is Atticus. He's quite simply America af.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/749981277... | 1776 | 10 | Atticus | None | None | None | None |
# second highest rating
archive.query("rating_numerator == 960")
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
313 | 835246439529840640 | 8.352460e+17 | 26259576.0 | 2017-02-24 21:54:03 +0000 | <a href="http://twitter.com/download/iphone" r... | @jonnysun @Lin_Manuel ok jomny I know you're e... | NaN | NaN | NaN | NaN | 960 | 0 | None | None | None | None | None |
The row above is a reply (in_reply_to_status_id is set) and its denominator is 0. Ratings vary widely and are not consistent throughout, which suggests using 10 as a common denominator.
# check for ratings that were extracted incorrectly
# look for decimal ratings in the text (e.g. 6.66/10), which integer extraction would misread as a value like 666
# no capture group is needed for str.contains, so the pattern is written without parentheses
archive[archive['text'].str.contains(r'\d+\.\d+\/\d+')]
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | 883482846933004288 | NaN | NaN | 2017-07-08 00:28:19 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Bella. She hopes her smile made you sm... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/883482846... | 5 | 10 | Bella | None | None | None | None |
340 | 832215909146226688 | NaN | NaN | 2017-02-16 13:11:49 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: This is Logan, the Chow who liv... | 7.867091e+17 | 4.196984e+09 | 2016-10-13 23:23:56 +0000 | https://twitter.com/dog_rates/status/786709082... | 75 | 10 | Logan | None | None | None | None |
695 | 786709082849828864 | NaN | NaN | 2016-10-13 23:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Logan, the Chow who lived. He solemnly... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/786709082... | 75 | 10 | Logan | None | None | None | None |
763 | 778027034220126208 | NaN | NaN | 2016-09-20 00:24:34 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Sophie. She's a Jubilant Bush Pupper. ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/778027034... | 27 | 10 | Sophie | None | None | pupper | None |
1689 | 681340665377193984 | 6.813394e+17 | 4.196984e+09 | 2015-12-28 05:07:27 +0000 | <a href="http://twitter.com/download/iphone" r... | I've been told there's a slight possibility he... | NaN | NaN | NaN | NaN | 5 | 10 | None | None | None | None | None |
1712 | 680494726643068929 | NaN | NaN | 2015-12-25 21:06:00 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have uncovered an entire battalion of ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/680494726... | 26 | 10 | None | None | None | None | None |
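The rows above suggest that decimal numerators in the text were not captured as written. One way to re-extract them, shown here as a minimal sketch on made-up tweets (the `tweets` frame and the `pattern` name are illustrative assumptions, not part of the notebook), is `str.extract` with named groups that allow an optional decimal part:

```python
import pandas as pd

# Toy tweets, invented for illustration (not taken from the archive)
tweets = pd.DataFrame({
    "text": [
        "This is a very good dog. 13.5/10 would smile back",
        "Good dog. 12/10",
        "Solemnly swears he is up to lots of good. 9.75/10",
    ]
})

# Numerator may carry a decimal part; denominator is an integer
pattern = r"(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+)"

# str.extract returns the first match per row, one column per named group
ratings = tweets["text"].str.extract(pattern).astype(float)
print(ratings)
```

Because `str.extract` keeps only the first match, a tweet with several fractions still needs a manual check.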
Issues
# check the rating_denominator column
archive.rating_denominator.value_counts()
10 2333 11 3 50 3 80 2 20 2 2 1 16 1 40 1 70 1 15 1 90 1 110 1 120 1 130 1 150 1 170 1 7 1 0 1 Name: rating_denominator, dtype: int64
# select the tweet where the denominator is 0 and view its text
archive.query("rating_denominator == 0")['text']
313 @jonnysun @Lin_Manuel ok jomny I know you're e... Name: text, dtype: object
archive.query("rating_denominator == 7")['text']
516 Meet Sam. She smiles 24/7 & secretly aspir... Name: text, dtype: object
archive.query("rating_denominator == 170")['text']
1120 Say hello to this unbelievably well behaved sq... Name: text, dtype: object
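Odd denominators such as 7 can come from fractions in the text that are not ratings at all (the "24/7" above). A rough way to inspect this is to list every fraction per tweet and then choose one. The sketch below uses invented tweets, and the `pick_rating` helper with its "prefer the last fraction over 10" rule is only an assumed heuristic, not the notebook's method; group ratings with larger denominators would need a different rule.

```python
import pandas as pd

# Toy tweets, invented for illustration
tweets = pd.Series([
    "Meet Sam. She smiles 24/7 and is a solid 11/10",
    "Happy 4/20 from the squad! 13/10 for all",
    "Just a good boy. 12/10",
])

# findall returns every (numerator, denominator) pair found in each tweet
pairs = tweets.str.findall(r"(\d+(?:\.\d+)?)/(\d+)")

def pick_rating(candidates):
    """Hypothetical heuristic: keep the last pair whose denominator is 10."""
    over_ten = [(float(n), int(d)) for n, d in candidates if int(d) == 10]
    return over_ten[-1] if over_ten else None

chosen = [pick_rating(c) for c in pairs]
print(chosen)
```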
Issues
# check the name column
archive.name.value_counts()
None 745 a 55 Charlie 12 Lucy 11 Oliver 11 Cooper 11 Tucker 10 Lola 10 Penny 10 Bo 9 Winston 9 the 8 Sadie 8 Daisy 7 Bailey 7 Toby 7 an 7 Buddy 7 Leo 6 Stanley 6 Bella 6 Jack 6 Milo 6 Koda 6 Jax 6 Scout 6 Dave 6 Rusty 6 Oscar 6 Alfie 5 ... Tilly 1 Iggy 1 Maks 1 Vince 1 Monster 1 Damon 1 Chaz 1 Acro 1 Beya 1 Sunshine 1 Erik 1 Marq 1 Aqua 1 BeBe 1 Timmy 1 Teddy 1 Charl 1 Jimbo 1 Schnozz 1 Goliath 1 Carter 1 Claude 1 Mack 1 Andy 1 Grizz 1 Anthony 1 Kuyu 1 Skye 1 Perry 1 Jerome 1 Name: name, Length: 957, dtype: int64
The counts above include many questionable "names" such as "a" and "the", which are not names and signal an extraction error.
# check names that aren't capitalized b/c names should be capitalized
# use regex where names start with lower-case letters
non_capital_names = archive.name.str.contains(pat='^[a-z]', regex = True)
archive[non_capital_names].name.value_counts()
a 55 the 8 an 7 very 5 just 4 one 4 quite 4 not 2 getting 2 actually 2 mad 2 his 1 unacceptable 1 by 1 space 1 light 1 all 1 my 1 old 1 such 1 life 1 incredibly 1 this 1 officially 1 infuriating 1 Name: name, dtype: int64
Issues
images.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2075 entries, 0 to 2074 Data columns (total 12 columns): tweet_id 2075 non-null int64 jpg_url 2075 non-null object img_num 2075 non-null int64 p1 2075 non-null object p1_conf 2075 non-null float64 p1_dog 2075 non-null bool p2 2075 non-null object p2_conf 2075 non-null float64 p2_dog 2075 non-null bool p3 2075 non-null object p3_conf 2075 non-null float64 p3_dog 2075 non-null bool dtypes: bool(3), float64(3), int64(2), object(4) memory usage: 152.1+ KB
Issues
# evaluate a sample of the image predictions
images.sample(5)
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1899 | 851224888060895234 | https://pbs.twimg.com/media/C9AohFoWsAUmxDs.jpg | 3 | car_mirror | 0.971512 | False | seat_belt | 0.007063 | False | standard_poodle | 0.005683 | True |
1199 | 740711788199743490 | https://pbs.twimg.com/media/CkeJcNkXEAAcrks.jpg | 1 | toy_poodle | 0.388277 | True | Angora | 0.180264 | False | Persian_cat | 0.049656 | False |
381 | 673317986296586240 | https://pbs.twimg.com/media/CVgbIobUYAEaeI3.jpg | 2 | miniature_pinscher | 0.384099 | True | bloodhound | 0.079923 | True | Rottweiler | 0.068594 | True |
15 | 666099513787052032 | https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg | 1 | Lhasa | 0.582330 | True | Shih-Tzu | 0.166192 | True | Dandie_Dinmont | 0.089688 | True |
131 | 668297328638447616 | https://pbs.twimg.com/media/CUZE4IWW4AAZmDf.jpg | 1 | king_penguin | 0.606747 | False | ice_bear | 0.264221 | False | Eskimo_dog | 0.032784 | True |
# check for duplicates of url
images.jpg_url.duplicated().value_counts()
False 2009 True 66 Name: jpg_url, dtype: int64
Issues
# inspect what we generated from Twitter API
api_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2331 entries, 0 to 2330 Data columns (total 7 columns): tweet_id 2331 non-null object favorite_count 2331 non-null int64 retweet_count 2331 non-null int64 followers_count 2331 non-null int64 friends_count 2331 non-null int64 url 2331 non-null object retweeted_status 2331 non-null object dtypes: int64(4), object(3) memory usage: 127.6+ KB
# inspect a sample
api_data.sample(5)
tweet_id | favorite_count | retweet_count | followers_count | friends_count | url | retweeted_status | |
---|---|---|---|---|---|---|---|
2177 | 668643542311546881 | 847 | 508 | 8767948 | 15 | https://t.co/IiSdmhkC5N | original |
738 | 778624900596654080 | 4724 | 1029 | 8767947 | 15 | https://t.co/0rRxyBQt32 | original |
1133 | 724405726123311104 | 5260 | 1641 | 8767945 | 15 | https://t.co/nRUaKWnJfH | original |
1784 | 676864501615042560 | 2079 | 699 | 8767947 | 15 | https://t.co/NJj2uP0atK | original |
1369 | 700029284593901568 | 2078 | 583 | 8767945 | 15 | https://t.co/8Nuny4lLE3 | original |
# count for retweets and original tweets
api_data.retweeted_status.value_counts()
original 2168 retweet 163 Name: retweeted_status, dtype: int64
# copy all original data frames
clean_archive = archive.copy()
clean_images = images.copy()
clean_api = api_data.copy()
The given Twitter Archive df contains 181 retweets. Remove rows that are retweets.
sum(clean_archive.retweeted_status_id.notnull())
181
clean_archive = clean_archive[clean_archive.retweeted_status_id.isna()]
sum(clean_archive.retweeted_status_id.notnull())
0
# we can remove retweet columns from the df
clean_archive = clean_archive.drop(['retweeted_status_id',
'retweeted_status_user_id',
'retweeted_status_timestamp'], axis = 1)
The given Twitter Archive df contains 78 replies. Remove rows that are replies.
sum(clean_archive.in_reply_to_status_id.notnull())
78
clean_archive = clean_archive[clean_archive.in_reply_to_status_id.isna()]
sum(clean_archive.in_reply_to_status_id.notnull())
0
# we can also remove replies columns from the df
clean_archive = clean_archive.drop(['in_reply_to_status_id',
'in_reply_to_user_id'], axis = 1)
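The two filtering steps above can also be expressed with a single boolean mask. This is a compact sketch on a toy frame (the `toy` data and the `is_original` name are assumptions for illustration); it keeps only rows that are neither retweets nor replies and then drops the helper columns:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the archive (values invented)
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, 7.0, np.nan, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, 9.0, np.nan],
    "text": ["original", "RT @x: copy", "@y reply", "original too"],
})

# A row is an original tweet when both id columns are missing
is_original = toy["retweeted_status_id"].isna() & toy["in_reply_to_status_id"].isna()

originals = toy.loc[is_original].drop(
    columns=["retweeted_status_id", "in_reply_to_status_id"]
)
print(originals)
```

The counts reported in this notebook are consistent with this logic: 2356 - 181 - 78 = 2097 rows remain.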
The timestamp column data type is string. Change it to datetime data type.
clean_archive['timestamp'] = pd.to_datetime(clean_archive.timestamp)
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 12 columns): tweet_id 2097 non-null int64 timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null int64 rating_denominator 2097 non-null int64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object dtypes: datetime64[ns](1), int64(3), object(8) memory usage: 213.0+ KB
The tweet_id column data type is integer. Change it to string data type.
clean_archive['tweet_id'] = clean_archive['tweet_id'].apply(str)
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 12 columns): tweet_id 2097 non-null object timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null int64 rating_denominator 2097 non-null int64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object dtypes: datetime64[ns](1), int64(2), object(9) memory usage: 213.0+ KB
type(clean_archive.tweet_id[0])
str
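One reason to store tweet IDs as strings is that they are identifiers rather than quantities, and they are too large to survive a trip through float64 (note the 8.352460e+17 shown for in_reply_to_status_id earlier). A small check using the first tweet_id from the archive illustrates the precision loss:

```python
tweet_id = 892420643555336193  # first tweet_id shown in the archive

# float64 has a 53-bit significand, so integers above 2**53 may not round-trip
as_float = float(tweet_id)
round_trip = int(as_float)

print(tweet_id > 2**53)        # True
print(round_trip == tweet_id)  # False: the float conversion changed the id
print(str(tweet_id))           # the string form keeps every digit
```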
Change rating_numerator column data type to float.
clean_archive['rating_numerator'] = clean_archive['rating_numerator'].astype(float)
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 12 columns): tweet_id 2097 non-null object timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null float64 rating_denominator 2097 non-null int64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object dtypes: datetime64[ns](1), float64(1), int64(1), object(9) memory usage: 293.0+ KB
As the common denominator should be 10, change all denominators that aren't equal to 10.
# every denominator becomes 10, so assign the constant directly
# note: the numerators are left unchanged by this step
clean_archive["rating_denominator"] = 10
clean_archive.query("rating_denominator!=10")
tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
---|---|---|---|---|---|---|---|---|---|---|---|
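Setting every denominator to 10 leaves the numerators on their original scale. An alternative, sketched below under the assumption that a rating such as 84/70 should be read as 12/10, is to rescale the numerator at the same time. The pairs in `toy` are made up for illustration, and rows with a denominator of 0 would have to be handled before dividing.

```python
import pandas as pd

# Made-up numerator/denominator pairs for illustration
toy = pd.DataFrame({
    "rating_numerator": [12.0, 84.0, 165.0],
    "rating_denominator": [10, 70, 150],
})

# Rescale each numerator to a denominator of 10, then fix the denominator
toy["rating_numerator"] = toy["rating_numerator"] * 10 / toy["rating_denominator"]
toy["rating_denominator"] = 10
print(toy)
```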
To keep consistency with rating_numerator, change rating_denominator to float.
clean_archive['rating_denominator'] = clean_archive['rating_denominator'].astype(float)
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 12 columns): tweet_id 2097 non-null object timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null float64 rating_denominator 2097 non-null float64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object dtypes: datetime64[ns](1), float64(2), object(9) memory usage: 293.0+ KB
Replace all non-names in the name column with string 'None'.
clean_archive.name.value_counts()
None 603 a 55 Lucy 11 Charlie 11 Oliver 10 Cooper 10 Penny 9 Tucker 9 Winston 8 the 8 Lola 8 Sadie 8 Daisy 7 Toby 7 Koda 6 Bella 6 an 6 Jax 6 Bo 6 Bailey 6 Oscar 6 Stanley 6 Chester 5 Bentley 5 Milo 5 Dave 5 Rusty 5 Louis 5 Scout 5 Leo 5 ... Ralpher 1 Cheesy 1 Duchess 1 Brutus 1 Beya 1 Jimbo 1 Ralphé 1 Charl 1 Leonard 1 Obie 1 Jennifur 1 Pinot 1 Chelsea 1 Arlen 1 Brooks 1 Maxwell 1 Diogi 1 Emma 1 Buckley 1 Barclay 1 Jeb 1 Storkson 1 Sunshine 1 Erik 1 Marq 1 Aqua 1 BeBe 1 Robin 1 Teddy 1 Jerome 1 Name: name, Length: 955, dtype: int64
# use regex where names start with lower-case letters
mask = clean_archive.name.str.contains(pat='^[a-z]', regex = True)
clean_archive[mask].name.value_counts().sort_index()
a 55 actually 2 all 1 an 6 by 1 getting 2 his 1 incredibly 1 infuriating 1 just 3 life 1 light 1 mad 1 my 1 not 2 officially 1 old 1 one 4 quite 3 space 1 such 1 the 8 this 1 unacceptable 1 very 4 Name: name, dtype: int64
# replace non-names in the name column with string 'None'
clean_archive.loc[mask, 'name'] = "None"
sum(clean_archive.name.str.contains(pat='^[a-z]', regex = True))
0
Create one column for dog stages and have them (doggo,floofer,pupper,puppo) as values instead.
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 12 columns): tweet_id 2097 non-null object timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null float64 rating_denominator 2097 non-null float64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object dtypes: datetime64[ns](1), float64(2), object(9) memory usage: 293.0+ KB
clean_archive.doggo.value_counts()
None 2014 doggo 83 Name: doggo, dtype: int64
clean_archive.floofer.value_counts()
None 2087 floofer 10 Name: floofer, dtype: int64
clean_archive.pupper.value_counts()
None 1867 pupper 230 Name: pupper, dtype: int64
clean_archive.puppo.value_counts()
None 2073 puppo 24 Name: puppo, dtype: int64
# convert the 'None' placeholders in the stage columns to empty strings
# assign the result back; replace(..., inplace=True) on a selected column may not update the frame in newer pandas
for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    clean_archive[col] = clean_archive[col].replace('None', '')
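# build a single stage column from the first stage word found in the tweet text
# (case-sensitive; this reads the text directly rather than merging the 4 columns)
clean_archive['stage'] = clean_archive.text.str.extract('(doggo|floofer|pupper|puppo)', expand = False)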
# check
clean_archive.stage.value_counts()
pupper 240 doggo 80 puppo 29 floofer 4 Name: stage, dtype: int64
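These counts differ from the per-column counts above (83 doggo, 10 floofer, 230 pupper, 24 puppo) because the stage is read from the text here, and only the first lowercase match is kept. If the goal is instead to merge the four existing columns, one possible approach is sketched below on a toy frame. The `combine` helper and the comma-joined label for tweets with two stages are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy frame using the archive's 'None' placeholder convention (values invented)
toy = pd.DataFrame({
    "doggo":   ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "pupper", "pupper", "None"],
    "puppo":   ["None", "None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def combine(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[c] for c in stage_cols if row[c] != "None"]
    return ",".join(found) if found else None

toy["stage"] = toy[stage_cols].apply(combine, axis=1)
print(toy["stage"])
```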
#double check
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 13 columns): tweet_id 2097 non-null object timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null float64 rating_denominator 2097 non-null float64 name 2097 non-null object doggo 2097 non-null object floofer 2097 non-null object pupper 2097 non-null object puppo 2097 non-null object stage 353 non-null object dtypes: datetime64[ns](1), float64(2), object(10) memory usage: 309.4+ KB
# now drop the 4 columns as they are no longer needed
clean_archive.drop(['doggo','floofer','pupper','puppo'], axis=1, inplace = True)
# change datatype from object to category
clean_archive['stage'] = clean_archive['stage'].astype('category')
clean_archive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 9 columns): tweet_id 2097 non-null object timestamp 2097 non-null datetime64[ns] source 2097 non-null object text 2097 non-null object expanded_urls 2094 non-null object rating_numerator 2097 non-null float64 rating_denominator 2097 non-null float64 name 2097 non-null object stage 353 non-null category dtypes: category(1), datetime64[ns](1), float64(2), object(5) memory usage: 229.7+ KB
Dataframe contains 66 duplicated jpg_urls which need to be removed.
clean_images = clean_images.drop_duplicates(subset=['jpg_url'], keep='last')
sum(clean_images.jpg_url.duplicated())
0
The tweet_id column data type is integer. Change it to string data type to keep it consistent with other data frames.
clean_images['tweet_id'] = clean_images['tweet_id'].apply(str)
clean_images.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2009 entries, 0 to 2074 Data columns (total 12 columns): tweet_id 2009 non-null object jpg_url 2009 non-null object img_num 2009 non-null int64 p1 2009 non-null object p1_conf 2009 non-null float64 p1_dog 2009 non-null bool p2 2009 non-null object p2_conf 2009 non-null float64 p2_dog 2009 non-null bool p3 2009 non-null object p3_conf 2009 non-null float64 p3_dog 2009 non-null bool dtypes: bool(3), float64(3), int64(1), object(5) memory usage: 162.8+ KB
type(clean_images.tweet_id[0])
str
Clean up by having 1 column each for image prediction and confidence level, rather than having them in 3 different columns.
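One possible way to collapse the three prediction slots, sketched here under the assumption that the wanted value is the first prediction flagged as a dog, is `np.select` over the `p1_dog`, `p2_dog`, and `p3_dog` flags. The `toy` frame is loosely based on rows 6 to 8 of the table with rounded confidences, and the `prediction` and `confidence` column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column layout as the images table
toy = pd.DataFrame({
    "p1": ["box_turtle", "chow", "shopping_cart"],
    "p1_conf": [0.93, 0.69, 0.96],
    "p1_dog": [False, True, False],
    "p2": ["mud_turtle", "Tibetan_mastiff", "shopping_basket"],
    "p2_conf": [0.05, 0.06, 0.01],
    "p2_dog": [False, True, False],
    "p3": ["terrapin", "fur_coat", "golden_retriever"],
    "p3_conf": [0.02, 0.05, 0.008],
    "p3_dog": [False, False, True],
})

# np.select picks the choice belonging to the first condition that is True
conditions = [toy["p1_dog"], toy["p2_dog"], toy["p3_dog"]]
toy["prediction"] = np.select(
    conditions, [toy["p1"], toy["p2"], toy["p3"]], default=None
)
toy["confidence"] = np.select(
    conditions, [toy["p1_conf"], toy["p2_conf"], toy["p3_conf"]], default=np.nan
)
print(toy[["prediction", "confidence"]])
```

Rows where no slot is a dog end up with a missing prediction and confidence, so they can be dropped or kept depending on the analysis.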
clean_images
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
5 | 666050758794694657 | https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg | 1 | Bernese_mountain_dog | 0.651137 | True | English_springer | 0.263788 | True | Greater_Swiss_Mountain_dog | 0.016199 | True |
6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 0.045885 | False | terrapin | 0.017885 | False |
7 | 666055525042405380 | https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg | 1 | chow | 0.692517 | True | Tibetan_mastiff | 0.058279 | True | fur_coat | 0.054449 | False |
8 | 666057090499244032 | https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg | 1 | shopping_cart | 0.962465 | False | shopping_basket | 0.014594 | False | golden_retriever | 0.007959 | True |
9 | 666058600524156928 | https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg | 1 | miniature_poodle | 0.201493 | True | komondor | 0.192305 | True | soft-coated_wheaten_terrier | 0.082086 | True |
10 | 666063827256086533 | https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg | 1 | golden_retriever | 0.775930 | True | Tibetan_mastiff | 0.093718 | True | Labrador_retriever | 0.072427 | True |
11 | 666071193221509120 | https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg | 1 | Gordon_setter | 0.503672 | True | Yorkshire_terrier | 0.174201 | True | Pekinese | 0.109454 | True |
12 | 666073100786774016 | https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg | 1 | Walker_hound | 0.260857 | True | English_foxhound | 0.175382 | True | Ibizan_hound | 0.097471 | True |
13 | 666082916733198337 | https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg | 1 | pug | 0.489814 | True | bull_mastiff | 0.404722 | True | French_bulldog | 0.048960 | True |
14 | 666094000022159362 | https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg | 1 | bloodhound | 0.195217 | True | German_shepherd | 0.078260 | True | malinois | 0.075628 | True |
15 | 666099513787052032 | https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg | 1 | Lhasa | 0.582330 | True | Shih-Tzu | 0.166192 | True | Dandie_Dinmont | 0.089688 | True |
16 | 666102155909144576 | https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg | 1 | English_setter | 0.298617 | True | Newfoundland | 0.149842 | True | borzoi | 0.133649 | True |
17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 0.033919 | False | partridge | 0.000052 | False |
18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 0.085547 | False | bookcase | 0.079480 | False |
19 | 666273097616637952 | https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg | 1 | Italian_greyhound | 0.176053 | True | toy_terrier | 0.111884 | True | basenji | 0.111152 | True |
20 | 666287406224695296 | https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg | 1 | Maltese_dog | 0.857531 | True | toy_poodle | 0.063064 | True | miniature_poodle | 0.025581 | True |
21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 0.015250 | False | great_grey_owl | 0.013207 | False |
22 | 666337882303524864 | https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg | 1 | ox | 0.416669 | False | Newfoundland | 0.278407 | True | groenendael | 0.102643 | True |
23 | 666345417576210432 | https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg | 1 | golden_retriever | 0.858744 | True | Chesapeake_Bay_retriever | 0.054787 | True | Labrador_retriever | 0.014241 | True |
24 | 666353288456101888 | https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg | 1 | malamute | 0.336874 | True | Siberian_husky | 0.147655 | True | Eskimo_dog | 0.093412 | True |
25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 0.002402 | False | hamster | 0.000461 | False |
26 | 666373753744588802 | https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg | 1 | soft-coated_wheaten_terrier | 0.326467 | True | Afghan_hound | 0.259551 | True | briard | 0.206803 | True |
27 | 666396247373291520 | https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg | 1 | Chihuahua | 0.978108 | True | toy_terrier | 0.009397 | True | papillon | 0.004577 | True |
28 | 666407126856765440 | https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg | 1 | black-and-tan_coonhound | 0.529139 | True | bloodhound | 0.244220 | True | flat-coated_retriever | 0.173810 | True |
29 | 666411507551481857 | https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg | 1 | coho | 0.404640 | False | barracouta | 0.271485 | False | gar | 0.189945 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2044 | 886258384151887873 | https://pbs.twimg.com/media/DEyfTG4UMAE4aE9.jpg | 1 | pug | 0.943575 | True | shower_cap | 0.025286 | False | Siamese_cat | 0.002849 | False |
2045 | 886366144734445568 | https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg | 1 | French_bulldog | 0.999201 | True | Chihuahua | 0.000361 | True | Boston_bull | 0.000076 | True |
2046 | 886680336477933568 | https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg | 1 | convertible | 0.738995 | False | sports_car | 0.139952 | False | car_wheel | 0.044173 | False |
2047 | 886736880519319552 | https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg | 1 | kuvasz | 0.309706 | True | Great_Pyrenees | 0.186136 | True | Dandie_Dinmont | 0.086346 | True |
2048 | 886983233522544640 | https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg | 2 | Chihuahua | 0.793469 | True | toy_terrier | 0.143528 | True | can_opener | 0.032253 | False |
2049 | 887101392804085760 | https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg | 1 | Samoyed | 0.733942 | True | Eskimo_dog | 0.035029 | True | Staffordshire_bullterrier | 0.029705 | True |
2050 | 887343217045368832 | https://pbs.twimg.com/ext_tw_video_thumb/88734... | 1 | Mexican_hairless | 0.330741 | True | sea_lion | 0.275645 | False | Weimaraner | 0.134203 | True |
2052 | 887517139158093824 | https://pbs.twimg.com/ext_tw_video_thumb/88751... | 1 | limousine | 0.130432 | False | tow_truck | 0.029175 | False | shopping_cart | 0.026321 | False |
2053 | 887705289381826560 | https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg | 1 | basset | 0.821664 | True | redbone | 0.087582 | True | Weimaraner | 0.026236 | True |
2054 | 888078434458587136 | https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg | 1 | French_bulldog | 0.995026 | True | pug | 0.000932 | True | bull_mastiff | 0.000903 | True |
2055 | 888202515573088257 | https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg | 2 | Pembroke | 0.809197 | True | Rhodesian_ridgeback | 0.054950 | True | beagle | 0.038915 | True |
2056 | 888554962724278272 | https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg | 3 | Siberian_husky | 0.700377 | True | Eskimo_dog | 0.166511 | True | malamute | 0.111411 | True |
2057 | 888804989199671297 | https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg | 1 | golden_retriever | 0.469760 | True | Labrador_retriever | 0.184172 | True | English_setter | 0.073482 | True |
2058 | 888917238123831296 | https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg | 1 | golden_retriever | 0.714719 | True | Tibetan_mastiff | 0.120184 | True | Labrador_retriever | 0.105506 | True |
2059 | 889278841981685760 | https://pbs.twimg.com/ext_tw_video_thumb/88927... | 1 | whippet | 0.626152 | True | borzoi | 0.194742 | True | Saluki | 0.027351 | True |
2060 | 889531135344209921 | https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg | 1 | golden_retriever | 0.953442 | True | Labrador_retriever | 0.013834 | True | redbone | 0.007958 | True |
2061 | 889638837579907072 | https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg | 1 | French_bulldog | 0.991650 | True | boxer | 0.002129 | True | Staffordshire_bullterrier | 0.001498 | True |
2062 | 889665388333682689 | https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg | 1 | Pembroke | 0.966327 | True | Cardigan | 0.027356 | True | basenji | 0.004633 | True |
2063 | 889880896479866881 | https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg | 1 | French_bulldog | 0.377417 | True | Labrador_retriever | 0.151317 | True | muzzle | 0.082981 | False |
2064 | 890006608113172480 | https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg | 1 | Samoyed | 0.957979 | True | Pomeranian | 0.013884 | True | chow | 0.008167 | True |
2065 | 890240255349198849 | https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg | 1 | Pembroke | 0.511319 | True | Cardigan | 0.451038 | True | Chihuahua | 0.029248 | True |
2066 | 890609185150312448 | https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg | 1 | Irish_terrier | 0.487574 | True | Irish_setter | 0.193054 | True | Chesapeake_Bay_retriever | 0.118184 | True |
2067 | 890729181411237888 | https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg | 2 | Pomeranian | 0.566142 | True | Eskimo_dog | 0.178406 | True | Pembroke | 0.076507 | True |
2068 | 890971913173991426 | https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg | 1 | Appenzeller | 0.341703 | True | Border_collie | 0.199287 | True | ice_lolly | 0.193548 | False |
2069 | 891087950875897856 | https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg | 1 | Chesapeake_Bay_retriever | 0.425595 | True | Irish_terrier | 0.116317 | True | Indian_elephant | 0.076902 | False |
2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
2009 rows × 12 columns
clean_images.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2009 entries, 0 to 2074 Data columns (total 12 columns): tweet_id 2009 non-null object jpg_url 2009 non-null object img_num 2009 non-null int64 p1 2009 non-null object p1_conf 2009 non-null float64 p1_dog 2009 non-null bool p2 2009 non-null object p2_conf 2009 non-null float64 p2_dog 2009 non-null bool p3 2009 non-null object p3_conf 2009 non-null float64 p3_dog 2009 non-null bool dtypes: bool(3), float64(3), int64(1), object(5) memory usage: 242.8+ KB
clean_images.p1.value_counts()
golden_retriever 139 Labrador_retriever 95 Pembroke 88 Chihuahua 79 pug 55 chow 41 Samoyed 40 Pomeranian 38 toy_poodle 38 malamute 29 cocker_spaniel 27 French_bulldog 26 Chesapeake_Bay_retriever 23 miniature_pinscher 22 seat_belt 21 Siberian_husky 20 German_shepherd 20 web_site 19 Staffordshire_bullterrier 19 Shetland_sheepdog 18 Maltese_dog 18 teddy 18 Eskimo_dog 18 Rottweiler 17 beagle 17 Shih-Tzu 17 Cardigan 17 kuvasz 16 Lakeland_terrier 16 Italian_greyhound 16 ... long-horned_beetle 1 bighorn 1 limousine 1 park_bench 1 sunglasses 1 robin 1 lynx 1 soccer_ball 1 water_bottle 1 grey_fox 1 carousel 1 flamingo 1 swab 1 cup 1 cheeseburger 1 snowmobile 1 alp 1 remote_control 1 lacewing 1 hummingbird 1 bee_eater 1 loupe 1 coral_reef 1 tricycle 1 barbell 1 timber_wolf 1 sundial 1 handkerchief 1 hotdog 1 Madagascar_cat 1 Name: p1, Length: 378, dtype: int64
clean_images.p2.value_counts()
Labrador_retriever 96 golden_retriever 83 Cardigan 72 Chihuahua 43 Chesapeake_Bay_retriever 40 French_bulldog 40 Pomeranian 39 toy_poodle 36 Siberian_husky 33 cocker_spaniel 32 miniature_poodle 32 beagle 28 Eskimo_dog 27 collie 26 kuvasz 25 Pembroke 25 Italian_greyhound 22 Pekinese 21 miniature_pinscher 20 American_Staffordshire_terrier 20 chow 20 malinois 20 Boston_bull 19 toy_terrier 19 Samoyed 19 Norwegian_elkhound 19 Staffordshire_bullterrier 18 pug 17 Shih-Tzu 16 Irish_terrier 16 .. shower_curtain 1 dock 1 desk 1 cradle 1 hamper 1 snowmobile 1 house_finch 1 accordion 1 birdhouse 1 chimpanzee 1 snail 1 lighter 1 bib 1 dumbbell 1 wallaby 1 can_opener 1 porcupine 1 triceratops 1 promontory 1 crate 1 spotted_salamander 1 tree_frog 1 breakwater 1 common_newt 1 iPod 1 tarantula 1 grey_whale 1 mosquito_net 1 menu 1 pelican 1 Name: p2, Length: 405, dtype: int64
clean_images.p3.value_counts()
Labrador_retriever 76 Chihuahua 57 golden_retriever 46 Eskimo_dog 36 kelpie 33 chow 31 kuvasz 31 Staffordshire_bullterrier 30 beagle 30 Pekinese 29 toy_poodle 29 cocker_spaniel 29 Pomeranian 27 Pembroke 26 Great_Pyrenees 26 malamute 25 French_bulldog 25 Chesapeake_Bay_retriever 24 Cardigan 23 American_Staffordshire_terrier 23 pug 21 bull_mastiff 20 basenji 20 Siberian_husky 19 toy_terrier 18 Boston_bull 17 Shetland_sheepdog 17 boxer 16 Lakeland_terrier 16 dingo 15 .. mushroom 1 swimming_trunks 1 pop_bottle 1 meerkat 1 shovel 1 golfcart 1 barbell 1 chimpanzee 1 snail 1 notebook 1 ox 1 bib 1 vacuum 1 Indian_elephant 1 lion 1 acorn 1 chain 1 wallet 1 brown_bear 1 crayfish 1 drumstick 1 buckeye 1 traffic_light 1 desktop_computer 1 greenhouse 1 bow_tie 1 wok 1 plunger 1 common_newt 1 red_wolf 1 Name: p3, Length: 408, dtype: int64
clean_images.query("p1_conf < p2_conf or p2_conf < p3_conf or p1_conf < p3_conf")
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
---|
# get the first prediction flagged as a dog breed and its respective confidence
breed = []
confidence = []
def image(row):
    # predictions are ordered by confidence (p1 >= p2 >= p3), so take the first one that is a dog
    if row['p1_dog']:
        breed.append(row['p1'])
        confidence.append(row['p1_conf'])
    elif row['p2_dog']:
        breed.append(row['p2'])
        confidence.append(row['p2_conf'])
    elif row['p3_dog']:
        breed.append(row['p3'])
        confidence.append(row['p3_conf'])
    else: # none of the three predictions is a dog breed, so write "None"
        breed.append('None')
        confidence.append('None')
clean_images.apply(image, axis=1)
#create new columns
clean_images['breed'] = breed
clean_images['confidence'] = confidence
clean_images.breed.value_counts()
None 318 golden_retriever 158 Labrador_retriever 108 Pembroke 95 Chihuahua 91 pug 63 toy_poodle 51 chow 48 Samoyed 42 Pomeranian 42 malamute 33 Chesapeake_Bay_retriever 31 French_bulldog 31 cocker_spaniel 30 miniature_pinscher 25 Eskimo_dog 22 German_shepherd 21 Staffordshire_bullterrier 21 Cardigan 21 Shih-Tzu 20 Siberian_husky 20 beagle 20 Shetland_sheepdog 19 Maltese_dog 19 Rottweiler 19 kuvasz 19 Lakeland_terrier 18 basset 17 Italian_greyhound 17 American_Staffordshire_terrier 16 ... Welsh_springer_spaniel 4 Scottish_deerhound 4 Tibetan_terrier 4 keeshond 4 Weimaraner 4 Gordon_setter 4 Greater_Swiss_Mountain_dog 3 toy_terrier 3 Leonberg 3 briard 3 Irish_water_spaniel 3 curly-coated_retriever 3 komondor 3 cairn 3 Brabancon_griffon 3 Afghan_hound 3 Appenzeller 2 Australian_terrier 2 Sussex_spaniel 2 groenendael 2 wire-haired_fox_terrier 2 black-and-tan_coonhound 2 clumber 1 Japanese_spaniel 1 Irish_wolfhound 1 Scotch_terrier 1 silky_terrier 1 standard_schnauzer 1 Bouvier_des_Flandres 1 EntleBucher 1 Name: breed, Length: 114, dtype: int64
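The row-by-row `apply` above can also be written with column-wise operations. The following is a minimal sketch rather than the notebook's own method: the helper name `first_dog_prediction` and the small `demo` frame (four rows copied from the table shown earlier) are assumptions for illustration. It uses `Series.where` to fall back from `p1` to `p2` to `p3`, and it leaves rows without any dog prediction as `NaN` instead of the string `'None'`, which is a deliberate departure from the cell above.

```python
import pandas as pd

def first_dog_prediction(df):
    """Add 'breed' and 'confidence' taken from the first prediction flagged as a dog."""
    # Start from the lowest-ranked prediction; non-dog rows become NaN.
    breed = df['p3'].where(df['p3_dog'])
    conf = df['p3_conf'].where(df['p3_dog'])
    # Let higher-ranked predictions override whenever they are dogs.
    for n in (2, 1):
        is_dog = df[f'p{n}_dog']
        breed = df[f'p{n}'].where(is_dog, breed)
        conf = df[f'p{n}_conf'].where(is_dog, conf)
    return df.assign(breed=breed, confidence=conf)

# Hypothetical demo frame built from four rows of the table above.
demo = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'box_turtle', 'shopping_cart', 'ox'],
    'p1_conf': [0.465074, 0.933012, 0.962465, 0.416669],
    'p1_dog': [True, False, False, False],
    'p2': ['collie', 'mud_turtle', 'shopping_basket', 'Newfoundland'],
    'p2_conf': [0.156665, 0.045885, 0.014594, 0.278407],
    'p2_dog': [True, False, False, True],
    'p3': ['Shetland_sheepdog', 'terrapin', 'golden_retriever', 'groenendael'],
    'p3_conf': [0.061428, 0.017885, 0.007959, 0.102643],
    'p3_dog': [True, False, True, True],
})

result = first_dog_prediction(demo)
print(result[['breed', 'confidence']])
```

With this variant the confidence column stays numeric, and `result.dropna(subset=['breed'])` would take the place of a `!= 'None'` filter.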
# Remove rows where breed is "None" (doesn't align with our Project Motivation)
clean_images = clean_images[clean_images['breed'] != 'None']
# Now, drop these columns as they're no longer needed
clean_images = clean_images.drop(['img_num','p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf',
                                  'p3_dog'], axis=1)
clean_images
tweet_id | jpg_url | breed | confidence | |
---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | Welsh_springer_spaniel | 0.465074 |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | redbone | 0.506826 |
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | German_shepherd | 0.596461 |
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | Rhodesian_ridgeback | 0.408143 |
4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | miniature_pinscher | 0.560311 |
5 | 666050758794694657 | https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg | Bernese_mountain_dog | 0.651137 |
7 | 666055525042405380 | https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg | chow | 0.692517 |
8 | 666057090499244032 | https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg | golden_retriever | 0.00795896 |
9 | 666058600524156928 | https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg | miniature_poodle | 0.201493 |
10 | 666063827256086533 | https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg | golden_retriever | 0.77593 |
11 | 666071193221509120 | https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg | Gordon_setter | 0.503672 |
12 | 666073100786774016 | https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg | Walker_hound | 0.260857 |
13 | 666082916733198337 | https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg | pug | 0.489814 |
14 | 666094000022159362 | https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg | bloodhound | 0.195217 |
15 | 666099513787052032 | https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg | Lhasa | 0.58233 |
16 | 666102155909144576 | https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg | English_setter | 0.298617 |
19 | 666273097616637952 | https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg | Italian_greyhound | 0.176053 |
20 | 666287406224695296 | https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg | Maltese_dog | 0.857531 |
22 | 666337882303524864 | https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg | Newfoundland | 0.278407 |
23 | 666345417576210432 | https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg | golden_retriever | 0.858744 |
24 | 666353288456101888 | https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg | malamute | 0.336874 |
26 | 666373753744588802 | https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg | soft-coated_wheaten_terrier | 0.326467 |
27 | 666396247373291520 | https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg | Chihuahua | 0.978108 |
28 | 666407126856765440 | https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg | black-and-tan_coonhound | 0.529139 |
30 | 666418789513326592 | https://pbs.twimg.com/media/CT-YWb7U8AA7QnN.jpg | toy_terrier | 0.14968 |
31 | 666421158376562688 | https://pbs.twimg.com/media/CT-aggCXAAIMfT3.jpg | Blenheim_spaniel | 0.906777 |
32 | 666428276349472768 | https://pbs.twimg.com/media/CT-g-0DUwAEQdSn.jpg | Pembroke | 0.371361 |
33 | 666430724426358785 | https://pbs.twimg.com/media/CT-jNYqW4AAPi2M.jpg | Irish_terrier | 0.104109 |
34 | 666435652385423360 | https://pbs.twimg.com/media/CT-nsTQWEAEkyDn.jpg | Chesapeake_Bay_retriever | 0.18413 |
35 | 666437273139982337 | https://pbs.twimg.com/media/CT-pKmRWIAAxUWj.jpg | Chihuahua | 0.671853 |
... | ... | ... | ... | ... |
2041 | 885311592912609280 | https://pbs.twimg.com/media/C4bTH6nWMAAX_bJ.jpg | Labrador_retriever | 0.908703 |
2042 | 885528943205470208 | https://pbs.twimg.com/media/DEoH3yvXgAAzQtS.jpg | pug | 0.369275 |
2043 | 885984800019947520 | https://pbs.twimg.com/media/DEumeWWV0AA-Z61.jpg | Blenheim_spaniel | 0.972494 |
2044 | 886258384151887873 | https://pbs.twimg.com/media/DEyfTG4UMAE4aE9.jpg | pug | 0.943575 |
2045 | 886366144734445568 | https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg | French_bulldog | 0.999201 |
2047 | 886736880519319552 | https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg | kuvasz | 0.309706 |
2048 | 886983233522544640 | https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg | Chihuahua | 0.793469 |
2049 | 887101392804085760 | https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg | Samoyed | 0.733942 |
2050 | 887343217045368832 | https://pbs.twimg.com/ext_tw_video_thumb/88734... | Mexican_hairless | 0.330741 |
2053 | 887705289381826560 | https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg | basset | 0.821664 |
2054 | 888078434458587136 | https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg | French_bulldog | 0.995026 |
2055 | 888202515573088257 | https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg | Pembroke | 0.809197 |
2056 | 888554962724278272 | https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg | Siberian_husky | 0.700377 |
2057 | 888804989199671297 | https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg | golden_retriever | 0.46976 |
2058 | 888917238123831296 | https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg | golden_retriever | 0.714719 |
2059 | 889278841981685760 | https://pbs.twimg.com/ext_tw_video_thumb/88927... | whippet | 0.626152 |
2060 | 889531135344209921 | https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg | golden_retriever | 0.953442 |
2061 | 889638837579907072 | https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg | French_bulldog | 0.99165 |
2062 | 889665388333682689 | https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg | Pembroke | 0.966327 |
2063 | 889880896479866881 | https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg | French_bulldog | 0.377417 |
2064 | 890006608113172480 | https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg | Samoyed | 0.957979 |
2065 | 890240255349198849 | https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg | Pembroke | 0.511319 |
2066 | 890609185150312448 | https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg | Irish_terrier | 0.487574 |
2067 | 890729181411237888 | https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg | Pomeranian | 0.566142 |
2068 | 890971913173991426 | https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg | Appenzeller | 0.341703 |
2069 | 891087950875897856 | https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg | Chesapeake_Bay_retriever | 0.425595 |
2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | basset | 0.555712 |
2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | Labrador_retriever | 0.168086 |
2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | Chihuahua | 0.716012 |
2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | Chihuahua | 0.323581 |
1691 rows × 4 columns
Remove retweets to keep only the originals and to keep it consistent with the other dataframes.
clean_api.retweeted_status.value_counts()
original    2168
retweet      163
Name: retweeted_status, dtype: int64
clean_api = clean_api.query("retweeted_status=='original'")
clean_api.retweeted_status.value_counts()
original    2168
Name: retweeted_status, dtype: int64
The Twitter API dataframe should be merged with the other dataframes because they all describe the same entity (a tweet). Merge the dataframes into one finalized dataframe called twitter_archive_master.
#merge the first two dataframes we explored onto tweet_id column
archive_and_images = pd.merge(clean_archive,
clean_images,
how = 'left', on = ['tweet_id'])
archive_and_images = archive_and_images[archive_and_images['jpg_url'].notnull()]
archive_and_images.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1606 entries, 1 to 2096 Data columns (total 12 columns): tweet_id 1606 non-null object timestamp 1606 non-null datetime64[ns] source 1606 non-null object text 1606 non-null object expanded_urls 1606 non-null object rating_numerator 1606 non-null float64 rating_denominator 1606 non-null float64 name 1606 non-null object stage 262 non-null category jpg_url 1606 non-null object breed 1606 non-null object confidence 1606 non-null object dtypes: category(1), datetime64[ns](1), float64(2), object(8) memory usage: 152.3+ KB
twitter_archive_master = pd.merge(archive_and_images,
clean_api,
how = 'left', on = ['tweet_id'])
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1606 entries, 0 to 1605 Data columns (total 18 columns): tweet_id 1606 non-null object timestamp 1606 non-null datetime64[ns] source 1606 non-null object text 1606 non-null object expanded_urls 1606 non-null object rating_numerator 1606 non-null float64 rating_denominator 1606 non-null float64 name 1606 non-null object stage 262 non-null category jpg_url 1606 non-null object breed 1606 non-null object confidence 1606 non-null object favorite_count 1600 non-null float64 retweet_count 1600 non-null float64 followers_count 1600 non-null float64 friends_count 1600 non-null float64 url 1600 non-null object retweeted_status 1600 non-null object dtypes: category(1), datetime64[ns](1), float64(6), object(10) memory usage: 227.6+ KB
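The info output above shows 1600 non-null API values against 1606 rows, so a few tweets found no match in the API data. One way to locate such rows is the `indicator` option of `pd.merge`. The sketch below uses two tiny made-up frames (`left` and `right`, with invented IDs and counts) rather than the real dataframes, and `validate='one_to_one'` is an added assumption that each tweet_id appears at most once on each side.

```python
import pandas as pd

# Made-up stand-ins for archive_and_images and clean_api.
left = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                     'breed': ['pug', 'chow', 'basset']})
right = pd.DataFrame({'tweet_id': ['1', '3'],
                      'favorite_count': [100, 250]})

merged = pd.merge(left, right, how='left', on='tweet_id',
                  indicator=True,          # adds a '_merge' column
                  validate='one_to_one')   # raises if tweet_id is duplicated

# Rows present only in the left frame have no API data.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'tweet_id']
print(unmatched.tolist())
print(merged.dtypes)
```

The integer `favorite_count` column becomes float64 after the left merge because the unmatched row holds `NaN`, which is consistent with the float64 count columns in the info output above.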
Now that the three dataframes are merged into one, save the result to a file for later use.
# store the frame as a file
twitter_archive_master.to_csv('twitter_archive_master.csv',
index=False, encoding = 'utf-8')
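When the saved file is loaded again, pandas infers column types from the text, so tweet_id would come back as an integer and timestamp as a plain string unless told otherwise. A minimal sketch of a round trip, assuming the column names used in this notebook; the one-row `frame` and its timestamp are illustrative values, and an in-memory buffer stands in for the CSV file.

```python
import io
import pandas as pd

# Illustrative one-row frame; the timestamp is a placeholder value.
frame = pd.DataFrame({
    'tweet_id': ['666020888022790149'],
    'timestamp': pd.to_datetime(['2015-11-15 22:32:08']),
    'breed': ['Welsh_springer_spaniel'],
})

buffer = io.StringIO()            # stands in for 'twitter_archive_master.csv'
frame.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer,
                       dtype={'tweet_id': str},     # keep IDs as strings
                       parse_dates=['timestamp'])   # restore datetimes
print(restored.dtypes)
```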
In order to see which breeds have the most ratings, I will make a new dataframe that contains the breeds with the highest number of tweets assigned.
# select breeds with 50 or more tweets
most_tweeted = twitter_archive_master.groupby('breed').filter(lambda x: len(x) >= 50)
# bar chart
most_tweeted['breed'].value_counts().plot(kind = 'bar')
plt.title("WeRateDogs' Most Rated Breeds")
plt.xlabel('Breed')
plt.ylabel('Tweets Count')
fig = plt.gcf()
fig.set_size_inches(15,7)
/opt/conda/lib/python3.6/site-packages/matplotlib/font_manager.py:1316: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans (prop.get_family(), self.defaultFamily[fontext]))
By tweet count, the most rated breeds on the WeRateDogs Twitter account are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug.
# save the figure as file
fig.savefig('most_rated_breeds.png',bbox_inches='tight');
In order to show which dog stage is the most represented, I will display the frequencies of the stages via pie chart.
twitter_archive_master.stage.value_counts()
pupper     176
doggo       57
puppo       26
floofer      3
Name: stage, dtype: int64
# create pie chart with explode property
# https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f
labels = ['pupper', 'doggo', 'puppo', 'floofer']
sizes = [176, 57, 26, 3]
# colors
colors = ['magenta','lightgreen','yellow','orange']
plt.rcParams['text.color'] = 'black'
# explosion
explode = (0.09,0.09,0.09,0.09)
plt.pie(sizes, colors = colors, labels=labels, autopct='%1.0f%%', startangle=90, pctdistance=0.85,
explode = explode)
# draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# plot
plt.axis('equal') # to make sure pie is drawn as a circle (ax1 was never defined)
plt.tight_layout()
plt.show()
Looking at how often each dog stage appears in the tweets, "pupper" is the most represented at 67%. In contrast, "floofer" accounts for only 1%, which is significantly lower than the others. Note also that only 262 of the 1606 tweets have a stage at all, which suggests there may be data extraction issues.
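The percentages in the pie chart can be checked directly from the counts instead of being read off the figure. A small sketch using the four stage counts printed above and the 1606-row total from the info output; the variable names are illustrative.

```python
import pandas as pd

# Stage counts as printed by value_counts() above.
stage_counts = pd.Series({'pupper': 176, 'doggo': 57, 'puppo': 26, 'floofer': 3},
                         name='stage')

# Share of each stage among tweets that have a stage, in percent.
shares = (stage_counts / stage_counts.sum() * 100).round(1)
print(shares)

# Share of all 1606 tweets that have any stage assigned, in percent.
coverage = stage_counts.sum() / 1606 * 100
print(round(coverage, 1))
```

On the real dataframe, `twitter_archive_master.stage.value_counts(normalize=True)` should give the same shares as fractions, which would also avoid hard-coding the sizes in the pie chart cell.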
In order to see which breeds are most favorited by users, I will group breeds with their respective favorite counts.
user_fav_breed = twitter_archive_master.query('breed != "None"').groupby(['breed'])['favorite_count'].sum().sort_values(ascending = False)
# select top 5 only
top5 = user_fav_breed.head(5)
# bar chart
top5.plot(kind = 'bar')
plt.title('Most Favorited Breeds by Users')
plt.xlabel('Breed')
plt.ylabel('Favourite Count')
fig = plt.gcf()
fig.set_size_inches(15,7)
# save figure as file
fig.savefig('most_fav_breeds.png',bbox_inches='tight');
As expected, the breed with the highest total favorite count is Golden Retriever, followed by Labrador Retriever, Pembroke, and Chihuahua. I expected Pug to round out the top five, but French Bulldog turned out to be the fifth most favorited breed.
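Because the ranking above sums favorite counts, breeds with more tweets have an advantage. A possible cross-check is to look at the mean favorites per tweet next to the total. The sketch below uses a tiny frame with invented numbers purely to show the mechanics; it says nothing about the real results.

```python
import pandas as pd

# Invented numbers: three tweets for one breed, one tweet for another.
demo = pd.DataFrame({
    'breed': ['golden_retriever'] * 3 + ['French_bulldog'],
    'favorite_count': [10000, 12000, 8000, 20000],
})

# Count, total, and mean favorites per breed, ranked by mean.
summary = (demo.groupby('breed')['favorite_count']
               .agg(['count', 'sum', 'mean'])
               .sort_values('mean', ascending=False))
print(summary)
```

In this toy data the breed with the largest total is not the one with the largest mean, which is the kind of difference such a table would reveal.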
End of Wrangle Act. Please see Wrangle Report.