Twitter API Web Scraping & Data Analysis

In [ ]:
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"

Project Details

  1. Data Wrangling
    • Gathering Data
    • Assessing Data
    • Cleaning Data
  2. Storing, Analyzing, and Visualizing wrangled data
  3. Reporting data wrangling efforts, data analyses, and visualizations

Data Wrangling

Gathering Data

  1. Twitter Archive Enhanced
  2. Image Predictions
  3. Twitter API Data
In [299]:
# import packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import tweepy
import json
In [300]:
# load given twitter archive
archive = pd.read_csv('twitter-archive-enhanced.csv')
In [301]:
# count number of tweets from the archive (number of tweets with ratings only)
archive.shape[0]
Out[301]:
2356
In [302]:
# sort the DataFrame so the latest tweets come first
archive.sort_values('timestamp', ascending=False, inplace=True)
archive.head()  # inspect the data
Out[302]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
In [303]:
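# download the image predictions file from Udacity's server
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# save the response body under the file name taken from the end of the URL
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)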
In [304]:
# load the TSV file into a DataFrame called images, using tab as the separator
images = pd.read_csv('image-predictions.tsv', sep='\t')

# check to see if it was imported correctly 
images.head()
Out[304]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [305]:
# count tweets from image-predictions.tsv (number of tweets with images)
images.shape[0]
Out[305]:
2075
In [306]:
# set keys to establish connection (*removed for submission)
consumer_key = 'XXXXX'
consumer_secret = 'XXXXX'
access_token = 'XXXXX'
access_token_secret = 'XXXX'
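
The keys above were blanked out by hand before submission. One alternative, sketched here under assumptions, is to read them from environment variables so they never appear in the notebook. The helper name `load_twitter_credentials` and the environment variable names are hypothetical and not part of the original project.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, read from env (default: os.environ)."""
    env = os.environ if env is None else env
    # Collect every variable that is absent or empty so the error names them all.
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

With the variables set, `creds = load_twitter_credentials()` would give a dict whose values can be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the literals.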
In [307]:
# configure connection (http://docs.tweepy.org/en/v3.5.0/auth_tutorial.html)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# create an API object 
api = tweepy.API(auth,
                 parser = tweepy.parsers.JSONParser(),
                 wait_on_rate_limit = True,
                 wait_on_rate_limit_notify = True)
In [308]:
# download each tweet via the API and sort the results into two lists

# tweets that were retrieved (with JSONParser, each item is a plain dict)
tweets_list = []

# tweet ids for which no tweet could be retrieved
no_tweets_list = []

count = 0
for tweet_id in archive['tweet_id']:
    count += 1
    try:
        tweets_list.append(api.get_status(tweet_id))
    except Exception:
        no_tweets_list.append(tweet_id)

    # report progress every 50 items (the last message should read 2350)
    if count % 50 == 0:
        print(str(count) + " items processed ")
50 items processed 
100 items processed 
150 items processed 
200 items processed 
250 items processed 
300 items processed 
350 items processed 
400 items processed 
450 items processed 
500 items processed 
550 items processed 
600 items processed 
650 items processed 
700 items processed 
750 items processed 
800 items processed 
850 items processed 
900 items processed 
Rate limit reached. Sleeping for: 732
950 items processed 
1000 items processed 
1050 items processed 
1100 items processed 
1150 items processed 
1200 items processed 
1250 items processed 
1300 items processed 
1350 items processed 
1400 items processed 
1450 items processed 
1500 items processed 
1550 items processed 
1600 items processed 
1650 items processed 
1700 items processed 
1750 items processed 
1800 items processed 
Rate limit reached. Sleeping for: 734
1850 items processed 
1900 items processed 
1950 items processed 
2000 items processed 
2050 items processed 
2100 items processed 
2150 items processed 
2200 items processed 
2250 items processed 
2300 items processed 
2350 items processed 
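
The download loop records which ids failed but not why. A minimal sketch that keeps the error message for each failed id is shown below. The function `collect_statuses` and its `fetch` parameter are hypothetical names introduced for illustration; `fetch` stands in for a call such as `api.get_status`.

```python
def collect_statuses(tweet_ids, fetch, report_every=50):
    """Call fetch(tweet_id) for every id and return (statuses, failures).

    failures maps each id that raised an exception to the error message.
    """
    statuses = []
    failures = {}
    for count, tweet_id in enumerate(tweet_ids, start=1):
        try:
            statuses.append(fetch(tweet_id))
        except Exception as error:  # broad on purpose: any failure is recorded
            failures[tweet_id] = str(error)
        # Print a progress message at a fixed interval.
        if count % report_every == 0:
            print(f"{count} items processed")
    return statuses, failures
```

In the notebook this might be called as `tweets_list, failures = collect_statuses(archive['tweet_id'], api.get_status)`, after which `failures` can be inspected to see whether tweets were deleted, protected, or failed for another reason.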
In [309]:
# reference: https://stackabuse.com/writing-files-using-python/

# write the gathered tweets to tweet_json.txt, one JSON object per line
with open('tweet_json.txt', 'w') as outfile:
    for tweet_json in tweets_list:
        json.dump(tweet_json, outfile)  # serialize one tweet
        outfile.write('\n')  # end the line so each tweet sits on its own line
In [312]:
# attribute reference: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

# collect one dictionary of selected attributes per tweet
attr_list = []

# open the text file and parse each line as a JSON object
with open('tweet_json.txt', 'r') as json_file:
    for line in json_file:
        json_data = json.loads(line)

        # a 'retweeted_status' key is present only on retweets
        if 'retweeted_status' in json_data:
            retweeted_status = 'retweet'
            url = 'retweet'
        else:
            retweeted_status = 'original'
            # keep everything from the first 'https' onward; find() returns -1
            # when the text has no link, so the slice then keeps only the last
            # character of the text
            text = json_data['text']
            url = text[text.find('https'):]

        # build a dictionary of the selected attributes
        attributes = {'tweet_id': str(json_data['id']),
                      'favorite_count': int(json_data['favorite_count']),
                      'retweet_count': int(json_data['retweet_count']),
                      'followers_count': int(json_data['user']['followers_count']),
                      'friends_count': int(json_data['user']['friends_count']),
                      'url': url,
                      'retweeted_status': retweeted_status}

        # add the dictionary to our list
        attr_list.append(attributes)

# convert the list of dictionaries to a pandas DataFrame
api_data = pd.DataFrame(attr_list,
                        columns=['tweet_id',
                                 'favorite_count',
                                 'retweet_count',
                                 'followers_count',
                                 'friends_count',
                                 'url',
                                 'retweeted_status'])
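
The `find('https')` slice in the cell above has an edge case: when a tweet's text contains no link, `find` returns -1 and the slice keeps only the last character of the text. A sketch of a more defensive extraction follows. It uses a regular expression instead of the notebook's slicing, so it is a swapped-in technique rather than the original method, and the helper name `first_url` is hypothetical. It returns only the first link, whereas the original slice keeps everything from the first `https` to the end of the text.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def first_url(text):
    """Return the first URL found in text, or None when there is no URL."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None
```

Using `None` for tweets without a link makes them easy to spot later with `isna()`, instead of leaving a stray single character in the `url` column.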
In [313]:
#inspect
api_data.head()
Out[313]:
tweet_id favorite_count retweet_count followers_count friends_count url retweeted_status
0 892420643555336193 36357 7738 8767944 15 https://t.co/MgUWQ76dJU original
1 892177421306343426 31325 5728 8767944 15 https://t.co/aQFSeaCu9L original
2 891815181378084864 23596 3789 8767944 15 https://t.co/r0YlrsGCgy original
3 891689557279858688 39665 7901 8767944 15 https://t.co/tD36da7qLQ original
4 891327558926688256 37866 8523 8767944 15 https://t.co/0g0KMIVXZ3 original

Assessing Data

Visual Assessment using Jupyter

Now that all three datasets above have been gathered, I will assess each of them visually and programmatically for quality and tidiness issues.
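
As a rough illustration of what a programmatic check can look like alongside the visual pass, the sketch below summarizes row count, duplicate keys, missing values, and column types for one DataFrame. The helper name `assess` and the three-row `sample` frame are hypothetical toy values made up for the example, not the project's data.

```python
import pandas as pd


def assess(df, key="tweet_id"):
    """Return a small dictionary of summary checks for one DataFrame."""
    return {
        "rows": len(df),
        # Number of rows whose key repeats an earlier row's key.
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Count of missing values in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Column types as strings, useful for spotting ids stored as numbers.
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy data for illustration only.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "name": ["Phineas", None, "a"],
})
report = assess(sample)
```

The same function could be applied to `archive`, `images`, and `api_data` in turn to compare the three tables on equal terms.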

In [314]:
archive
Out[314]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
5 891087950875897856 NaN NaN 2017-07-29 00:08:17 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a majestic great white breaching ... NaN NaN NaN https://twitter.com/dog_rates/status/891087950... 13 10 None None None None None
6 890971913173991426 NaN NaN 2017-07-28 16:27:12 +0000 <a href="http://twitter.com/download/iphone" r... Meet Jax. He enjoys ice cream so much he gets ... NaN NaN NaN https://gofundme.com/ydvmve-surgery-for-jax,ht... 13 10 Jax None None None None
7 890729181411237888 NaN NaN 2017-07-28 00:22:40 +0000 <a href="http://twitter.com/download/iphone" r... When you watch your owner call another dog a g... NaN NaN NaN https://twitter.com/dog_rates/status/890729181... 13 10 None None None None None
8 890609185150312448 NaN NaN 2017-07-27 16:25:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Zoey. She doesn't want to be one of th... NaN NaN NaN https://twitter.com/dog_rates/status/890609185... 13 10 Zoey None None None None
9 890240255349198849 NaN NaN 2017-07-26 15:59:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Cassie. She is a college pup. Studying... NaN NaN NaN https://twitter.com/dog_rates/status/890240255... 14 10 Cassie doggo None None None
10 890006608113172480 NaN NaN 2017-07-26 00:31:25 +0000 <a href="http://twitter.com/download/iphone" r... This is Koda. He is a South Australian decksha... NaN NaN NaN https://twitter.com/dog_rates/status/890006608... 13 10 Koda None None None None
11 889880896479866881 NaN NaN 2017-07-25 16:11:53 +0000 <a href="http://twitter.com/download/iphone" r... This is Bruno. He is a service shark. Only get... NaN NaN NaN https://twitter.com/dog_rates/status/889880896... 13 10 Bruno None None None None
12 889665388333682689 NaN NaN 2017-07-25 01:55:32 +0000 <a href="http://twitter.com/download/iphone" r... Here's a puppo that seems to be on the fence a... NaN NaN NaN https://twitter.com/dog_rates/status/889665388... 13 10 None None None None puppo
13 889638837579907072 NaN NaN 2017-07-25 00:10:02 +0000 <a href="http://twitter.com/download/iphone" r... This is Ted. He does his best. Sometimes that'... NaN NaN NaN https://twitter.com/dog_rates/status/889638837... 12 10 Ted None None None None
14 889531135344209921 NaN NaN 2017-07-24 17:02:04 +0000 <a href="http://twitter.com/download/iphone" r... This is Stuart. He's sporting his favorite fan... NaN NaN NaN https://twitter.com/dog_rates/status/889531135... 13 10 Stuart None None None puppo
15 889278841981685760 NaN NaN 2017-07-24 00:19:32 +0000 <a href="http://twitter.com/download/iphone" r... This is Oliver. You're witnessing one of his m... NaN NaN NaN https://twitter.com/dog_rates/status/889278841... 13 10 Oliver None None None None
16 888917238123831296 NaN NaN 2017-07-23 00:22:39 +0000 <a href="http://twitter.com/download/iphone" r... This is Jim. He found a fren. Taught him how t... NaN NaN NaN https://twitter.com/dog_rates/status/888917238... 12 10 Jim None None None None
17 888804989199671297 NaN NaN 2017-07-22 16:56:37 +0000 <a href="http://twitter.com/download/iphone" r... This is Zeke. He has a new stick. Very proud o... NaN NaN NaN https://twitter.com/dog_rates/status/888804989... 13 10 Zeke None None None None
18 888554962724278272 NaN NaN 2017-07-22 00:23:06 +0000 <a href="http://twitter.com/download/iphone" r... This is Ralphus. He's powering up. Attempting ... NaN NaN NaN https://twitter.com/dog_rates/status/888554962... 13 10 Ralphus None None None None
19 888202515573088257 NaN NaN 2017-07-21 01:02:36 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Canela. She attempted s... 8.874740e+17 4.196984e+09 2017-07-19 00:47:34 +0000 https://twitter.com/dog_rates/status/887473957... 13 10 Canela None None None None
20 888078434458587136 NaN NaN 2017-07-20 16:49:33 +0000 <a href="http://twitter.com/download/iphone" r... This is Gerald. He was just told he didn't get... NaN NaN NaN https://twitter.com/dog_rates/status/888078434... 12 10 Gerald None None None None
21 887705289381826560 NaN NaN 2017-07-19 16:06:48 +0000 <a href="http://twitter.com/download/iphone" r... This is Jeffrey. He has a monopoly on the pool... NaN NaN NaN https://twitter.com/dog_rates/status/887705289... 13 10 Jeffrey None None None None
22 887517139158093824 NaN NaN 2017-07-19 03:39:09 +0000 <a href="http://twitter.com/download/iphone" r... I've yet to rate a Venezuelan Hover Wiener. Th... NaN NaN NaN https://twitter.com/dog_rates/status/887517139... 14 10 such None None None None
23 887473957103951883 NaN NaN 2017-07-19 00:47:34 +0000 <a href="http://twitter.com/download/iphone" r... This is Canela. She attempted some fancy porch... NaN NaN NaN https://twitter.com/dog_rates/status/887473957... 13 10 Canela None None None None
24 887343217045368832 NaN NaN 2017-07-18 16:08:03 +0000 <a href="http://twitter.com/download/iphone" r... You may not have known you needed to see this ... NaN NaN NaN https://twitter.com/dog_rates/status/887343217... 13 10 None None None None None
25 887101392804085760 NaN NaN 2017-07-18 00:07:08 +0000 <a href="http://twitter.com/download/iphone" r... This... is a Jubilant Antarctic House Bear. We... NaN NaN NaN https://twitter.com/dog_rates/status/887101392... 12 10 None None None None None
26 886983233522544640 NaN NaN 2017-07-17 16:17:36 +0000 <a href="http://twitter.com/download/iphone" r... This is Maya. She's very shy. Rarely leaves he... NaN NaN NaN https://twitter.com/dog_rates/status/886983233... 13 10 Maya None None None None
27 886736880519319552 NaN NaN 2017-07-16 23:58:41 +0000 <a href="http://twitter.com/download/iphone" r... This is Mingus. He's a wonderful father to his... NaN NaN NaN https://www.gofundme.com/mingusneedsus,https:/... 13 10 Mingus None None None None
28 886680336477933568 NaN NaN 2017-07-16 20:14:00 +0000 <a href="http://twitter.com/download/iphone" r... This is Derek. He's late for a dog meeting. 13... NaN NaN NaN https://twitter.com/dog_rates/status/886680336... 13 10 Derek None None None None
29 886366144734445568 NaN NaN 2017-07-15 23:25:31 +0000 <a href="http://twitter.com/download/iphone" r... This is Roscoe. Another pupper fallen victim t... NaN NaN NaN https://twitter.com/dog_rates/status/886366144... 12 10 Roscoe None None pupper None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2326 666411507551481857 NaN NaN 2015-11-17 00:24:19 +0000 <a href="http://twitter.com/download/iphone" r... This is quite the dog. Gets really excited whe... NaN NaN NaN https://twitter.com/dog_rates/status/666411507... 2 10 quite None None None None
2327 666407126856765440 NaN NaN 2015-11-17 00:06:54 +0000 <a href="http://twitter.com/download/iphone" r... This is a southern Vesuvius bumblegruff. Can d... NaN NaN NaN https://twitter.com/dog_rates/status/666407126... 7 10 a None None None None
2328 666396247373291520 NaN NaN 2015-11-16 23:23:41 +0000 <a href="http://twitter.com/download/iphone" r... Oh goodness. A super rare northeast Qdoba kang... NaN NaN NaN https://twitter.com/dog_rates/status/666396247... 9 10 None None None None None
2329 666373753744588802 NaN NaN 2015-11-16 21:54:18 +0000 <a href="http://twitter.com/download/iphone" r... Those are sunglasses and a jean jacket. 11/10 ... NaN NaN NaN https://twitter.com/dog_rates/status/666373753... 11 10 None None None None None
2330 666362758909284353 NaN NaN 2015-11-16 21:10:36 +0000 <a href="http://twitter.com/download/iphone" r... Unique dog here. Very small. Lives in containe... NaN NaN NaN https://twitter.com/dog_rates/status/666362758... 6 10 None None None None None
2331 666353288456101888 NaN NaN 2015-11-16 20:32:58 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a mixed Asiago from the Galápagos... NaN NaN NaN https://twitter.com/dog_rates/status/666353288... 8 10 None None None None None
2332 666345417576210432 NaN NaN 2015-11-16 20:01:42 +0000 <a href="http://twitter.com/download/iphone" r... Look at this jokester thinking seat belt laws ... NaN NaN NaN https://twitter.com/dog_rates/status/666345417... 10 10 None None None None None
2333 666337882303524864 NaN NaN 2015-11-16 19:31:45 +0000 <a href="http://twitter.com/download/iphone" r... This is an extremely rare horned Parthenon. No... NaN NaN NaN https://twitter.com/dog_rates/status/666337882... 9 10 an None None None None
2334 666293911632134144 NaN NaN 2015-11-16 16:37:02 +0000 <a href="http://twitter.com/download/iphone" r... This is a funny dog. Weird toes. Won't come do... NaN NaN NaN https://twitter.com/dog_rates/status/666293911... 3 10 a None None None None
2335 666287406224695296 NaN NaN 2015-11-16 16:11:11 +0000 <a href="http://twitter.com/download/iphone" r... This is an Albanian 3 1/2 legged Episcopalian... NaN NaN NaN https://twitter.com/dog_rates/status/666287406... 1 2 an None None None None
2336 666273097616637952 NaN NaN 2015-11-16 15:14:19 +0000 <a href="http://twitter.com/download/iphone" r... Can take selfies 11/10 https://t.co/ws2AMaNwPW NaN NaN NaN https://twitter.com/dog_rates/status/666273097... 11 10 None None None None None
2337 666268910803644416 NaN NaN 2015-11-16 14:57:41 +0000 <a href="http://twitter.com/download/iphone" r... Very concerned about fellow dog trapped in com... NaN NaN NaN https://twitter.com/dog_rates/status/666268910... 10 10 None None None None None
2338 666104133288665088 NaN NaN 2015-11-16 04:02:55 +0000 <a href="http://twitter.com/download/iphone" r... Not familiar with this breed. No tail (weird).... NaN NaN NaN https://twitter.com/dog_rates/status/666104133... 1 10 None None None None None
2339 666102155909144576 NaN NaN 2015-11-16 03:55:04 +0000 <a href="http://twitter.com/download/iphone" r... Oh my. Here you are seeing an Adobe Setter giv... NaN NaN NaN https://twitter.com/dog_rates/status/666102155... 11 10 None None None None None
2340 666099513787052032 NaN NaN 2015-11-16 03:44:34 +0000 <a href="http://twitter.com/download/iphone" r... Can stand on stump for what seems like a while... NaN NaN NaN https://twitter.com/dog_rates/status/666099513... 8 10 None None None None None
2341 666094000022159362 NaN NaN 2015-11-16 03:22:39 +0000 <a href="http://twitter.com/download/iphone" r... This appears to be a Mongolian Presbyterian mi... NaN NaN NaN https://twitter.com/dog_rates/status/666094000... 9 10 None None None None None
2342 666082916733198337 NaN NaN 2015-11-16 02:38:37 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a well-established sunblockerspan... NaN NaN NaN https://twitter.com/dog_rates/status/666082916... 6 10 None None None None None
2343 666073100786774016 NaN NaN 2015-11-16 01:59:36 +0000 <a href="http://twitter.com/download/iphone" r... Let's hope this flight isn't Malaysian (lol). ... NaN NaN NaN https://twitter.com/dog_rates/status/666073100... 10 10 None None None None None
2344 666071193221509120 NaN NaN 2015-11-16 01:52:02 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a northern speckled Rhododendron.... NaN NaN NaN https://twitter.com/dog_rates/status/666071193... 9 10 None None None None None
2345 666063827256086533 NaN NaN 2015-11-16 01:22:45 +0000 <a href="http://twitter.com/download/iphone" r... This is the happiest dog you will ever see. Ve... NaN NaN NaN https://twitter.com/dog_rates/status/666063827... 10 10 the None None None None
2346 666058600524156928 NaN NaN 2015-11-16 01:01:59 +0000 <a href="http://twitter.com/download/iphone" r... Here is the Rand Paul of retrievers folks! He'... NaN NaN NaN https://twitter.com/dog_rates/status/666058600... 8 10 the None None None None
2347 666057090499244032 NaN NaN 2015-11-16 00:55:59 +0000 <a href="http://twitter.com/download/iphone" r... My oh my. This is a rare blond Canadian terrie... NaN NaN NaN https://twitter.com/dog_rates/status/666057090... 9 10 a None None None None
2348 666055525042405380 NaN NaN 2015-11-16 00:49:46 +0000 <a href="http://twitter.com/download/iphone" r... Here is a Siberian heavily armored polar bear ... NaN NaN NaN https://twitter.com/dog_rates/status/666055525... 10 10 a None None None None
2349 666051853826850816 NaN NaN 2015-11-16 00:35:11 +0000 <a href="http://twitter.com/download/iphone" r... This is an odd dog. Hard on the outside but lo... NaN NaN NaN https://twitter.com/dog_rates/status/666051853... 2 10 an None None None None
2350 666050758794694657 NaN NaN 2015-11-16 00:30:50 +0000 <a href="http://twitter.com/download/iphone" r... This is a truly beautiful English Wilson Staff... NaN NaN NaN https://twitter.com/dog_rates/status/666050758... 10 10 a None None None None
2351 666049248165822465 NaN NaN 2015-11-16 00:24:50 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a 1949 1st generation vulpix. Enj... NaN NaN NaN https://twitter.com/dog_rates/status/666049248... 5 10 None None None None None
2352 666044226329800704 NaN NaN 2015-11-16 00:04:52 +0000 <a href="http://twitter.com/download/iphone" r... This is a purebred Piers Morgan. Loves to Netf... NaN NaN NaN https://twitter.com/dog_rates/status/666044226... 6 10 a None None None None
2353 666033412701032449 NaN NaN 2015-11-15 23:21:54 +0000 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... NaN NaN NaN https://twitter.com/dog_rates/status/666033412... 9 10 a None None None None
2354 666029285002620928 NaN NaN 2015-11-15 23:05:30 +0000 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... NaN NaN NaN https://twitter.com/dog_rates/status/666029285... 7 10 a None None None None
2355 666020888022790149 NaN NaN 2015-11-15 22:32:08 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a Japanese Irish Setter. Lost eye... NaN NaN NaN https://twitter.com/dog_rates/status/666020888... 8 10 None None None None None

2356 rows × 17 columns

In [317]:
api_data 
Out[317]:
tweet_id favorite_count retweet_count followers_count friends_count url retweeted_status
0 892420643555336193 36357 7738 8767944 15 https://t.co/MgUWQ76dJU original
1 892177421306343426 31325 5728 8767944 15 https://t.co/aQFSeaCu9L original
2 891815181378084864 23596 3789 8767944 15 https://t.co/r0YlrsGCgy original
3 891689557279858688 39665 7901 8767944 15 https://t.co/tD36da7qLQ original
4 891327558926688256 37866 8523 8767944 15 https://t.co/0g0KMIVXZ3 original
5 891087950875897856 19070 2852 8767944 15 https://t.co/xx5cilW0Dd original
6 890971913173991426 11108 1858 8767944 15 https://t.co/MV01Q820LT original
7 890729181411237888 61264 17266 8767944 15 https://t.co/hrcFOGi12V original
8 890609185150312448 26248 3933 8767944 15 https://t.co/UkrdQyoYxV original
9 890240255349198849 29985 6728 8767944 15 https://t.co/l3TSS3o2M0 original
10 890006608113172480 28878 6719 8767944 15 https://t.co/jy2alqaYPH original
11 889880896479866881 26239 4569 8767944 15 https://t.co/u1XPQMl29g original
12 889665388333682689 45211 9164 8767944 15 https://t.co/BxvuXk0UCm original
13 889638837579907072 25448 4114 8767944 15 https://t.co/f8dEDcrKSR original
14 889531135344209921 14233 2063 8767944 15 https://t.co/Ya3bdfYuCx original
15 889278841981685760 23712 4883 8767944 15 https://t.co/qw5xRwBaIv original
16 888917238123831296 27406 4123 8767944 15 https://t.co/chxruIOUJN original
17 888804989199671297 24076 3888 8767944 15 https://t.co/L5cCoR2huQ original
18 888554962724278272 18597 3190 8767944 15 https://t.co/YnYAFCTTiK original
19 888078434458587136 20501 3170 8767944 15 https://t.co/UibDBnVpvR original
20 887705289381826560 28459 4932 8767944 15 https://t.co/m7BR4JgDrZ original
21 887517139158093824 43667 10784 8767944 15 https://t.co/20VrLAA8ba original
22 887473957103951883 64740 16494 8767944 15 https://t.co/cLyzpcUcMX original
23 887343217045368832 31676 9585 8767944 15 https://t.co/WZqNqygEyV original
24 887101392804085760 28825 5456 8767944 15 https://t.co/UgW2xbO7Rg original
25 886983233522544640 32909 7025 8767944 15 https://t.co/I6oNy0CgiT original
26 886736880519319552 11293 2945 8767944 15 https://t.co/RipKYZ0T1u original
27 886680336477933568 21159 4090 8767944 15 https://t.co/BCoWue0abA original
28 886366144734445568 19915 2900 8767944 15 https://t.co/iIu3AF8IWH original
29 886267009285017600 113 4 8767944 15 n original
... ... ... ... ... ... ... ...
2301 666411507551481857 414 302 8767948 15 https://t.co/aMCTNWO94t original
2302 666407126856765440 100 32 8767948 15 https://t.co/LopTBkKa8h original
2303 666396247373291520 159 76 8767948 15 https://t.co/Dc7b0E8qFE original
2304 666373753744588802 174 85 8767948 15 https://t.co/uHXrPkUEyl original
2305 666362758909284353 729 522 8767948 15 https://t.co/XMD9CwjEnM original
2306 666353288456101888 200 66 8767948 15 https://t.co/tltQ5w9aUO original
2307 666345417576210432 275 128 8767948 15 https://t.co/VFKG1vxGjB original
2308 666337882303524864 182 84 8767948 15 https://t.co/QpRjllzWAL original
2309 666293911632134144 469 322 8767948 15 https://t.co/IIXis0zta0 original
2310 666287406224695296 138 62 8767948 15 https://t.co/d9NcXFKwLv original
2311 666273097616637952 161 73 8767948 15 https://t.co/ws2AMaNwPW original
2312 666268910803644416 96 32 8767948 15 https://t.co/0yxApIikpk original
2313 666104133288665088 13711 6011 8767948 15 https://t.co/Asgdc6kuLX original
2314 666102155909144576 72 11 8767948 15 https://t.co/11LvqN4WLq original
2315 666099513787052032 143 61 8767948 15 https://t.co/Ri4nMTLq5C original
2316 666094000022159362 154 68 8767948 15 https://t.co/mnioXo3IfP original
2317 666082916733198337 103 42 8767948 15 https://t.co/3RU6x0vHB7 original
2318 666073100786774016 298 147 8767948 15 https://t.co/Yk6GHE9tOY original
2319 666071193221509120 136 54 8767948 15 https://t.co/ZoL8kq2XFx original
2320 666063827256086533 450 199 8767948 15 https://t.co/RhUEAloehK original
2321 666058600524156928 105 54 8767948 15 https://t.co/pYAJkAe76p original
2322 666057090499244032 270 128 8767948 15 https://t.co/yWBqbrzy8O original
2323 666055525042405380 411 222 8767948 15 https://t.co/rdivxLiqEt original
2324 666051853826850816 1129 781 8767948 15 https://t.co/v5A4vzSDdc original
2325 666050758794694657 124 51 8767948 15 https://t.co/fvIbQfHjIe original
2326 666049248165822465 96 41 8767948 15 https://t.co/4B7cOc1EDq original
2327 666044226329800704 273 132 8767948 15 https://t.co/DWnyCjf2mx original
2328 666033412701032449 112 41 8767948 15 https://t.co/y671yMhoiR original
2329 666029285002620928 121 42 8767948 15 https://t.co/r7mOb2m0UI original
2330 666020888022790149 2424 465 8767948 15 https://t.co/BLDqew2Ijj original

2331 rows × 7 columns

In [318]:
images
Out[318]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
5 666050758794694657 https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg 1 Bernese_mountain_dog 0.651137 True English_springer 0.263788 True Greater_Swiss_Mountain_dog 0.016199 True
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
7 666055525042405380 https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg 1 chow 0.692517 True Tibetan_mastiff 0.058279 True fur_coat 0.054449 False
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg 1 shopping_cart 0.962465 False shopping_basket 0.014594 False golden_retriever 0.007959 True
9 666058600524156928 https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg 1 miniature_poodle 0.201493 True komondor 0.192305 True soft-coated_wheaten_terrier 0.082086 True
10 666063827256086533 https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg 1 golden_retriever 0.775930 True Tibetan_mastiff 0.093718 True Labrador_retriever 0.072427 True
11 666071193221509120 https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg 1 Gordon_setter 0.503672 True Yorkshire_terrier 0.174201 True Pekinese 0.109454 True
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg 1 Walker_hound 0.260857 True English_foxhound 0.175382 True Ibizan_hound 0.097471 True
13 666082916733198337 https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg 1 pug 0.489814 True bull_mastiff 0.404722 True French_bulldog 0.048960 True
14 666094000022159362 https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg 1 bloodhound 0.195217 True German_shepherd 0.078260 True malinois 0.075628 True
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg 1 Lhasa 0.582330 True Shih-Tzu 0.166192 True Dandie_Dinmont 0.089688 True
16 666102155909144576 https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg 1 English_setter 0.298617 True Newfoundland 0.149842 True borzoi 0.133649 True
17 666104133288665088 https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg 1 hen 0.965932 False cock 0.033919 False partridge 0.000052 False
18 666268910803644416 https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg 1 desktop_computer 0.086502 False desk 0.085547 False bookcase 0.079480 False
19 666273097616637952 https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg 1 Italian_greyhound 0.176053 True toy_terrier 0.111884 True basenji 0.111152 True
20 666287406224695296 https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg 1 Maltese_dog 0.857531 True toy_poodle 0.063064 True miniature_poodle 0.025581 True
21 666293911632134144 https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg 1 three-toed_sloth 0.914671 False otter 0.015250 False great_grey_owl 0.013207 False
22 666337882303524864 https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg 1 ox 0.416669 False Newfoundland 0.278407 True groenendael 0.102643 True
23 666345417576210432 https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg 1 golden_retriever 0.858744 True Chesapeake_Bay_retriever 0.054787 True Labrador_retriever 0.014241 True
24 666353288456101888 https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg 1 malamute 0.336874 True Siberian_husky 0.147655 True Eskimo_dog 0.093412 True
25 666362758909284353 https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg 1 guinea_pig 0.996496 False skunk 0.002402 False hamster 0.000461 False
26 666373753744588802 https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg 1 soft-coated_wheaten_terrier 0.326467 True Afghan_hound 0.259551 True briard 0.206803 True
27 666396247373291520 https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg 1 Chihuahua 0.978108 True toy_terrier 0.009397 True papillon 0.004577 True
28 666407126856765440 https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg 1 black-and-tan_coonhound 0.529139 True bloodhound 0.244220 True flat-coated_retriever 0.173810 True
29 666411507551481857 https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg 1 coho 0.404640 False barracouta 0.271485 False gar 0.189945 False
... ... ... ... ... ... ... ... ... ... ... ... ...
2045 886366144734445568 https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg 1 French_bulldog 0.999201 True Chihuahua 0.000361 True Boston_bull 0.000076 True
2046 886680336477933568 https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg 1 convertible 0.738995 False sports_car 0.139952 False car_wheel 0.044173 False
2047 886736880519319552 https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg 1 kuvasz 0.309706 True Great_Pyrenees 0.186136 True Dandie_Dinmont 0.086346 True
2048 886983233522544640 https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg 2 Chihuahua 0.793469 True toy_terrier 0.143528 True can_opener 0.032253 False
2049 887101392804085760 https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg 1 Samoyed 0.733942 True Eskimo_dog 0.035029 True Staffordshire_bullterrier 0.029705 True
2050 887343217045368832 https://pbs.twimg.com/ext_tw_video_thumb/88734... 1 Mexican_hairless 0.330741 True sea_lion 0.275645 False Weimaraner 0.134203 True
2051 887473957103951883 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2052 887517139158093824 https://pbs.twimg.com/ext_tw_video_thumb/88751... 1 limousine 0.130432 False tow_truck 0.029175 False shopping_cart 0.026321 False
2053 887705289381826560 https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg 1 basset 0.821664 True redbone 0.087582 True Weimaraner 0.026236 True
2054 888078434458587136 https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg 1 French_bulldog 0.995026 True pug 0.000932 True bull_mastiff 0.000903 True
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2056 888554962724278272 https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg 3 Siberian_husky 0.700377 True Eskimo_dog 0.166511 True malamute 0.111411 True
2057 888804989199671297 https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg 1 golden_retriever 0.469760 True Labrador_retriever 0.184172 True English_setter 0.073482 True
2058 888917238123831296 https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg 1 golden_retriever 0.714719 True Tibetan_mastiff 0.120184 True Labrador_retriever 0.105506 True
2059 889278841981685760 https://pbs.twimg.com/ext_tw_video_thumb/88927... 1 whippet 0.626152 True borzoi 0.194742 True Saluki 0.027351 True
2060 889531135344209921 https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg 1 golden_retriever 0.953442 True Labrador_retriever 0.013834 True redbone 0.007958 True
2061 889638837579907072 https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg 1 French_bulldog 0.991650 True boxer 0.002129 True Staffordshire_bullterrier 0.001498 True
2062 889665388333682689 https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg 1 Pembroke 0.966327 True Cardigan 0.027356 True basenji 0.004633 True
2063 889880896479866881 https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg 1 French_bulldog 0.377417 True Labrador_retriever 0.151317 True muzzle 0.082981 False
2064 890006608113172480 https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg 1 Samoyed 0.957979 True Pomeranian 0.013884 True chow 0.008167 True
2065 890240255349198849 https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg 1 Pembroke 0.511319 True Cardigan 0.451038 True Chihuahua 0.029248 True
2066 890609185150312448 https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg 1 Irish_terrier 0.487574 True Irish_setter 0.193054 True Chesapeake_Bay_retriever 0.118184 True
2067 890729181411237888 https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg 2 Pomeranian 0.566142 True Eskimo_dog 0.178406 True Pembroke 0.076507 True
2068 890971913173991426 https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg 1 Appenzeller 0.341703 True Border_collie 0.199287 True ice_lolly 0.193548 False
2069 891087950875897856 https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg 1 Chesapeake_Bay_retriever 0.425595 True Irish_terrier 0.116317 True Indian_elephant 0.076902 False
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg 2 basset 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg 1 paper_towel 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False

2075 rows × 12 columns

Programmatic Assessment¶

In [394]:
archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 411.3+ KB
In [395]:
sum(archive['tweet_id'].duplicated())
Out[395]:
0
In [396]:
# check timestamp column type
type(archive.timestamp[0])
Out[396]:
str

Issues

  • timestamp column is in string format
  • needs to be in datetime format
  • tweet_id column contains integers
  • needs to be strings because these are identifiers, not numbers used in calculations
  • 181 of the tweets are retweets
  • need to remove these 181 rows because they are essentially duplicates of the originals
  • 78 of the tweets are replies to other tweets
  • need to remove these 78 rows because they do not fit our project motivation
  • 4 columns for dog stages
  • this can be better organized as 1 column holding the 4 stage values
In [322]:
# check the ratings
archive.rating_numerator.value_counts()
Out[322]:
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

The counts above show the unique numerator values: 1776 is the highest, and there are odd-looking values such as 666.

In [323]:
#inspect highest rating 
archive.query("rating_numerator == 1776")
Out[323]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
979 749981277374128128 NaN NaN 2016-07-04 15:00:45 +0000 <a href="https://about.twitter.com/products/tw... This is Atticus. He's quite simply America af.... NaN NaN NaN https://twitter.com/dog_rates/status/749981277... 1776 10 Atticus None None None None
In [324]:
# second highest rating
archive.query("rating_numerator == 960")
Out[324]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
313 835246439529840640 8.352460e+17 26259576.0 2017-02-24 21:54:03 +0000 <a href="http://twitter.com/download/iphone" r... @jonnysun @Lin_Manuel ok jomny I know you're e... NaN NaN NaN NaN 960 0 None None None None None

The tweet above is a reply (in_reply_to_status_id is set) with a denominator of 0, which shows that ratings vary greatly and are not consistent throughout. This hints that we can use 10 as a common denominator.

In [325]:
# check for tweets that were extracted incorrectly 

# check whether tweets contain decimal ratings, leading to high values or a value like "666" instead of 6.66 (in text)
# a plain pattern without capture groups is enough for str.contains
archive[archive['text'].str.contains(r'\d+\.\d+/\d+')] 
Out[325]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
45 883482846933004288 NaN NaN 2017-07-08 00:28:19 +0000 <a href="http://twitter.com/download/iphone" r... This is Bella. She hopes her smile made you sm... NaN NaN NaN https://twitter.com/dog_rates/status/883482846... 5 10 Bella None None None None
340 832215909146226688 NaN NaN 2017-02-16 13:11:49 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Logan, the Chow who liv... 7.867091e+17 4.196984e+09 2016-10-13 23:23:56 +0000 https://twitter.com/dog_rates/status/786709082... 75 10 Logan None None None None
695 786709082849828864 NaN NaN 2016-10-13 23:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Logan, the Chow who lived. He solemnly... NaN NaN NaN https://twitter.com/dog_rates/status/786709082... 75 10 Logan None None None None
763 778027034220126208 NaN NaN 2016-09-20 00:24:34 +0000 <a href="http://twitter.com/download/iphone" r... This is Sophie. She's a Jubilant Bush Pupper. ... NaN NaN NaN https://twitter.com/dog_rates/status/778027034... 27 10 Sophie None None pupper None
1689 681340665377193984 6.813394e+17 4.196984e+09 2015-12-28 05:07:27 +0000 <a href="http://twitter.com/download/iphone" r... I've been told there's a slight possibility he... NaN NaN NaN NaN 5 10 None None None None None
1712 680494726643068929 NaN NaN 2015-12-25 21:06:00 +0000 <a href="http://twitter.com/download/iphone" r... Here we have uncovered an entire battalion of ... NaN NaN NaN https://twitter.com/dog_rates/status/680494726... 26 10 None None None None None

Issues

  • Extraction issue
  • Dataframe needs to be cleaned so that ratings mirror the original tweets
  • Numerator issue
  • Needs to be in floats (e.g., 11.26 was extracted incorrectly and gets a score of 26)
  • However, unusual values cannot simply be changed (e.g., 1776 is an actual rating and not a typo)
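
The decimal-rating problem listed above could be handled by re-extracting the rating from the tweet text. The following is a minimal sketch, not the cleaning done in this notebook: the frame `demo` and its texts are made up for illustration, and the pattern simply allows an optional decimal part in the numerator.

```python
import pandas as pd

# Hypothetical sample: texts are invented, not rows from the archive
demo = pd.DataFrame({
    "text": [
        "Here is a battalion of good dogs. 11.26/10 would pet",
        "This is a sample dog. 13/10 mystical boy",
        "Smiles a lot. 13.5/10 for the effort",
    ],
    "rating_numerator": [26, 13, 5],  # what an integer-only extraction would give
})

# Numerator with an optional decimal part, a slash, then the denominator
pattern = r"(\d+(?:\.\d+)?)/(\d+)"
extracted = demo["text"].str.extract(pattern, expand=True)

# Columns 0 and 1 hold the two capture groups as strings
demo["rating_numerator"] = extracted[0].astype(float)
demo["rating_denominator"] = extracted[1].astype(float)
print(demo[["rating_numerator", "rating_denominator"]])
```

`str.extract` returns only the first match in each text, so a tweet that contains another fraction before the rating (such as "24/7") would still need manual review.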
In [326]:
# check the rating_denominator column
archive.rating_denominator.value_counts()
Out[326]:
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
In [327]:
# select the tweet where the denominator is 0 and view its text ('text')
archive.query("rating_denominator == 0")['text']
Out[327]:
313    @jonnysun @Lin_Manuel ok jomny I know you're e...
Name: text, dtype: object
In [328]:
archive.query("rating_denominator == 7")['text']
Out[328]:
516    Meet Sam. She smiles 24/7 &amp; secretly aspir...
Name: text, dtype: object
In [329]:
archive.query("rating_denominator == 170")['text']
Out[329]:
1120    Say hello to this unbelievably well behaved sq...
Name: text, dtype: object

Issues

  • Denominator issues
  • Set common denominator as 10
  • Needs to be in floats too, since our numerators need to be changed to floats
  • Project Motivation states that "[t]he fact that the rating numerators are greater than the denominators does not need to be cleaned".
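
As an illustration of what a common denominator means numerically, the sketch below rescales each rating to an "out of 10" value. This is an alternative shown under assumptions, not the approach this notebook takes: the frame `demo`, its values, and the column name `rating_out_of_10` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, invented for illustration
demo = pd.DataFrame({
    "rating_numerator": [84.0, 13.0, 960.0],
    "rating_denominator": [70.0, 10.0, 0.0],
})

# Treat a zero denominator as missing so the division yields NaN instead of inf
safe_denominator = demo["rating_denominator"].replace(0, np.nan)

# Express every rating on a common "out of 10" scale
demo["rating_out_of_10"] = demo["rating_numerator"] / safe_denominator * 10
print(demo)
```

Rescaling keeps group ratings (several dogs rated in one tweet) comparable with single-dog ratings, whereas overwriting only the denominator leaves the large numerators in place.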
In [330]:
# check the name column
archive.name.value_counts()
Out[330]:
None        745
a            55
Charlie      12
Lucy         11
Oliver       11
Cooper       11
Tucker       10
Lola         10
Penny        10
Bo            9
Winston       9
the           8
Sadie         8
Daisy         7
Bailey        7
Toby          7
an            7
Buddy         7
Leo           6
Stanley       6
Bella         6
Jack          6
Milo          6
Koda          6
Jax           6
Scout         6
Dave          6
Rusty         6
Oscar         6
Alfie         5
           ... 
Tilly         1
Iggy          1
Maks          1
Vince         1
Monster       1
Damon         1
Chaz          1
Acro          1
Beya          1
Sunshine      1
Erik          1
Marq          1
Aqua          1
BeBe          1
Timmy         1
Teddy         1
Charl         1
Jimbo         1
Schnozz       1
Goliath       1
Carter        1
Claude        1
Mack          1
Andy          1
Grizz         1
Anthony       1
Kuyu          1
Skye          1
Perry         1
Jerome        1
Name: name, Length: 957, dtype: int64

The counts above include many questionable "names" such as "a" and "the", which are not names and signal an extraction error.

In [397]:
# check names that aren't capitalized b/c names should be capitalized 

# use regex where names start with lower-case letters 
non_capital_names = archive.name.str.contains(pat='^[a-z]', regex = True) 
archive[non_capital_names].name.value_counts()
Out[397]:
a               55
the              8
an               7
very             5
just             4
one              4
quite            4
not              2
getting          2
actually         2
mad              2
his              1
unacceptable     1
by               1
space            1
light            1
all              1
my               1
old              1
such             1
life             1
incredibly       1
this             1
officially       1
infuriating      1
Name: name, dtype: int64

Issues

  • Name issues
  • Another extraction error
  • Need to get rid of lower-case names, which aren't names after all
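
To illustrate how such non-names can arise and be avoided, here is a small sketch of name extraction that requires a capital letter. It is only an assumption about how names might be pulled from text: the intro phrases in `name_pattern` and the sample texts are hypothetical, modelled loosely on the style of the archive.

```python
import re

# Hypothetical tweets, written for illustration
texts = [
    "This is Phineas. He's a mystical boy.",
    "This is a very good dog. 12/10",
    "Meet Sam. She smiles a lot.",
]

# Assumed intro phrases; requiring a capital letter rejects words like "a"
name_pattern = re.compile(r"(?:This is|Meet|Say hello to)\s+([A-Z][a-z]+)")

def extract_name(text):
    """Return the first capitalized word after an intro phrase, or None."""
    match = name_pattern.search(text)
    return match.group(1) if match else None

names = [extract_name(text) for text in texts]
print(names)
```

A pattern without the capital-letter requirement would return "a" for the second text, which is the kind of value seen in the counts above.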
In [398]:
images.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

Issues

  • tweet_id column should be in string format
In [399]:
# evaluate a sample of the image predictions
images.sample(5)
Out[399]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
1899 851224888060895234 https://pbs.twimg.com/media/C9AohFoWsAUmxDs.jpg 3 car_mirror 0.971512 False seat_belt 0.007063 False standard_poodle 0.005683 True
1199 740711788199743490 https://pbs.twimg.com/media/CkeJcNkXEAAcrks.jpg 1 toy_poodle 0.388277 True Angora 0.180264 False Persian_cat 0.049656 False
381 673317986296586240 https://pbs.twimg.com/media/CVgbIobUYAEaeI3.jpg 2 miniature_pinscher 0.384099 True bloodhound 0.079923 True Rottweiler 0.068594 True
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg 1 Lhasa 0.582330 True Shih-Tzu 0.166192 True Dandie_Dinmont 0.089688 True
131 668297328638447616 https://pbs.twimg.com/media/CUZE4IWW4AAZmDf.jpg 1 king_penguin 0.606747 False ice_bear 0.264221 False Eskimo_dog 0.032784 True
In [400]:
# check for duplicates of url 
images.jpg_url.duplicated().value_counts()
Out[400]:
False    2009
True       66
Name: jpg_url, dtype: int64

Issues

  • 66 duplicated jpg_urls
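
Before dropping duplicates it can help to look at both copies of each repeated URL. The sketch below uses a made-up frame `demo` with placeholder ids and URLs; only `duplicated` and `drop_duplicates` are real pandas calls.

```python
import pandas as pd

# Hypothetical sample: ids and URLs are invented for illustration
demo = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "jpg_url": ["https://example.com/a.jpg", "https://example.com/b.jpg",
                "https://example.com/a.jpg", "https://example.com/c.jpg"],
})

# keep=False flags every member of a duplicate group, not just the repeats
dupes = demo[demo["jpg_url"].duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps one row per URL; keep='first' retains the earliest row
deduped = demo.drop_duplicates(subset="jpg_url", keep="first")
print(len(deduped))
```

Whether to keep the first or the last row of each group depends on which tweet is the original; a repeated image URL often comes from a retweet of an earlier tweet.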
In [401]:
# inspect what we generated from Twitter API 
api_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 7 columns):
tweet_id            2331 non-null object
favorite_count      2331 non-null int64
retweet_count       2331 non-null int64
followers_count     2331 non-null int64
friends_count       2331 non-null int64
url                 2331 non-null object
retweeted_status    2331 non-null object
dtypes: int64(4), object(3)
memory usage: 127.6+ KB
In [402]:
# inspect a sample
api_data.sample(5)
Out[402]:
tweet_id favorite_count retweet_count followers_count friends_count url retweeted_status
2177 668643542311546881 847 508 8767948 15 https://t.co/IiSdmhkC5N original
738 778624900596654080 4724 1029 8767947 15 https://t.co/0rRxyBQt32 original
1133 724405726123311104 5260 1641 8767945 15 https://t.co/nRUaKWnJfH original
1784 676864501615042560 2079 699 8767947 15 https://t.co/NJj2uP0atK original
1369 700029284593901568 2078 583 8767945 15 https://t.co/8Nuny4lLE3 original
In [403]:
# count for retweets and original tweets 
api_data.retweeted_status.value_counts()
Out[403]:
original    2168
retweet      163
Name: retweeted_status, dtype: int64

Assessment Overview¶

Twitter Archive¶

Quality¶

  • Quality Issue (1) : data contains 181 retweets
  • Get rid of rows that are retweets
  • Quality Issue (2) : data contains 78 replies
  • Get rid of rows that are replies
  • Quality Issue (3) : timestamp column is in string format
  • Change to datetime format
  • Quality Issue (4) : tweet_id column contains integers
  • Change to string format
  • Quality Issue (5) : rating_numerator column values need to be in floats
  • Change to floats programmatically
  • Quality Issue (6) : rating_denominator column needs one consistent value throughout (10)
  • Set all denominators to 10
  • Quality Issue (7) : rating_denominator values need to be in floats
  • Change to floats programmatically
  • Quality Issue (8) : name column contains non-name values
  • Replace them with string 'None'

Tidiness¶

  • Tidiness Issue (1) : Data contains 4 columns for dog stages, need just one column and have them as values

Image Prediction¶

Quality¶

  • Quality Issue (9) : Data contains 66 duplicated jpg_urls
  • Quality Issue (10) : tweet_id column needs to be in string format

Tidiness¶

  • Tidiness Issue (2) : Needs one column each for image prediction and confidence level

Twitter API Data¶

Quality¶

  • Quality Issue (11) : Data contains retweets, get rid of them to keep only the originals

Tidiness¶

  • Tidiness Issue (3) : Dataframe needs to be joined with the other two dataframes
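
The join named in Tidiness Issue (3) could look roughly like the sketch below. The three miniature frames and the name `master` are hypothetical stand-ins for the cleaned dataframes, and the choice of an inner join is an assumption.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned frames
archive_demo = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                             "rating_numerator": [13.0, 12.0, 11.0]})
images_demo = pd.DataFrame({"tweet_id": ["1", "2"],
                            "p1": ["Chihuahua", "pug"]})
api_demo = pd.DataFrame({"tweet_id": ["1", "3"],
                         "favorite_count": [100, 50]})

# Inner joins keep only tweets present in all three frames
master = (archive_demo
          .merge(images_demo, on="tweet_id", how="inner")
          .merge(api_demo, on="tweet_id", how="inner"))
print(master)
```

The key column must have the same dtype in every frame, which is one reason tweet_id is converted to string in both the archive and the image predictions.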
In [414]:
# copy all original data frames
clean_archive = archive.copy()
clean_images = images.copy()
clean_api = api_data.copy()

Data Cleaning¶

Quality Issue 1: (Twitter Archive) Remove Retweets¶

Define¶

The given Twitter Archive df contains 181 retweets. Remove rows that are retweets.

In [415]:
sum(clean_archive.retweeted_status_id.notnull())
Out[415]:
181

Code¶

In [416]:
clean_archive = clean_archive[clean_archive.retweeted_status_id.isna()]

Test¶

In [417]:
sum(clean_archive.retweeted_status_id.notnull())
Out[417]:
0
In [418]:
# we can remove retweet columns from the df 
clean_archive = clean_archive.drop(['retweeted_status_id',
                                    'retweeted_status_user_id',
                                    'retweeted_status_timestamp'], axis = 1)

Quality Issue 2: (Twitter Archive) Remove Replies¶

Define¶

The given Twitter Archive df contains 78 replies. Remove rows that are replies.

In [419]:
sum(clean_archive.in_reply_to_status_id.notnull())
Out[419]:
78

Code¶

In [420]:
clean_archive = clean_archive[clean_archive.in_reply_to_status_id.isna()]

Test¶

In [421]:
sum(clean_archive.in_reply_to_status_id.notnull())
Out[421]:
0
In [422]:
# we can also remove replies columns from the df
clean_archive = clean_archive.drop(['in_reply_to_status_id',
                                    'in_reply_to_user_id'], axis = 1)

Quality Issue 3: (Twitter Archive) Change timestamp column type (string) to datetime¶

Define¶

The timestamp column data type is string. Change it to datetime data type.

Code¶

In [423]:
clean_archive['timestamp'] = pd.to_datetime(clean_archive.timestamp)

Test¶

In [424]:
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 213.0+ KB

Quality Issue 4: (Twitter Archive) Change tweet_id data type from integer to string¶

Define¶

The tweet_id column data type is integer. Change it to string data type.
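
A short illustration of why identifiers are safer as strings: tweet ids are larger than the range a 64-bit float can represent exactly, so any step that turns them into floats (as happened to in_reply_to_status_id in the raw archive) silently changes digits. The sketch uses one tweet_id that appears in this notebook's data; the variable names are only for illustration.

```python
# A tweet_id that appears in this notebook's data
tweet_id = 892420643555336193

# A 64-bit float has a 53-bit significand, so an id this large gets rounded
as_float = float(tweet_id)
round_trip = int(as_float)

print(round_trip == tweet_id)   # the id does not survive the float round trip
print(str(tweet_id))            # a string keeps every digit
```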

Code¶

In [425]:
clean_archive['tweet_id'] = clean_archive['tweet_id'].apply(str)

Test¶

In [426]:
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(2), object(9)
memory usage: 213.0+ KB
In [427]:
type(clean_archive.tweet_id[0])
Out[427]:
str

Quality Issue 5: (Twitter Archive) rating_numerator column values need to be in floats¶

Define¶

Change rating_numerator column data type to float.

Code¶

In [428]:
clean_archive['rating_numerator'] = clean_archive['rating_numerator'].astype(float)

Test¶

In [429]:
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null float64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(9)
memory usage: 293.0+ KB

Quality Issue 6 : (Twitter Archive) rating_denominator column needs one consistent value all across¶

Define¶

As the common denominator should be 10, change all denominators that aren't equal to 10.

Code¶

In [430]:
# set every denominator to the common value 10
clean_archive["rating_denominator"] = 10

Test¶

In [431]:
clean_archive.query("rating_denominator!=10")
Out[431]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

Quality Issue 7 : (Twitter Archive) rating_denominator values need to be in floats¶

Define¶

For consistency with rating_numerator, change rating_denominator to float.

Code¶

In [432]:
clean_archive['rating_denominator'] = clean_archive['rating_denominator'].astype(float)

Test¶

In [433]:
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null float64
rating_denominator    2097 non-null float64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), float64(2), object(9)
memory usage: 293.0+ KB

Quality Issue 8 : (Twitter Archive) name column contains non-name values¶

Define¶

Replace all non-names in the name column with string 'None'.

Code¶

In [434]:
clean_archive.name.value_counts()
Out[434]:
None        603
a            55
Lucy         11
Charlie      11
Oliver       10
Cooper       10
Penny         9
Tucker        9
Winston       8
the           8
Lola          8
Sadie         8
Daisy         7
Toby          7
Koda          6
Bella         6
an            6
Jax           6
Bo            6
Bailey        6
Oscar         6
Stanley       6
Chester       5
Bentley       5
Milo          5
Dave          5
Rusty         5
Louis         5
Scout         5
Leo           5
           ... 
Ralpher       1
Cheesy        1
Duchess       1
Brutus        1
Beya          1
Jimbo         1
Ralphé        1
Charl         1
Leonard       1
Obie          1
Jennifur      1
Pinot         1
Chelsea       1
Arlen         1
Brooks        1
Maxwell       1
Diogi         1
Emma          1
Buckley       1
Barclay       1
Jeb           1
Storkson      1
Sunshine      1
Erik          1
Marq          1
Aqua          1
BeBe          1
Robin         1
Teddy         1
Jerome        1
Name: name, Length: 955, dtype: int64
In [435]:
# use regex where names start with lower-case letters 
mask = clean_archive.name.str.contains(pat='^[a-z]', regex = True)  
clean_archive[mask].name.value_counts().sort_index()
Out[435]:
a               55
actually         2
all              1
an               6
by               1
getting          2
his              1
incredibly       1
infuriating      1
just             3
life             1
light            1
mad              1
my               1
not              2
officially       1
old              1
one              4
quite            3
space            1
such             1
the              8
this             1
unacceptable     1
very             4
Name: name, dtype: int64
In [436]:
# replace non-names in the name column with string 'None'
clean_archive.loc[mask, 'name'] = "None"

Test¶

In [437]:
sum(clean_archive.name.str.contains(pat='^[a-z]', regex = True))
Out[437]:
0

Tidiness Issue 1 : (Twitter Archive) data contains 4 columns for dog stages¶

Define¶

Create one column for dog stages and have them (doggo,floofer,pupper,puppo) as values instead.

In [438]:
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null float64
rating_denominator    2097 non-null float64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), float64(2), object(9)
memory usage: 293.0+ KB
In [439]:
clean_archive.doggo.value_counts()
Out[439]:
None     2014
doggo      83
Name: doggo, dtype: int64
In [440]:
clean_archive.floofer.value_counts()
Out[440]:
None       2087
floofer      10
Name: floofer, dtype: int64
In [441]:
clean_archive.pupper.value_counts()
Out[441]:
None      1867
pupper     230
Name: pupper, dtype: int64
In [442]:
clean_archive.puppo.value_counts()
Out[442]:
None     2073
puppo      24
Name: puppo, dtype: int64

Code¶

In [443]:
# convert 'None' placeholders to empty strings
# assign the result back, since in-place replace on a selected column may not update the frame in newer pandas
for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    clean_archive[col] = clean_archive[col].replace('None', '')
In [444]:
# create a single stage column by extracting the first stage word found in the tweet text
clean_archive['stage'] = clean_archive.text.str.extract('(doggo|floofer|pupper|puppo)', expand = True)
In [445]:
# check 
clean_archive.stage.value_counts()
Out[445]:
pupper     240
doggo       80
puppo       29
floofer      4
Name: stage, dtype: int64
In [447]:
#double check
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 13 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null float64
rating_denominator    2097 non-null float64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
stage                 353 non-null object
dtypes: datetime64[ns](1), float64(2), object(10)
memory usage: 309.4+ KB
In [448]:
# now drop the 4 columns as they are no longer needed 
clean_archive.drop(['doggo','floofer','pupper','puppo'], axis=1, inplace = True)
In [449]:
# change datatype from object to category
clean_archive['stage'] = clean_archive['stage'].astype('category')

Test¶

In [450]:
clean_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 9 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null float64
rating_denominator    2097 non-null float64
name                  2097 non-null object
stage                 353 non-null category
dtypes: category(1), datetime64[ns](1), float64(2), object(5)
memory usage: 229.7+ KB

Quality Issue 9 : (Image Predictions) Data contains 66 duplicated jpg_urls¶

Define¶

Dataframe contains 66 duplicated jpg_urls which need to be removed.

Code¶

In [461]:
clean_images = clean_images.drop_duplicates(subset=['jpg_url'], keep='last')

Test¶

In [462]:
sum(clean_images.jpg_url.duplicated())
Out[462]:
0
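The effect of `keep='last'` is easy to verify on a toy frame. The sketch below uses invented URLs and IDs (not the real image-prediction data) to show that, for each repeated `jpg_url`, only the final occurrence survives.

```python
import pandas as pd

# Hypothetical rows: the first and third share a jpg_url
toy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
})

# Number of rows flagged as repeats of an earlier jpg_url
n_dupes = toy["jpg_url"].duplicated().sum()

# Keep the last occurrence of each jpg_url
deduped = toy.drop_duplicates(subset=["jpg_url"], keep="last")

print(n_dupes)
print(deduped)
```

Which copy is kept matters whenever the duplicated rows differ in other columns, such as `tweet_id`.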

Quality Issue 10 : (Image Predictions) tweet_id column needs to be in string format¶

Define¶

The tweet_id column data type is integer. Change it to string data type to keep it consistent with other data frames.

Code¶

In [463]:
clean_images['tweet_id'] = clean_images['tweet_id'].apply(str)

Test¶

In [464]:
clean_images.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2009 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2009 non-null object
jpg_url     2009 non-null object
img_num     2009 non-null int64
p1          2009 non-null object
p1_conf     2009 non-null float64
p1_dog      2009 non-null bool
p2          2009 non-null object
p2_conf     2009 non-null float64
p2_dog      2009 non-null bool
p3          2009 non-null object
p3_conf     2009 non-null float64
p3_dog      2009 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 162.8+ KB
In [465]:
type(clean_images.tweet_id[0])
Out[465]:
str

Tidiness Issue 2 : (Image Predictions) Needs one column each for image prediction and confidence level¶

Define¶

Combine the predictions into a single breed column and a single confidence column, rather than spreading them across three sets of columns (p1, p2, p3).

Code¶

In [466]:
clean_images
Out[466]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
5 666050758794694657 https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg 1 Bernese_mountain_dog 0.651137 True English_springer 0.263788 True Greater_Swiss_Mountain_dog 0.016199 True
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
7 666055525042405380 https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg 1 chow 0.692517 True Tibetan_mastiff 0.058279 True fur_coat 0.054449 False
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg 1 shopping_cart 0.962465 False shopping_basket 0.014594 False golden_retriever 0.007959 True
9 666058600524156928 https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg 1 miniature_poodle 0.201493 True komondor 0.192305 True soft-coated_wheaten_terrier 0.082086 True
10 666063827256086533 https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg 1 golden_retriever 0.775930 True Tibetan_mastiff 0.093718 True Labrador_retriever 0.072427 True
11 666071193221509120 https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg 1 Gordon_setter 0.503672 True Yorkshire_terrier 0.174201 True Pekinese 0.109454 True
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg 1 Walker_hound 0.260857 True English_foxhound 0.175382 True Ibizan_hound 0.097471 True
13 666082916733198337 https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg 1 pug 0.489814 True bull_mastiff 0.404722 True French_bulldog 0.048960 True
14 666094000022159362 https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg 1 bloodhound 0.195217 True German_shepherd 0.078260 True malinois 0.075628 True
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg 1 Lhasa 0.582330 True Shih-Tzu 0.166192 True Dandie_Dinmont 0.089688 True
16 666102155909144576 https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg 1 English_setter 0.298617 True Newfoundland 0.149842 True borzoi 0.133649 True
17 666104133288665088 https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg 1 hen 0.965932 False cock 0.033919 False partridge 0.000052 False
18 666268910803644416 https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg 1 desktop_computer 0.086502 False desk 0.085547 False bookcase 0.079480 False
19 666273097616637952 https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg 1 Italian_greyhound 0.176053 True toy_terrier 0.111884 True basenji 0.111152 True
20 666287406224695296 https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg 1 Maltese_dog 0.857531 True toy_poodle 0.063064 True miniature_poodle 0.025581 True
21 666293911632134144 https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg 1 three-toed_sloth 0.914671 False otter 0.015250 False great_grey_owl 0.013207 False
22 666337882303524864 https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg 1 ox 0.416669 False Newfoundland 0.278407 True groenendael 0.102643 True
23 666345417576210432 https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg 1 golden_retriever 0.858744 True Chesapeake_Bay_retriever 0.054787 True Labrador_retriever 0.014241 True
24 666353288456101888 https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg 1 malamute 0.336874 True Siberian_husky 0.147655 True Eskimo_dog 0.093412 True
25 666362758909284353 https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg 1 guinea_pig 0.996496 False skunk 0.002402 False hamster 0.000461 False
26 666373753744588802 https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg 1 soft-coated_wheaten_terrier 0.326467 True Afghan_hound 0.259551 True briard 0.206803 True
27 666396247373291520 https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg 1 Chihuahua 0.978108 True toy_terrier 0.009397 True papillon 0.004577 True
28 666407126856765440 https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg 1 black-and-tan_coonhound 0.529139 True bloodhound 0.244220 True flat-coated_retriever 0.173810 True
29 666411507551481857 https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg 1 coho 0.404640 False barracouta 0.271485 False gar 0.189945 False
... ... ... ... ... ... ... ... ... ... ... ... ...
2044 886258384151887873 https://pbs.twimg.com/media/DEyfTG4UMAE4aE9.jpg 1 pug 0.943575 True shower_cap 0.025286 False Siamese_cat 0.002849 False
2045 886366144734445568 https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg 1 French_bulldog 0.999201 True Chihuahua 0.000361 True Boston_bull 0.000076 True
2046 886680336477933568 https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg 1 convertible 0.738995 False sports_car 0.139952 False car_wheel 0.044173 False
2047 886736880519319552 https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg 1 kuvasz 0.309706 True Great_Pyrenees 0.186136 True Dandie_Dinmont 0.086346 True
2048 886983233522544640 https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg 2 Chihuahua 0.793469 True toy_terrier 0.143528 True can_opener 0.032253 False
2049 887101392804085760 https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg 1 Samoyed 0.733942 True Eskimo_dog 0.035029 True Staffordshire_bullterrier 0.029705 True
2050 887343217045368832 https://pbs.twimg.com/ext_tw_video_thumb/88734... 1 Mexican_hairless 0.330741 True sea_lion 0.275645 False Weimaraner 0.134203 True
2052 887517139158093824 https://pbs.twimg.com/ext_tw_video_thumb/88751... 1 limousine 0.130432 False tow_truck 0.029175 False shopping_cart 0.026321 False
2053 887705289381826560 https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg 1 basset 0.821664 True redbone 0.087582 True Weimaraner 0.026236 True
2054 888078434458587136 https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg 1 French_bulldog 0.995026 True pug 0.000932 True bull_mastiff 0.000903 True
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2056 888554962724278272 https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg 3 Siberian_husky 0.700377 True Eskimo_dog 0.166511 True malamute 0.111411 True
2057 888804989199671297 https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg 1 golden_retriever 0.469760 True Labrador_retriever 0.184172 True English_setter 0.073482 True
2058 888917238123831296 https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg 1 golden_retriever 0.714719 True Tibetan_mastiff 0.120184 True Labrador_retriever 0.105506 True
2059 889278841981685760 https://pbs.twimg.com/ext_tw_video_thumb/88927... 1 whippet 0.626152 True borzoi 0.194742 True Saluki 0.027351 True
2060 889531135344209921 https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg 1 golden_retriever 0.953442 True Labrador_retriever 0.013834 True redbone 0.007958 True
2061 889638837579907072 https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg 1 French_bulldog 0.991650 True boxer 0.002129 True Staffordshire_bullterrier 0.001498 True
2062 889665388333682689 https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg 1 Pembroke 0.966327 True Cardigan 0.027356 True basenji 0.004633 True
2063 889880896479866881 https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg 1 French_bulldog 0.377417 True Labrador_retriever 0.151317 True muzzle 0.082981 False
2064 890006608113172480 https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg 1 Samoyed 0.957979 True Pomeranian 0.013884 True chow 0.008167 True
2065 890240255349198849 https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg 1 Pembroke 0.511319 True Cardigan 0.451038 True Chihuahua 0.029248 True
2066 890609185150312448 https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg 1 Irish_terrier 0.487574 True Irish_setter 0.193054 True Chesapeake_Bay_retriever 0.118184 True
2067 890729181411237888 https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg 2 Pomeranian 0.566142 True Eskimo_dog 0.178406 True Pembroke 0.076507 True
2068 890971913173991426 https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg 1 Appenzeller 0.341703 True Border_collie 0.199287 True ice_lolly 0.193548 False
2069 891087950875897856 https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg 1 Chesapeake_Bay_retriever 0.425595 True Irish_terrier 0.116317 True Indian_elephant 0.076902 False
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg 2 basset 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg 1 paper_towel 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False

2009 rows × 12 columns

In [467]:
clean_images.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2009 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2009 non-null object
jpg_url     2009 non-null object
img_num     2009 non-null int64
p1          2009 non-null object
p1_conf     2009 non-null float64
p1_dog      2009 non-null bool
p2          2009 non-null object
p2_conf     2009 non-null float64
p2_dog      2009 non-null bool
p3          2009 non-null object
p3_conf     2009 non-null float64
p3_dog      2009 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 242.8+ KB
In [468]:
clean_images.p1.value_counts()
Out[468]:
golden_retriever             139
Labrador_retriever            95
Pembroke                      88
Chihuahua                     79
pug                           55
chow                          41
Samoyed                       40
Pomeranian                    38
toy_poodle                    38
malamute                      29
cocker_spaniel                27
French_bulldog                26
Chesapeake_Bay_retriever      23
miniature_pinscher            22
seat_belt                     21
Siberian_husky                20
German_shepherd               20
web_site                      19
Staffordshire_bullterrier     19
Shetland_sheepdog             18
Maltese_dog                   18
teddy                         18
Eskimo_dog                    18
Rottweiler                    17
beagle                        17
Shih-Tzu                      17
Cardigan                      17
kuvasz                        16
Lakeland_terrier              16
Italian_greyhound             16
                            ... 
long-horned_beetle             1
bighorn                        1
limousine                      1
park_bench                     1
sunglasses                     1
robin                          1
lynx                           1
soccer_ball                    1
water_bottle                   1
grey_fox                       1
carousel                       1
flamingo                       1
swab                           1
cup                            1
cheeseburger                   1
snowmobile                     1
alp                            1
remote_control                 1
lacewing                       1
hummingbird                    1
bee_eater                      1
loupe                          1
coral_reef                     1
tricycle                       1
barbell                        1
timber_wolf                    1
sundial                        1
handkerchief                   1
hotdog                         1
Madagascar_cat                 1
Name: p1, Length: 378, dtype: int64
In [469]:
clean_images.p2.value_counts()
Out[469]:
Labrador_retriever                96
golden_retriever                  83
Cardigan                          72
Chihuahua                         43
Chesapeake_Bay_retriever          40
French_bulldog                    40
Pomeranian                        39
toy_poodle                        36
Siberian_husky                    33
cocker_spaniel                    32
miniature_poodle                  32
beagle                            28
Eskimo_dog                        27
collie                            26
kuvasz                            25
Pembroke                          25
Italian_greyhound                 22
Pekinese                          21
miniature_pinscher                20
American_Staffordshire_terrier    20
chow                              20
malinois                          20
Boston_bull                       19
toy_terrier                       19
Samoyed                           19
Norwegian_elkhound                19
Staffordshire_bullterrier         18
pug                               17
Shih-Tzu                          16
Irish_terrier                     16
                                  ..
shower_curtain                     1
dock                               1
desk                               1
cradle                             1
hamper                             1
snowmobile                         1
house_finch                        1
accordion                          1
birdhouse                          1
chimpanzee                         1
snail                              1
lighter                            1
bib                                1
dumbbell                           1
wallaby                            1
can_opener                         1
porcupine                          1
triceratops                        1
promontory                         1
crate                              1
spotted_salamander                 1
tree_frog                          1
breakwater                         1
common_newt                        1
iPod                               1
tarantula                          1
grey_whale                         1
mosquito_net                       1
menu                               1
pelican                            1
Name: p2, Length: 405, dtype: int64
In [470]:
clean_images.p3.value_counts()
Out[470]:
Labrador_retriever                76
Chihuahua                         57
golden_retriever                  46
Eskimo_dog                        36
kelpie                            33
chow                              31
kuvasz                            31
Staffordshire_bullterrier         30
beagle                            30
Pekinese                          29
toy_poodle                        29
cocker_spaniel                    29
Pomeranian                        27
Pembroke                          26
Great_Pyrenees                    26
malamute                          25
French_bulldog                    25
Chesapeake_Bay_retriever          24
Cardigan                          23
American_Staffordshire_terrier    23
pug                               21
bull_mastiff                      20
basenji                           20
Siberian_husky                    19
toy_terrier                       18
Boston_bull                       17
Shetland_sheepdog                 17
boxer                             16
Lakeland_terrier                  16
dingo                             15
                                  ..
mushroom                           1
swimming_trunks                    1
pop_bottle                         1
meerkat                            1
shovel                             1
golfcart                           1
barbell                            1
chimpanzee                         1
snail                              1
notebook                           1
ox                                 1
bib                                1
vacuum                             1
Indian_elephant                    1
lion                               1
acorn                              1
chain                              1
wallet                             1
brown_bear                         1
crayfish                           1
drumstick                          1
buckeye                            1
traffic_light                      1
desktop_computer                   1
greenhouse                         1
bow_tie                            1
wok                                1
plunger                            1
common_newt                        1
red_wolf                           1
Name: p3, Length: 408, dtype: int64
In [471]:
clean_images.query("p1_conf < p2_conf or p2_conf < p3_conf or p1_conf < p3_conf")
Out[471]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
  • Notice that the query returns no rows. This means the confidences are always in decreasing order ( p1 -> p2 -> p3 ), so p1 holds the neural network's most confident prediction. The function below therefore checks the predictions in that order and keeps the first one that is a dog breed, together with its confidence.
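The empty result can be read as a sortedness check. A minimal sketch on invented confidence values (the `toy_conf` frame is hypothetical) shows the same query used as a boolean test.

```python
import pandas as pd

# Hypothetical confidences, already in decreasing order per row
toy_conf = pd.DataFrame({
    "p1_conf": [0.90, 0.50],
    "p2_conf": [0.05, 0.30],
    "p3_conf": [0.01, 0.10],
})

# Any row returned here would violate p1_conf >= p2_conf >= p3_conf
out_of_order = toy_conf.query(
    "p1_conf < p2_conf or p2_conf < p3_conf or p1_conf < p3_conf"
)

is_sorted = out_of_order.empty
print(is_sorted)
```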
In [472]:
# get dog breed and its respective confidence 

breed = []
confidence = []


def image(clean_images):
    if clean_images['p1_dog'] == True:
        breed.append(clean_images['p1'])
        confidence.append(clean_images['p1_conf'])
    elif clean_images['p2_dog'] == True:
        breed.append(clean_images['p2'])
        confidence.append(clean_images['p2_conf'])
    elif clean_images['p3_dog'] == True:
        breed.append(clean_images['p3'])
        confidence.append(clean_images['p3_conf'])
    else:  # none of the three predictions is a dog breed, so record "None"
        breed.append('None')
        confidence.append('None')

clean_images.apply(image, axis=1)

#create new columns
clean_images['breed'] = breed
clean_images['confidence'] = confidence
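The same first-dog-prediction rule can also be written without appending to global lists. The following is a sketch of one alternative, not the approach used in this notebook: it chains `Series.where` on a small invented frame (`preds` and the helper `first_dog` are hypothetical names) and stores missing values as `NaN`, so the `confidence` column stays numeric instead of becoming `object`.

```python
import pandas as pd

# Hypothetical predictions for three images
preds = pd.DataFrame({
    "p1": ["pug", "box_turtle", "hen"],
    "p1_conf": [0.94, 0.93, 0.97],
    "p1_dog": [True, False, False],
    "p2": ["shower_cap", "chow", "cock"],
    "p2_conf": [0.03, 0.05, 0.03],
    "p2_dog": [False, True, False],
    "p3": ["Siamese_cat", "terrapin", "partridge"],
    "p3_conf": [0.01, 0.02, 0.0001],
    "p3_dog": [False, False, False],
})


def first_dog(df, suffix=""):
    """Return the p1/p2/p3 value of the first prediction flagged as a dog."""
    p3 = df["p3" + suffix].where(df["p3_dog"])        # NaN if p3 is not a dog
    p2 = df["p2" + suffix].where(df["p2_dog"], p3)    # otherwise fall back to p3
    return df["p1" + suffix].where(df["p1_dog"], p2)  # otherwise fall back to p2


preds["breed"] = first_dog(preds)
preds["confidence"] = first_dog(preds, "_conf")

# Images with no dog prediction have NaN and can be dropped directly
dogs_only = preds.dropna(subset=["breed"])
print(dogs_only[["breed", "confidence"]])
```

With this variant, the later "remove rows where breed is None" step would become a `dropna` call rather than a string comparison.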
In [473]:
clean_images.breed.value_counts()
Out[473]:
None                              318
golden_retriever                  158
Labrador_retriever                108
Pembroke                           95
Chihuahua                          91
pug                                63
toy_poodle                         51
chow                               48
Samoyed                            42
Pomeranian                         42
malamute                           33
Chesapeake_Bay_retriever           31
French_bulldog                     31
cocker_spaniel                     30
miniature_pinscher                 25
Eskimo_dog                         22
German_shepherd                    21
Staffordshire_bullterrier          21
Cardigan                           21
Shih-Tzu                           20
Siberian_husky                     20
beagle                             20
Shetland_sheepdog                  19
Maltese_dog                        19
Rottweiler                         19
kuvasz                             19
Lakeland_terrier                   18
basset                             17
Italian_greyhound                  17
American_Staffordshire_terrier     16
                                 ... 
Welsh_springer_spaniel              4
Scottish_deerhound                  4
Tibetan_terrier                     4
keeshond                            4
Weimaraner                          4
Gordon_setter                       4
Greater_Swiss_Mountain_dog          3
toy_terrier                         3
Leonberg                            3
briard                              3
Irish_water_spaniel                 3
curly-coated_retriever              3
komondor                            3
cairn                               3
Brabancon_griffon                   3
Afghan_hound                        3
Appenzeller                         2
Australian_terrier                  2
Sussex_spaniel                      2
groenendael                         2
wire-haired_fox_terrier             2
black-and-tan_coonhound             2
clumber                             1
Japanese_spaniel                    1
Irish_wolfhound                     1
Scotch_terrier                      1
silky_terrier                       1
standard_schnauzer                  1
Bouvier_des_Flandres                1
EntleBucher                         1
Name: breed, Length: 114, dtype: int64
In [474]:
# Remove rows where breed is "None" (doesn't align with our project motivation)
clean_images = clean_images[clean_images['breed'] != 'None']
In [475]:
# Now, drop these columns as they're no longer needed 
clean_images = clean_images.drop(['img_num','p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf',
                                  'p3_dog'], axis=1)

Test¶

In [476]:
clean_images
Out[476]:
tweet_id jpg_url breed confidence
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg Welsh_springer_spaniel 0.465074
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg redbone 0.506826
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg German_shepherd 0.596461
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg Rhodesian_ridgeback 0.408143
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg miniature_pinscher 0.560311
5 666050758794694657 https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg Bernese_mountain_dog 0.651137
7 666055525042405380 https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg chow 0.692517
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg golden_retriever 0.00795896
9 666058600524156928 https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg miniature_poodle 0.201493
10 666063827256086533 https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg golden_retriever 0.77593
11 666071193221509120 https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg Gordon_setter 0.503672
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg Walker_hound 0.260857
13 666082916733198337 https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg pug 0.489814
14 666094000022159362 https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg bloodhound 0.195217
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg Lhasa 0.58233
16 666102155909144576 https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg English_setter 0.298617
19 666273097616637952 https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg Italian_greyhound 0.176053
20 666287406224695296 https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg Maltese_dog 0.857531
22 666337882303524864 https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg Newfoundland 0.278407
23 666345417576210432 https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg golden_retriever 0.858744
24 666353288456101888 https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg malamute 0.336874
26 666373753744588802 https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg soft-coated_wheaten_terrier 0.326467
27 666396247373291520 https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg Chihuahua 0.978108
28 666407126856765440 https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg black-and-tan_coonhound 0.529139
30 666418789513326592 https://pbs.twimg.com/media/CT-YWb7U8AA7QnN.jpg toy_terrier 0.14968
31 666421158376562688 https://pbs.twimg.com/media/CT-aggCXAAIMfT3.jpg Blenheim_spaniel 0.906777
32 666428276349472768 https://pbs.twimg.com/media/CT-g-0DUwAEQdSn.jpg Pembroke 0.371361
33 666430724426358785 https://pbs.twimg.com/media/CT-jNYqW4AAPi2M.jpg Irish_terrier 0.104109
34 666435652385423360 https://pbs.twimg.com/media/CT-nsTQWEAEkyDn.jpg Chesapeake_Bay_retriever 0.18413
35 666437273139982337 https://pbs.twimg.com/media/CT-pKmRWIAAxUWj.jpg Chihuahua 0.671853
... ... ... ... ...
2041 885311592912609280 https://pbs.twimg.com/media/C4bTH6nWMAAX_bJ.jpg Labrador_retriever 0.908703
2042 885528943205470208 https://pbs.twimg.com/media/DEoH3yvXgAAzQtS.jpg pug 0.369275
2043 885984800019947520 https://pbs.twimg.com/media/DEumeWWV0AA-Z61.jpg Blenheim_spaniel 0.972494
2044 886258384151887873 https://pbs.twimg.com/media/DEyfTG4UMAE4aE9.jpg pug 0.943575
2045 886366144734445568 https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg French_bulldog 0.999201
2047 886736880519319552 https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg kuvasz 0.309706
2048 886983233522544640 https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg Chihuahua 0.793469
2049 887101392804085760 https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg Samoyed 0.733942
2050 887343217045368832 https://pbs.twimg.com/ext_tw_video_thumb/88734... Mexican_hairless 0.330741
2053 887705289381826560 https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg basset 0.821664
2054 888078434458587136 https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg French_bulldog 0.995026
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg Pembroke 0.809197
2056 888554962724278272 https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg Siberian_husky 0.700377
2057 888804989199671297 https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg golden_retriever 0.46976
2058 888917238123831296 https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg golden_retriever 0.714719
2059 889278841981685760 https://pbs.twimg.com/ext_tw_video_thumb/88927... whippet 0.626152
2060 889531135344209921 https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg golden_retriever 0.953442
2061 889638837579907072 https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg French_bulldog 0.99165
2062 889665388333682689 https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg Pembroke 0.966327
2063 889880896479866881 https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg French_bulldog 0.377417
2064 890006608113172480 https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg Samoyed 0.957979
2065 890240255349198849 https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg Pembroke 0.511319
2066 890609185150312448 https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg Irish_terrier 0.487574
2067 890729181411237888 https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg Pomeranian 0.566142
2068 890971913173991426 https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg Appenzeller 0.341703
2069 891087950875897856 https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg Chesapeake_Bay_retriever 0.425595
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg basset 0.555712
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg Labrador_retriever 0.168086
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg Chihuahua 0.716012
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg Chihuahua 0.323581

1691 rows × 4 columns

Quality Issue 11 : (Twitter API Data) Dataframe contains retweets¶

Define¶

Remove retweets to keep only the originals and to keep it consistent with the other dataframes.

Code¶

In [477]:
clean_api.retweeted_status.value_counts()
Out[477]:
original    2168
retweet      163
Name: retweeted_status, dtype: int64
In [478]:
clean_api = clean_api.query("retweeted_status=='original'")

Test¶

In [479]:
clean_api.retweeted_status.value_counts()
Out[479]:
original    2168
Name: retweeted_status, dtype: int64

Tidiness Issue 3 : (Twitter API Data) Dataframe needs to be merged with the other two dataframes¶

Define¶

The Twitter API dataframe should be merged with the other dataframes because they describe the same entities (tweets). Merge the dataframes into one finalized dataframe called twitter_archive_master.

Code¶

In [480]:
# merge the first two dataframes we explored on the tweet_id column
archive_and_images = pd.merge(clean_archive, 
                      clean_images, 
                      how = 'left', on = ['tweet_id'])
In [482]:
archive_and_images = archive_and_images[archive_and_images['jpg_url'].notnull()]
In [483]:
archive_and_images.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1606 entries, 1 to 2096
Data columns (total 12 columns):
tweet_id              1606 non-null object
timestamp             1606 non-null datetime64[ns]
source                1606 non-null object
text                  1606 non-null object
expanded_urls         1606 non-null object
rating_numerator      1606 non-null float64
rating_denominator    1606 non-null float64
name                  1606 non-null object
stage                 262 non-null category
jpg_url               1606 non-null object
breed                 1606 non-null object
confidence            1606 non-null object
dtypes: category(1), datetime64[ns](1), float64(2), object(8)
memory usage: 152.3+ KB
In [484]:
twitter_archive_master = pd.merge(archive_and_images, 
                      clean_api, 
                      how = 'left', on = ['tweet_id'])
In [485]:
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1606 entries, 0 to 1605
Data columns (total 18 columns):
tweet_id              1606 non-null object
timestamp             1606 non-null datetime64[ns]
source                1606 non-null object
text                  1606 non-null object
expanded_urls         1606 non-null object
rating_numerator      1606 non-null float64
rating_denominator    1606 non-null float64
name                  1606 non-null object
stage                 262 non-null category
jpg_url               1606 non-null object
breed                 1606 non-null object
confidence            1606 non-null object
favorite_count        1600 non-null float64
retweet_count         1600 non-null float64
followers_count       1600 non-null float64
friends_count         1600 non-null float64
url                   1600 non-null object
retweeted_status      1600 non-null object
dtypes: category(1), datetime64[ns](1), float64(6), object(10)
memory usage: 227.6+ KB
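A left merge followed by a `notnull` filter on a right-hand column gives much the same result as an inner merge, and `indicator=True` can show which rows lack a match. The sketch below uses invented IDs; the frames `left` and `right` are hypothetical stand-ins for `clean_archive` and `clean_images`.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned frames
left = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_numerator": [12.0, 13.0, 10.0],
})
right = pd.DataFrame({
    "tweet_id": ["1", "3", "4"],
    "breed": ["pug", "chow", "beagle"],
})

# Left merge, then keep only rows that found a match
left_then_filter = pd.merge(left, right, how="left", on="tweet_id")
left_then_filter = left_then_filter[left_then_filter["breed"].notnull()]

# Inner merge keeps the matching rows in one step
inner = pd.merge(left, right, how="inner", on="tweet_id")

# indicator=True adds a _merge column saying where each row came from
flagged = pd.merge(left, right, how="left", on="tweet_id", indicator=True)
unmatched = flagged.loc[flagged["_merge"] == "left_only", "tweet_id"].tolist()
print(unmatched)
```

In the merged frame above, the six rows with a null `favorite_count` are tweets with no match in the API dataframe; an indicator column would make them easy to list.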

Storing¶

Now that the three dataframes are merged into one, save the result to a CSV file for later use.

In [486]:
# store the frame as a file
twitter_archive_master.to_csv('twitter_archive_master.csv', 
                 index=False, encoding = 'utf-8')
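One thing to watch when the file is read back: `tweet_id` was converted to string during cleaning, but `read_csv` will infer it as an integer again unless told otherwise. A minimal sketch, using an in-memory buffer in place of the real file and placeholder timestamps that are not taken from the data:

```python
import io

import pandas as pd

# Small stand-in frame; the timestamps are placeholders, not real values
master = pd.DataFrame({
    "tweet_id": ["666020888022790149", "666029285002620928"],
    "timestamp": pd.to_datetime(["2017-01-01 00:00:00", "2017-01-02 00:00:00"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
master.to_csv(buffer, index=False)
buffer.seek(0)

# dtype and parse_dates restore the types chosen during cleaning
restored = pd.read_csv(buffer, dtype={"tweet_id": str}, parse_dates=["timestamp"])
print(restored.dtypes)
```

The same `dtype` and `parse_dates` arguments would apply when reading `twitter_archive_master.csv` itself.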

Analyzing and Visualizing Data¶

Insight 1 : Which breeds have the most ratings on the We Rate Dogs Twitter account?¶

In order to see which breeds have the most ratings, I will make a new dataframe that contains the breeds with the highest number of tweets assigned.

In [496]:
# select breeds with 50 or more tweets 
most_tweeted = twitter_archive_master.groupby('breed').filter(lambda x: len(x) >= 50)

# bar chart
most_tweeted['breed'].value_counts().plot(kind = 'bar')
plt.title("We Rate Dogs's Most Rated Breeds")
plt.xlabel('Breed')
plt.ylabel('Tweets Count')

fig = plt.gcf() 
fig.set_size_inches(15,7)
/opt/conda/lib/python3.6/site-packages/matplotlib/font_manager.py:1316: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Judging by tweet count, the most rated breeds on the "We Rate Dogs" Twitter account are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug.

In [497]:
# save the figure as file 
fig.savefig('most_rated_breeds.png',bbox_inches='tight');

Insight 2 : Which dog stage is the most represented?¶

To show which dog stage is the most represented, I will display the stage frequencies in a pie chart.

In [498]:
twitter_archive_master.stage.value_counts()
Out[498]:
pupper     176
doggo       57
puppo       26
floofer      3
Name: stage, dtype: int64
In [499]:
# create pie chart with explode property
# https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f

labels = ['pupper', 'doggo', 'puppo', 'floofer']
sizes = [176, 57, 26, 3]

# colors
colors = ['magenta','lightgreen','yellow','orange']
plt.rcParams['text.color'] = 'black'

# explosion
explode = (0.09,0.09,0.09,0.09)
 
plt.pie(sizes, colors = colors, labels=labels, autopct='%1.0f%%', startangle=90, pctdistance=0.85, 
         explode = explode)

# draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# plot 
fig.gca().axis('equal')  # to make sure pie is drawn as a circle
plt.tight_layout()
plt.show()

Looking at the number of tweets per dog stage, we see that "pupper" is the most represented at 67%. In contrast, "floofer" accounts for only 1%, which is significantly lower than the others. Since only 262 of the 1606 tweets have a stage assigned at all, this suggests there may be data extraction issues.
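
The pie chart above hardcodes its labels and sizes. A minimal sketch of deriving them, along with the percentages, from the counts themselves; the Series is rebuilt here from the value_counts() output shown earlier, and the total of 1606 tweets is taken from the info() output:

```python
import pandas as pd

# Stage counts copied from the value_counts() output above
stage_counts = pd.Series({'pupper': 176, 'doggo': 57, 'puppo': 26, 'floofer': 3})

# Labels and sizes for plt.pie can come straight from the Series
labels = stage_counts.index.tolist()
sizes = stage_counts.tolist()

# Share of each stage among tweets that have a stage, in percent
shares = (stage_counts / stage_counts.sum() * 100).round(1)
print(shares)

# Fraction of all tweets in the master frame that have any stage assigned
total_tweets = 1606
coverage = stage_counts.sum() / total_tweets
print(round(coverage * 100, 1))
```

In the notebook itself, `twitter_archive_master.stage.value_counts()` could replace the hand-built Series, which keeps the chart in sync if the cleaning steps change.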

Insight 3 : Which breeds are most favorited by users?¶

To see which breeds are most favorited by users, I will group the tweets by breed and sum their favorite counts.

In [509]:
user_fav_breed = twitter_archive_master.query('breed != "none"').groupby(['breed'])['favorite_count'].sum().sort_values(ascending = False)
In [504]:
# select top 5 only 
top5 = user_fav_breed.head(5)

# bar chart 
top5.plot(kind = 'bar')
plt.title('Most Favorited Breeds by Users')
plt.xlabel('Breed')
plt.ylabel('Favorite Count')


fig = plt.gcf() 
fig.set_size_inches(15,7)
In [505]:
# save figure as file
fig.savefig('most_fav_breeds.png',bbox_inches='tight');

As expected, the breed with the highest favorite count is Golden Retriever, followed by Labrador Retriever, Pembroke, and Chihuahua. I expected to see Pug in fifth place, but French Bulldog took that spot instead.
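
Because the ranking above sums favorite counts, breeds with many tweets tend to rise to the top. A minimal sketch, on hypothetical toy numbers rather than the notebook's data, of how the sum and the per-tweet mean can rank breeds differently:

```python
import pandas as pd

# Hypothetical toy data: one breed tweeted often, another tweeted once
toy = pd.DataFrame({
    'breed': ['golden_retriever'] * 3 + ['french_bulldog'],
    'favorite_count': [100, 100, 100, 250],
})

# Total favorites, average favorites per tweet, and tweet count per breed
summary = toy.groupby('breed')['favorite_count'].agg(['sum', 'mean', 'count'])
print(summary)

# The breed leading by total need not lead by average
print(summary['sum'].idxmax(), summary['mean'].idxmax())
```

Applying the same `agg` call to twitter_archive_master would show whether the top five change when tweet volume is factored out.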

                         End of Wrangle Act. Please see Wrangle Report.