{ "cells": [ { "cell_type": "markdown", "id": "47ccbb07-d903-46ed-b092-421c966161b7", "metadata": {}, "source": [ "
\n", "
\n", "\n", "
\n", "
\n", "

May 2022

\n", "

ML Depression

\n", "

Lucía Prieto Santamaría

\n", "
\n", "
\n", "
 
" ] }, { "cell_type": "markdown", "id": "004ca5ae-35ea-48dc-915b-9d09e5577150", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "f490dfd0-a3b3-4932-9181-61baebb8cc72", "metadata": {}, "source": [ "# Generating different datasets to later apply ML techniques" ] }, { "cell_type": "markdown", "id": "ec41eaf7-4fc9-4dba-841b-d131bbf48fcd", "metadata": {}, "source": [ "In this notebook, we generate derived datasets from the one studied in [Detecting Signs of Depression in Tweets in Spanish: Behavioral and Linguistic Analysis](https://www.jmir.org/2019/6/e14199/)." ] }, { "cell_type": "markdown", "id": "9cdb75be-1e15-470d-96d2-d15c3b688deb", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "34c4f9e6-851d-4123-89ec-014d0edc93d6", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "bb131503-0e3a-4a5d-b17a-741b97626f5c", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "58966c3f-2eb5-4b3b-a9d2-1f44ed2032f0", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "4808e671-7dac-4ea2-baa5-014dbbadbca3", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "\n", "\n", "from sklearn.feature_selection import VarianceThreshold\n", "\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "from sklearn.decomposition import PCA" ] }, { "cell_type": "markdown", "id": "6b1bf4b5-5d80-4296-a2e4-b4e5b5d9fefe", "metadata": {}, "source": [ "## 1) **Without** RTs" ] }, { "cell_type": "code", "execution_count": 2, "id": "648069a5-4dd5-4c6f-813f-419f022a17cf", "metadata": {}, "outputs": [], "source": [ "df_no_RTs = pd.read_csv('FEATURESeets_v1_withoutRetweets.tsv', sep=\"\\t\")" ] }, { "cell_type": "code", "execution_count": 3, "id": 
"6c4dc5fa-48c2-4b52-8407-dbbf5c2a8bf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " USER_ID NUM_TWEETS AVG_NUM_CHARS_PER_TWEET__ALL_CHARS \\\n", "count 5.400000e+02 540.000000 540.000000 \n", "mean 1.396419e+17 1580.614815 81.419952 \n", "std 3.124629e+17 881.142621 24.037097 \n", "min 7.938852e+06 58.000000 32.597196 \n", "25% 2.781368e+08 799.750000 61.142403 \n", "50% 1.157004e+09 1543.500000 79.424834 \n", "75% 3.387760e+09 2319.500000 100.436993 \n", "max 9.913374e+17 3241.000000 134.773872 \n", "\n", " AVG_NUM_CHARS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "count 540.000000 \n", "mean 64.195774 \n", "std 15.377404 \n", "min 23.764094 \n", "25% 53.138014 \n", "50% 62.910774 \n", "75% 74.216563 \n", "max 113.798142 \n", "\n", " AVG_NUM_WORDS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "count 540.000000 \n", "mean 13.353337 \n", "std 3.007770 \n", "min 4.801314 \n", "25% 11.291692 \n", "50% 13.102517 \n", "75% 15.348541 \n", "max 25.296453 \n", "\n", " AVG_NUM_EMOJIS_PER_TWEET AVG_NUM_LINKS_PER_TWEET \\\n", "count 540.000000 540.000000 \n", "mean 0.251926 0.348442 \n", "std 0.491718 0.322666 \n", "min 0.000000 0.000000 \n", "25% 0.019542 0.112433 \n", "50% 0.100339 0.233133 \n", "75% 0.321133 0.476945 \n", "max 7.618306 1.902209 \n", "\n", " AVG_NUM_MENTIONS_PER_TWEET AVG_NUM_HASHTAGS_PER_TWEET \\\n", "count 540.000000 540.000000 \n", "mean 0.657166 0.657166 \n", "std 0.622231 0.622231 \n", "min 0.000000 0.000000 \n", "25% 0.166104 0.166104 \n", "50% 0.458580 0.458580 \n", "75% 0.998747 0.998747 \n", "max 3.217757 3.217757 \n", "\n", " NUM_TWEET_BY_DAY_HOUR__0 ... NUM_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "count 540.000000 ... 540.000000 \n", "mean 87.600000 ... 309.877778 \n", "std 77.147614 ... 229.874249 \n", "min 0.000000 ... 0.000000 \n", "25% 27.000000 ... 151.750000 \n", "50% 70.000000 ... 268.000000 \n", "75% 131.250000 ... 422.000000 \n", "max 500.000000 ... 
2960.000000 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "count 540.000000 \n", "mean 0.201189 \n", "std 0.076920 \n", "min 0.000000 \n", "25% 0.158554 \n", "50% 0.188979 \n", "75% 0.230022 \n", "max 0.921544 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "count 540.000000 \n", "mean 258.357407 \n", "std 224.481154 \n", "min 7.000000 \n", "25% 111.750000 \n", "50% 215.000000 \n", "75% 349.500000 \n", "max 3224.000000 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "count 540.000000 \n", "mean 0.163477 \n", "std 0.077013 \n", "min 0.002491 \n", "25% 0.121077 \n", "50% 0.152742 \n", "75% 0.191920 \n", "max 1.000000 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "count 540.000000 \n", "mean 517.396296 \n", "std 350.668133 \n", "min 29.000000 \n", "25% 259.000000 \n", "50% 467.000000 \n", "75% 706.500000 \n", "max 3224.000000 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "count 540.000000 \n", "mean 0.331440 \n", "std 0.092199 \n", "min 0.038643 \n", "25% 0.274223 \n", "50% 0.321596 \n", "75% 0.380165 \n", "max 1.000000 \n", "\n", " NUM_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "count 540.000000 \n", "mean 50.838889 \n", "std 41.692642 \n", "min 0.000000 \n", "25% 21.000000 \n", "50% 39.500000 \n", "75% 72.000000 \n", "max 314.000000 \n", "\n", " PERC_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "count 540.000000 \n", "mean 0.033226 \n", "std 0.019304 \n", "min 0.000000 \n", "25% 0.020114 \n", "50% 0.030881 \n", "75% 0.042982 \n", "max 0.178497 \n", "\n", " NUM_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \\\n", "count 540.000000 \n", "mean 466.557407 \n", "std 320.477911 \n", "min 29.000000 \n", "25% 233.000000 \n", "50% 418.000000 \n", "75% 631.500000 \n", "max 3224.000000 \n", "\n", " PERC_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \n", "count 540.000000 \n", "mean 0.298214 \n", "std 0.081710 \n", "min 0.036758 \n", "25% 0.249625 \n", "50% 0.291519 \n", "75% 0.337914 \n", "max 1.000000 \n", "\n", "[8 rows x 135 columns]" 
] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_no_RTs.describe()" ] }, { "cell_type": "code", "execution_count": 4, "id": "bfbdd2f5-fc5c-4327-bd3d-da2c29f53e0b", "metadata": {}, "outputs": [], "source": [ "# Use USER_ID as the row index so it is not treated as a feature\n", "df_no_RTs.set_index('USER_ID', inplace=True)" ] }, { "cell_type": "code", "execution_count": 5, "id": "51acfd68-596e-4dda-93d6-01819eb4e92c", "metadata": {}, "outputs": [], "source": [ "# Drop constant columns: keep only those whose number of unique values is not 1\n", "df_no_RTs = df_no_RTs.loc[:, df_no_RTs.apply(pd.Series.nunique) != 1]" ] }, { "cell_type": "code", "execution_count": 6, "id": "8d85b265-d618-4658-900f-01799de01070", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " GROUP NUM_TWEETS \\\n", "USER_ID \n", "3635069722 DEPRESSIVE 1457 \n", "4475963063 DEPRESSIVE 2289 \n", "1571103494 DEPRESSIVE 825 \n", "4900730194 DEPRESSIVE 544 \n", "727283884786876416 DEPRESSIVE 2218 \n", "... ... ... \n", "3034331163 CONTROL 1290 \n", "306347478 CONTROL 2211 \n", "378974198 CONTROL 1151 \n", "196211756 CONTROL 2010 \n", "4059261852 CONTROL 2957 \n", "\n", " AVG_NUM_CHARS_PER_TWEET__ALL_CHARS \\\n", "USER_ID \n", "3635069722 65.642416 \n", "4475963063 52.017475 \n", "1571103494 67.032727 \n", "4900730194 70.345588 \n", "727283884786876416 72.946799 \n", "... ... \n", "3034331163 60.248837 \n", "306347478 82.498869 \n", "378974198 81.065161 \n", "196211756 55.434328 \n", "4059261852 62.686845 \n", "\n", " AVG_NUM_CHARS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "USER_ID \n", "3635069722 57.482498 \n", "4475963063 52.710791 \n", "1571103494 64.423030 \n", "4900730194 64.156250 \n", "727283884786876416 72.027953 \n", "... ... \n", "3034331163 45.580620 \n", "306347478 76.832203 \n", "378974198 61.665508 \n", "196211756 50.912438 \n", "4059261852 54.029760 \n", "\n", " AVG_NUM_WORDS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "USER_ID \n", "3635069722 12.027454 \n", "4475963063 11.259065 \n", "1571103494 12.950303 \n", "4900730194 13.750000 \n", "727283884786876416 15.227683 \n", "... ... \n", "3034331163 10.251938 \n", "306347478 17.119403 \n", "378974198 12.579496 \n", "196211756 10.810945 \n", "4059261852 12.377748 \n", "\n", " AVG_NUM_EMOJIS_PER_TWEET AVG_NUM_LINKS_PER_TWEET \\\n", "USER_ID \n", "3635069722 0.131091 0.349348 \n", "4475963063 0.010485 0.014417 \n", "1571103494 0.006061 0.078788 \n", "4900730194 0.099265 0.283088 \n", "727283884786876416 0.128945 0.059062 \n", "... ... ... 
\n", "3034331163 1.248837 0.280620 \n", "306347478 0.884215 0.182270 \n", "378974198 0.053866 0.265856 \n", "196211756 0.218905 0.102985 \n", "4059261852 0.505242 0.181603 \n", "\n", " AVG_NUM_MENTIONS_PER_TWEET AVG_NUM_HASHTAGS_PER_TWEET \\\n", "USER_ID \n", "3635069722 0.094029 0.094029 \n", "4475963063 0.012232 0.012232 \n", "1571103494 0.116364 0.116364 \n", "4900730194 0.042279 0.042279 \n", "727283884786876416 0.039225 0.039225 \n", "... ... ... \n", "3034331163 0.763566 0.763566 \n", "306347478 0.309362 0.309362 \n", "378974198 0.995656 0.995656 \n", "196211756 0.269652 0.269652 \n", "4059261852 0.383497 0.383497 \n", "\n", " NUM_TWEET_BY_DAY_HOUR__0 ... \\\n", "USER_ID ... \n", "3635069722 74 ... \n", "4475963063 226 ... \n", "1571103494 11 ... \n", "4900730194 15 ... \n", "727283884786876416 252 ... \n", "... ... ... \n", "3034331163 12 ... \n", "306347478 159 ... \n", "378974198 72 ... \n", "196211756 111 ... \n", "4059261852 96 ... \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "USER_ID \n", "3635069722 225 \n", "4475963063 452 \n", "1571103494 195 \n", "4900730194 100 \n", "727283884786876416 511 \n", "... ... \n", "3034331163 254 \n", "306347478 437 \n", "378974198 233 \n", "196211756 334 \n", "4059261852 469 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "USER_ID \n", "3635069722 0.154427 \n", "4475963063 0.197466 \n", "1571103494 0.236364 \n", "4900730194 0.183824 \n", "727283884786876416 0.230388 \n", "... ... \n", "3034331163 0.196899 \n", "306347478 0.197648 \n", "378974198 0.202433 \n", "196211756 0.166169 \n", "4059261852 0.158607 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "USER_ID \n", "3635069722 337 \n", "4475963063 500 \n", "1571103494 201 \n", "4900730194 139 \n", "727283884786876416 599 \n", "... ... 
\n", "3034331163 69 \n", "306347478 396 \n", "378974198 148 \n", "196211756 313 \n", "4059261852 423 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "USER_ID \n", "3635069722 0.231297 \n", "4475963063 0.218436 \n", "1571103494 0.243636 \n", "4900730194 0.255515 \n", "727283884786876416 0.270063 \n", "... ... \n", "3034331163 0.053488 \n", "306347478 0.179104 \n", "378974198 0.128584 \n", "196211756 0.155721 \n", "4059261852 0.143050 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 497 \n", "4475963063 863 \n", "1571103494 351 \n", "4900730194 208 \n", "727283884786876416 986 \n", "... ... \n", "3034331163 292 \n", "306347478 738 \n", "378974198 351 \n", "196211756 576 \n", "4059261852 816 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 0.341112 \n", "4475963063 0.377021 \n", "1571103494 0.425455 \n", "4900730194 0.382353 \n", "727283884786876416 0.444545 \n", "... ... \n", "3034331163 0.226357 \n", "306347478 0.333786 \n", "378974198 0.304952 \n", "196211756 0.286567 \n", "4059261852 0.275955 \n", "\n", " NUM_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 65 \n", "4475963063 89 \n", "1571103494 45 \n", "4900730194 31 \n", "727283884786876416 124 \n", "... ... \n", "3034331163 31 \n", "306347478 95 \n", "378974198 30 \n", "196211756 71 \n", "4059261852 76 \n", "\n", " PERC_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 0.044612 \n", "4475963063 0.038882 \n", "1571103494 0.054545 \n", "4900730194 0.056985 \n", "727283884786876416 0.055906 \n", "... ... \n", "3034331163 0.024031 \n", "306347478 0.042967 \n", "378974198 0.026064 \n", "196211756 0.035323 \n", "4059261852 0.025702 \n", "\n", " NUM_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 432 \n", "4475963063 774 \n", "1571103494 306 \n", "4900730194 177 \n", "727283884786876416 862 \n", "... ... 
\n", "3034331163 261 \n", "306347478 643 \n", "378974198 321 \n", "196211756 505 \n", "4059261852 740 \n", "\n", " PERC_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \n", "USER_ID \n", "3635069722 0.296500 \n", "4475963063 0.338139 \n", "1571103494 0.370909 \n", "4900730194 0.325368 \n", "727283884786876416 0.388638 \n", "... ... \n", "3034331163 0.202326 \n", "306347478 0.290819 \n", "378974198 0.278888 \n", "196211756 0.251244 \n", "4059261852 0.250254 \n", "\n", "[540 rows x 131 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_no_RTs" ] }, { "cell_type": "code", "execution_count": 7, "id": "e2031371-e22a-477d-82f0-4b3b15872c39", "metadata": {}, "outputs": [], "source": [ "cols = df_no_RTs.columns\n", "col_AVG_PERC = [col for col in cols if (col.startswith('AVG') or col.startswith('PERC'))]\n", "col_NUM = [col for col in cols if col.startswith('NUM')]" ] }, { "cell_type": "code", "execution_count": 8, "id": "7e8b1fed-9484-4c3e-83ed-931e80caa8b9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['AVG_NUM_CHARS_PER_TWEET__ALL_CHARS',\n", " 'AVG_NUM_CHARS_PER_TWEET__NO_MENTION_HASH_LINK',\n", " 'AVG_NUM_WORDS_PER_TWEET__NO_MENTION_HASH_LINK',\n", " 'AVG_NUM_EMOJIS_PER_TWEET',\n", " 'AVG_NUM_LINKS_PER_TWEET',\n", " 'AVG_NUM_MENTIONS_PER_TWEET',\n", " 'AVG_NUM_HASHTAGS_PER_TWEET',\n", " 'PERC_TWEET_BY_DAY_HOUR__0',\n", " 'PERC_TWEET_BY_DAY_HOUR__1',\n", " 'PERC_TWEET_BY_DAY_HOUR__2',\n", " 'PERC_TWEET_BY_DAY_HOUR__3',\n", " 'PERC_TWEET_BY_DAY_HOUR__4',\n", " 'PERC_TWEET_BY_DAY_HOUR__5',\n", " 'PERC_TWEET_BY_DAY_HOUR__6',\n", " 'PERC_TWEET_BY_DAY_HOUR__7',\n", " 'PERC_TWEET_BY_DAY_HOUR__8',\n", " 'PERC_TWEET_BY_DAY_HOUR__9',\n", " 'PERC_TWEET_BY_DAY_HOUR__10',\n", " 'PERC_TWEET_BY_DAY_HOUR__11',\n", " 'PERC_TWEET_BY_DAY_HOUR__12',\n", " 'PERC_TWEET_BY_DAY_HOUR__13',\n", " 'PERC_TWEET_BY_DAY_HOUR__14',\n", " 'PERC_TWEET_BY_DAY_HOUR__15',\n", " 'PERC_TWEET_BY_DAY_HOUR__16',\n", " 'PERC_TWEET_BY_DAY_HOUR__17',\n", 
" 'PERC_TWEET_BY_DAY_HOUR__18',\n", " 'PERC_TWEET_BY_DAY_HOUR__19',\n", " 'PERC_TWEET_BY_DAY_HOUR__20',\n", " 'PERC_TWEET_BY_DAY_HOUR__21',\n", " 'PERC_TWEET_BY_DAY_HOUR__22',\n", " 'PERC_TWEET_BY_DAY_HOUR__23',\n", " 'PERC_TWEET_POSTED_FROM_23_TO_6',\n", " 'PERC_TWEET_BY_WEEK_DAY__0',\n", " 'PERC_TWEET_BY_WEEK_DAY__1',\n", " 'PERC_TWEET_BY_WEEK_DAY__2',\n", " 'PERC_TWEET_BY_WEEK_DAY__3',\n", " 'PERC_TWEET_BY_WEEK_DAY__4',\n", " 'PERC_TWEET_BY_WEEK_DAY__5',\n", " 'PERC_TWEET_BY_WEEK_DAY__6',\n", " 'PERC_TWEET_POSTED_WEEKEND',\n", " 'PERC_NOUNS',\n", " 'PERC_VERBS',\n", " 'PERC_ADVERBS',\n", " 'PERC_ADJECTIVES',\n", " 'PERC_PRONOUNS',\n", " 'PERC_PERSONAL_PRONOUNS',\n", " 'PERC_PERS_PRONOUNS_1S',\n", " 'PERC_PERS_PRONOUNS_2S',\n", " 'PERC_PERS_PRONOUNS_3S',\n", " 'PERC_PERS_PRONOUNS_1P',\n", " 'PERC_PERS_PRONOUNS_2P',\n", " 'PERC_PERS_PRONOUNS_3P',\n", " 'PERC_PERS_PRONOUNS_3N',\n", " 'PERC_TWEETS_EMOTION_Tristeza',\n", " 'PERC_TWEETS_EMOTION_Alegría',\n", " 'PERC_TWEETS_EMOTION_Enojo',\n", " 'PERC_TWEETS_EMOTION_Sorpresa',\n", " 'PERC_TWEETS_EMOTION_Repulsión',\n", " 'PERC_TWEETS_EMOTION_Miedo',\n", " 'PERC_TWEETS_WITH_EMOTION_WORDS',\n", " 'PERC_TWEETS_WITH_MIXED_EMOTIONS',\n", " 'PERC_TWEETS_WITH_SINGLE_EMOTIONS',\n", " 'PERC_TWEETS_WITH_NEGATION_WORDS',\n", " 'PERC_TWEETS_WITH_SENTI_LEX_POLARITY_POS',\n", " 'PERC_TWEETS_WITH_SENTI_LEX_POLARITY_NEG',\n", " 'PERC_TWEETS_WITH_SENTI_LEX_POLARITY',\n", " 'PERC_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY',\n", " 'PERC_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col_AVG_PERC" ] }, { "cell_type": "code", "execution_count": 9, "id": "ff41e82b-9793-44a3-9f5e-cb663aeedf8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['NUM_TWEETS',\n", " 'NUM_TWEET_BY_DAY_HOUR__0',\n", " 'NUM_TWEET_BY_DAY_HOUR__1',\n", " 'NUM_TWEET_BY_DAY_HOUR__2',\n", " 'NUM_TWEET_BY_DAY_HOUR__3',\n", " 'NUM_TWEET_BY_DAY_HOUR__4',\n", " 
'NUM_TWEET_BY_DAY_HOUR__5',\n", " 'NUM_TWEET_BY_DAY_HOUR__6',\n", " 'NUM_TWEET_BY_DAY_HOUR__7',\n", " 'NUM_TWEET_BY_DAY_HOUR__8',\n", " 'NUM_TWEET_BY_DAY_HOUR__9',\n", " 'NUM_TWEET_BY_DAY_HOUR__10',\n", " 'NUM_TWEET_BY_DAY_HOUR__11',\n", " 'NUM_TWEET_BY_DAY_HOUR__12',\n", " 'NUM_TWEET_BY_DAY_HOUR__13',\n", " 'NUM_TWEET_BY_DAY_HOUR__14',\n", " 'NUM_TWEET_BY_DAY_HOUR__15',\n", " 'NUM_TWEET_BY_DAY_HOUR__16',\n", " 'NUM_TWEET_BY_DAY_HOUR__17',\n", " 'NUM_TWEET_BY_DAY_HOUR__18',\n", " 'NUM_TWEET_BY_DAY_HOUR__19',\n", " 'NUM_TWEET_BY_DAY_HOUR__20',\n", " 'NUM_TWEET_BY_DAY_HOUR__21',\n", " 'NUM_TWEET_BY_DAY_HOUR__22',\n", " 'NUM_TWEET_BY_DAY_HOUR__23',\n", " 'NUM_TWEET_POSTED_FROM_23_TO_6',\n", " 'NUM_TWEET_BY_WEEK_DAY__0',\n", " 'NUM_TWEET_BY_WEEK_DAY__1',\n", " 'NUM_TWEET_BY_WEEK_DAY__2',\n", " 'NUM_TWEET_BY_WEEK_DAY__3',\n", " 'NUM_TWEET_BY_WEEK_DAY__4',\n", " 'NUM_TWEET_BY_WEEK_DAY__5',\n", " 'NUM_TWEET_BY_WEEK_DAY__6',\n", " 'NUM_TWEET_POSTED_WEEKEND',\n", " 'NUM_NOUNS',\n", " 'NUM_VERBS',\n", " 'NUM_ADVERBS',\n", " 'NUM_ADJECTIVES',\n", " 'NUM_PRONOUNS',\n", " 'NUM_PERSONAL_PRONOUNS',\n", " 'NUM_PERS_PRONOUNS_1S',\n", " 'NUM_PERS_PRONOUNS_2S',\n", " 'NUM_PERS_PRONOUNS_3S',\n", " 'NUM_PERS_PRONOUNS_1P',\n", " 'NUM_PERS_PRONOUNS_2P',\n", " 'NUM_PERS_PRONOUNS_3P',\n", " 'NUM_PERS_PRONOUNS_3N',\n", " 'NUM_TWEETS_EMOTION_Tristeza',\n", " 'NUM_TWEETS_EMOTION_Alegría',\n", " 'NUM_TWEETS_EMOTION_Enojo',\n", " 'NUM_TWEETS_EMOTION_Sorpresa',\n", " 'NUM_TWEETS_EMOTION_Repulsión',\n", " 'NUM_TWEETS_EMOTION_Miedo',\n", " 'NUM_TWEETS_WITH_EMOTION_WORDS',\n", " 'NUM_TWEETS_WITH_MIXED_EMOTIONS',\n", " 'NUM_TWEETS_WITH_SINGLE_EMOTIONS',\n", " 'NUM_TWEETS_WITH_NEGATION_WORDS',\n", " 'NUM_TWEETS_WITH_SENTI_LEX_POLARITY_POS',\n", " 'NUM_TWEETS_WITH_SENTI_LEX_POLARITY_NEG',\n", " 'NUM_TWEETS_WITH_SENTI_LEX_POLARITY',\n", " 'NUM_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY',\n", " 'NUM_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY']" ] }, "execution_count": 9, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "col_NUM" ] }, { "cell_type": "code", "execution_count": 10, "id": "07cf7bf9-7399-414d-a0f6-08933fb17a23", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " NUM_TWEETS NUM_TWEET_BY_DAY_HOUR__23 \\\n", "USER_ID \n", "3635069722 1457 85 \n", "4475963063 2289 250 \n", "1571103494 825 25 \n", "4900730194 544 13 \n", "727283884786876416 2218 222 \n", "... ... ... \n", "3034331163 1290 146 \n", "306347478 2211 102 \n", "378974198 1151 49 \n", "196211756 2010 102 \n", "4059261852 2957 147 \n", "\n", " PERC_TWEET_BY_DAY_HOUR__23 \n", "USER_ID \n", "3635069722 0.058339 \n", "4475963063 0.109218 \n", "1571103494 0.030303 \n", "4900730194 0.023897 \n", "727283884786876416 0.100090 \n", "... ... \n", "3034331163 0.113178 \n", "306347478 0.046133 \n", "378974198 0.042572 \n", "196211756 0.050746 \n", "4059261852 0.049713 \n", "\n", "[540 rows x 3 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_no_RTs[['NUM_TWEETS','NUM_TWEET_BY_DAY_HOUR__23','PERC_TWEET_BY_DAY_HOUR__23']]" ] }, { "cell_type": "code", "execution_count": 11, "id": "c05b5ffb-5440-4b2a-9dc2-756fb1a7abeb", "metadata": {}, "outputs": [], "source": [ "# Feature matrix: every column except the GROUP label\n", "df0 = df_no_RTs.loc[:, df_no_RTs.columns != 'GROUP']" ] }, { "cell_type": "code", "execution_count": 12, "id": "cf50d127-3364-40f4-a573-3423001858c1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " NUM_TWEETS AVG_NUM_CHARS_PER_TWEET__ALL_CHARS \\\n", "USER_ID \n", "3635069722 1457 65.642416 \n", "4475963063 2289 52.017475 \n", "1571103494 825 67.032727 \n", "4900730194 544 70.345588 \n", "727283884786876416 2218 72.946799 \n", "... ... ... \n", "3034331163 1290 60.248837 \n", "306347478 2211 82.498869 \n", "378974198 1151 81.065161 \n", "196211756 2010 55.434328 \n", "4059261852 2957 62.686845 \n", "\n", " AVG_NUM_CHARS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "USER_ID \n", "3635069722 57.482498 \n", "4475963063 52.710791 \n", "1571103494 64.423030 \n", "4900730194 64.156250 \n", "727283884786876416 72.027953 \n", "... ... \n", "3034331163 45.580620 \n", "306347478 76.832203 \n", "378974198 61.665508 \n", "196211756 50.912438 \n", "4059261852 54.029760 \n", "\n", " AVG_NUM_WORDS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "USER_ID \n", "3635069722 12.027454 \n", "4475963063 11.259065 \n", "1571103494 12.950303 \n", "4900730194 13.750000 \n", "727283884786876416 15.227683 \n", "... ... \n", "3034331163 10.251938 \n", "306347478 17.119403 \n", "378974198 12.579496 \n", "196211756 10.810945 \n", "4059261852 12.377748 \n", "\n", " AVG_NUM_EMOJIS_PER_TWEET AVG_NUM_LINKS_PER_TWEET \\\n", "USER_ID \n", "3635069722 0.131091 0.349348 \n", "4475963063 0.010485 0.014417 \n", "1571103494 0.006061 0.078788 \n", "4900730194 0.099265 0.283088 \n", "727283884786876416 0.128945 0.059062 \n", "... ... ... \n", "3034331163 1.248837 0.280620 \n", "306347478 0.884215 0.182270 \n", "378974198 0.053866 0.265856 \n", "196211756 0.218905 0.102985 \n", "4059261852 0.505242 0.181603 \n", "\n", " AVG_NUM_MENTIONS_PER_TWEET AVG_NUM_HASHTAGS_PER_TWEET \\\n", "USER_ID \n", "3635069722 0.094029 0.094029 \n", "4475963063 0.012232 0.012232 \n", "1571103494 0.116364 0.116364 \n", "4900730194 0.042279 0.042279 \n", "727283884786876416 0.039225 0.039225 \n", "... ... ... 
\n", "3034331163 0.763566 0.763566 \n", "306347478 0.309362 0.309362 \n", "378974198 0.995656 0.995656 \n", "196211756 0.269652 0.269652 \n", "4059261852 0.383497 0.383497 \n", "\n", " NUM_TWEET_BY_DAY_HOUR__0 PERC_TWEET_BY_DAY_HOUR__0 ... \\\n", "USER_ID ... \n", "3635069722 74 0.050789 ... \n", "4475963063 226 0.098733 ... \n", "1571103494 11 0.013333 ... \n", "4900730194 15 0.027574 ... \n", "727283884786876416 252 0.113616 ... \n", "... ... ... ... \n", "3034331163 12 0.009302 ... \n", "306347478 159 0.071913 ... \n", "378974198 72 0.062554 ... \n", "196211756 111 0.055224 ... \n", "4059261852 96 0.032465 ... \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "USER_ID \n", "3635069722 225 \n", "4475963063 452 \n", "1571103494 195 \n", "4900730194 100 \n", "727283884786876416 511 \n", "... ... \n", "3034331163 254 \n", "306347478 437 \n", "378974198 233 \n", "196211756 334 \n", "4059261852 469 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "USER_ID \n", "3635069722 0.154427 \n", "4475963063 0.197466 \n", "1571103494 0.236364 \n", "4900730194 0.183824 \n", "727283884786876416 0.230388 \n", "... ... \n", "3034331163 0.196899 \n", "306347478 0.197648 \n", "378974198 0.202433 \n", "196211756 0.166169 \n", "4059261852 0.158607 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "USER_ID \n", "3635069722 337 \n", "4475963063 500 \n", "1571103494 201 \n", "4900730194 139 \n", "727283884786876416 599 \n", "... ... \n", "3034331163 69 \n", "306347478 396 \n", "378974198 148 \n", "196211756 313 \n", "4059261852 423 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "USER_ID \n", "3635069722 0.231297 \n", "4475963063 0.218436 \n", "1571103494 0.243636 \n", "4900730194 0.255515 \n", "727283884786876416 0.270063 \n", "... ... 
\n", "3034331163 0.053488 \n", "306347478 0.179104 \n", "378974198 0.128584 \n", "196211756 0.155721 \n", "4059261852 0.143050 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 497 \n", "4475963063 863 \n", "1571103494 351 \n", "4900730194 208 \n", "727283884786876416 986 \n", "... ... \n", "3034331163 292 \n", "306347478 738 \n", "378974198 351 \n", "196211756 576 \n", "4059261852 816 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 0.341112 \n", "4475963063 0.377021 \n", "1571103494 0.425455 \n", "4900730194 0.382353 \n", "727283884786876416 0.444545 \n", "... ... \n", "3034331163 0.226357 \n", "306347478 0.333786 \n", "378974198 0.304952 \n", "196211756 0.286567 \n", "4059261852 0.275955 \n", "\n", " NUM_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 65 \n", "4475963063 89 \n", "1571103494 45 \n", "4900730194 31 \n", "727283884786876416 124 \n", "... ... \n", "3034331163 31 \n", "306347478 95 \n", "378974198 30 \n", "196211756 71 \n", "4059261852 76 \n", "\n", " PERC_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 0.044612 \n", "4475963063 0.038882 \n", "1571103494 0.054545 \n", "4900730194 0.056985 \n", "727283884786876416 0.055906 \n", "... ... \n", "3034331163 0.024031 \n", "306347478 0.042967 \n", "378974198 0.026064 \n", "196211756 0.035323 \n", "4059261852 0.025702 \n", "\n", " NUM_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 432 \n", "4475963063 774 \n", "1571103494 306 \n", "4900730194 177 \n", "727283884786876416 862 \n", "... ... \n", "3034331163 261 \n", "306347478 643 \n", "378974198 321 \n", "196211756 505 \n", "4059261852 740 \n", "\n", " PERC_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \n", "USER_ID \n", "3635069722 0.296500 \n", "4475963063 0.338139 \n", "1571103494 0.370909 \n", "4900730194 0.325368 \n", "727283884786876416 0.388638 \n", "... ... 
\n", "3034331163 0.202326 \n", "306347478 0.290819 \n", "378974198 0.278888 \n", "196211756 0.251244 \n", "4059261852 0.250254 \n", "\n", "[540 rows x 130 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0" ] }, { "cell_type": "markdown", "id": "35d321ab-11e1-4ba3-a2ca-8417b9380896", "metadata": {}, "source": [ "Prepare synthetic variables that group the hourly time variables" ] }, { "cell_type": "code", "execution_count": 13, "id": "dba85603-ee86-4809-9424-e5b16b15271f", "metadata": {}, "outputs": [], "source": [ "df_no_RTs['NUM_TWEET_NIGHT'] = df_no_RTs['NUM_TWEET_BY_DAY_HOUR__0'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__1'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__2'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__3'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__4'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__5'] \n", "\n", "df_no_RTs['PERC_TWEET_NIGHT'] = df_no_RTs['PERC_TWEET_BY_DAY_HOUR__0'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__1'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__2'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__3'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__4'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__5'] \n", "\n", "\n", "\n", "df_no_RTs['NUM_TWEET_MORNING'] = df_no_RTs['NUM_TWEET_BY_DAY_HOUR__6'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__7'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__8'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__9'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__10'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__11'] \n", "\n", "df_no_RTs['PERC_TWEET_MORNING'] = df_no_RTs['PERC_TWEET_BY_DAY_HOUR__6'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__7'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__8'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__9'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__10'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__11'] \n", "\n", "\n", "\n", "df_no_RTs['NUM_TWEET_AFTERNOON'] = df_no_RTs['NUM_TWEET_BY_DAY_HOUR__12'] + 
\\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__13'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__14'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__15'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__16'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__17'] \n", "\n", "df_no_RTs['PERC_TWEET_AFTERNOON'] = df_no_RTs['PERC_TWEET_BY_DAY_HOUR__12'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__13'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__14'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__15'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__16'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__17'] \n", "\n", "\n", "\n", "df_no_RTs['NUM_TWEET_EVENING'] = df_no_RTs['NUM_TWEET_BY_DAY_HOUR__18'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__19'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__20'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__21'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__22'] + \\\n", " df_no_RTs['NUM_TWEET_BY_DAY_HOUR__23'] \n", "\n", "df_no_RTs['PERC_TWEET_EVENING'] = df_no_RTs['PERC_TWEET_BY_DAY_HOUR__18'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__19'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__20'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__21'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__22'] + \\\n", " df_no_RTs['PERC_TWEET_BY_DAY_HOUR__23'] " ] }, { "cell_type": "code", "execution_count": 14, "id": "bbdecf31-ab3c-4b7b-818f-66350a302f3c", "metadata": {}, "outputs": [], "source": [ "original_time_variables_NUM = ['NUM_TWEET_BY_DAY_HOUR__0',\n", " 'NUM_TWEET_BY_DAY_HOUR__1',\n", " 'NUM_TWEET_BY_DAY_HOUR__2',\n", " 'NUM_TWEET_BY_DAY_HOUR__3',\n", " 'NUM_TWEET_BY_DAY_HOUR__4',\n", " 'NUM_TWEET_BY_DAY_HOUR__5',\n", " 'NUM_TWEET_BY_DAY_HOUR__6',\n", " 'NUM_TWEET_BY_DAY_HOUR__7',\n", " 'NUM_TWEET_BY_DAY_HOUR__8',\n", " 'NUM_TWEET_BY_DAY_HOUR__9',\n", " 'NUM_TWEET_BY_DAY_HOUR__10',\n", " 'NUM_TWEET_BY_DAY_HOUR__11',\n", " 'NUM_TWEET_BY_DAY_HOUR__12',\n", " 'NUM_TWEET_BY_DAY_HOUR__13', \n", " 'NUM_TWEET_BY_DAY_HOUR__14',\n", " 
'NUM_TWEET_BY_DAY_HOUR__15',\n", " 'NUM_TWEET_BY_DAY_HOUR__16',\n", " 'NUM_TWEET_BY_DAY_HOUR__17',\n", " 'NUM_TWEET_BY_DAY_HOUR__18',\n", " 'NUM_TWEET_BY_DAY_HOUR__19',\n", " 'NUM_TWEET_BY_DAY_HOUR__20',\n", " 'NUM_TWEET_BY_DAY_HOUR__21',\n", " 'NUM_TWEET_BY_DAY_HOUR__22',\n", " 'NUM_TWEET_BY_DAY_HOUR__23']\n", "\n", "original_time_variables_PERC = ['PERC_TWEET_BY_DAY_HOUR__0',\n", " 'PERC_TWEET_BY_DAY_HOUR__1',\n", " 'PERC_TWEET_BY_DAY_HOUR__2',\n", " 'PERC_TWEET_BY_DAY_HOUR__3',\n", " 'PERC_TWEET_BY_DAY_HOUR__4',\n", " 'PERC_TWEET_BY_DAY_HOUR__5',\n", " 'PERC_TWEET_BY_DAY_HOUR__6',\n", " 'PERC_TWEET_BY_DAY_HOUR__7',\n", " 'PERC_TWEET_BY_DAY_HOUR__8',\n", " 'PERC_TWEET_BY_DAY_HOUR__9',\n", " 'PERC_TWEET_BY_DAY_HOUR__10',\n", " 'PERC_TWEET_BY_DAY_HOUR__11',\n", " 'PERC_TWEET_BY_DAY_HOUR__12',\n", " 'PERC_TWEET_BY_DAY_HOUR__13', \n", " 'PERC_TWEET_BY_DAY_HOUR__14',\n", " 'PERC_TWEET_BY_DAY_HOUR__15',\n", " 'PERC_TWEET_BY_DAY_HOUR__16',\n", " 'PERC_TWEET_BY_DAY_HOUR__17',\n", " 'PERC_TWEET_BY_DAY_HOUR__18',\n", " 'PERC_TWEET_BY_DAY_HOUR__19',\n", " 'PERC_TWEET_BY_DAY_HOUR__20',\n", " 'PERC_TWEET_BY_DAY_HOUR__21',\n", " 'PERC_TWEET_BY_DAY_HOUR__22',\n", " 'PERC_TWEET_BY_DAY_HOUR__23']\n", "\n", "aggregated_time_variables_NUM = ['NUM_TWEET_NIGHT',\n", " 'NUM_TWEET_MORNING',\n", " 'NUM_TWEET_AFTERNOON',\n", " 'NUM_TWEET_EVENING']\n", "aggregated_time_variables_PERC = ['PERC_TWEET_NIGHT',\n", " 'PERC_TWEET_MORNING',\n", " 'PERC_TWEET_AFTERNOON',\n", " 'PERC_TWEET_EVENING']" ] }, { "cell_type": "code", "execution_count": 15, "id": "30acd3c4-7a29-4f1f-8294-0fd24ec2f9fd", "metadata": {}, "outputs": [], "source": [ "for element in col_NUM:\n", " if element in original_time_variables_NUM:\n", " col_NUM.remove(element)\n", "\n", "\n", "for element in col_AVG_PERC:\n", " if element in original_time_variables_PERC:\n", " col_AVG_PERC.remove(element)" ] }, { "cell_type": "markdown", "id": "8d138734-1b98-447b-9f43-f302d4cb2d34", "metadata": {}, "source": [ "| Dataset | NUM vs 
AVG-PERC | Time variables | PCA |\n", "|:-:|:-:|:-:|:-:|\n", "| **df0** | Both | Original | No |\n", "| | | | |\n", "| **df1** | NUM | Original | No |\n", "| **df2** | PERC/AVG | Original | No |\n", "| **df3** | NUM | Grouped | No |\n", "| **df4** | PERC/AVG | Grouped | No |\n", "| **df5** | NUM | Original | Yes |\n", "| **df6** | PERC/AVG | Original | Yes |\n", "| **df7** | NUM | Grouped | Yes |\n", "| **df8** | PERC/AVG | Grouped | Yes |" ] }, { "cell_type": "code", "execution_count": 16, "id": "a839a78d-8435-4ac8-8300-d0d42290a587", "metadata": {}, "outputs": [], "source": [ "df1 = df_no_RTs[col_NUM]\n", "df2 = df_no_RTs[col_AVG_PERC]\n", "df3 = df_no_RTs[col_NUM + aggregated_time_variables_NUM]\n", "df4 = df_no_RTs[col_AVG_PERC + aggregated_time_variables_PERC]" ] }, { "cell_type": "code", "execution_count": 17, "id": "6244db3d-3c54-4e3d-9cde-4868168ed557", "metadata": {}, "outputs": [], "source": [ "scaler = MinMaxScaler()\n", "df1_scaled = scaler.fit_transform(df1)\n", "df2_scaled = scaler.fit_transform(df2)\n", "df3_scaled = scaler.fit_transform(df3)\n", "df4_scaled = scaler.fit_transform(df4)" ] }, { "cell_type": "markdown", "id": "d5ee67df-0b0d-482c-b17f-5b5eb2666608", "metadata": {}, "source": [ "PCA retaining 95% of the variance" ] }, { "cell_type": "code", "execution_count": 18, "id": "240ae2cf-1910-48dd-b8d3-c2179d733634", "metadata": {}, "outputs": [], "source": [ "pca = PCA(n_components = 0.95)\n", "\n", "pca.fit(df1_scaled)\n", "df5 = pd.DataFrame(pca.transform(df1_scaled))\n", "\n", "pca.fit(df2_scaled)\n", "df6 = pd.DataFrame(pca.transform(df2_scaled))\n", "\n", "pca.fit(df3_scaled)\n", "df7 = pd.DataFrame(pca.transform(df3_scaled))\n", "\n", "pca.fit(df4_scaled)\n", "df8 = pd.DataFrame(pca.transform(df4_scaled))" ] }, { "cell_type": "code", "execution_count": 19, "id": "b493dc12-33c5-4b0d-84d8-24deb899c24f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset 0, # variables: 130\n", "Dataset 1, # variables: 50\n", 
"Dataset 2, # variables: 56\n", "Dataset 3, # variables: 54\n", "Dataset 4, # variables: 60\n", "Dataset 5, # variables: 17\n", "Dataset 6, # variables: 28\n", "Dataset 7, # variables: 16\n", "Dataset 8, # variables: 28\n" ] } ], "source": [ "print('Dataset 0, # variables: ', len(df0.columns))\n", "print('Dataset 1, # variables: ', len(df1.columns))\n", "print('Dataset 2, # variables: ', len(df2.columns))\n", "print('Dataset 3, # variables: ', len(df3.columns))\n", "print('Dataset 4, # variables: ', len(df4.columns))\n", "print('Dataset 5, # variables: ', len(df5.columns))\n", "print('Dataset 6, # variables: ', len(df6.columns))\n", "print('Dataset 7, # variables: ', len(df7.columns))\n", "print('Dataset 8, # variables: ', len(df8.columns))" ] }, { "cell_type": "markdown", "id": "a50b22bd-2453-4cb1-a2a4-02878a176f15", "metadata": {}, "source": [ "Add the target variable GROUP to every dataset" ] }, { "cell_type": "code", "execution_count": 20, "id": "3ffcb9ca-b01e-4f34-bfe8-082b2aec7b0f", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Lucia\\AppData\\Local\\Temp/ipykernel_24304/584712795.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df0_w_t['GROUP'] = df_no_RTs['GROUP']\n", "C:\\Users\\Lucia\\AppData\\Local\\Temp/ipykernel_24304/584712795.py:5: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df1_w_t['GROUP'] = df_no_RTs['GROUP']\n", 
"C:\\Users\\Lucia\\AppData\\Local\\Temp/ipykernel_24304/584712795.py:8: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df2_w_t['GROUP'] = df_no_RTs['GROUP']\n", "C:\\Users\\Lucia\\AppData\\Local\\Temp/ipykernel_24304/584712795.py:11: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df3_w_t['GROUP'] = df_no_RTs['GROUP']\n", "C:\\Users\\Lucia\\AppData\\Local\\Temp/ipykernel_24304/584712795.py:14: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df4_w_t['GROUP'] = df_no_RTs['GROUP']\n" ] } ], "source": [ "df0_w_t = df0\n", "df0_w_t['GROUP'] = df_no_RTs['GROUP']\n", "\n", "df1_w_t = df1\n", "df1_w_t['GROUP'] = df_no_RTs['GROUP']\n", "\n", "df2_w_t = df2\n", "df2_w_t['GROUP'] = df_no_RTs['GROUP']\n", "\n", "df3_w_t = df3\n", "df3_w_t['GROUP'] = df_no_RTs['GROUP']\n", "\n", "df4_w_t = df4\n", "df4_w_t['GROUP'] = df_no_RTs['GROUP']\n", "\n", "df5_w_t = df5\n", "df5_w_t['GROUP'] = df_no_RTs['GROUP'].to_list()\n", "\n", "df6_w_t = df6\n", "df6_w_t['GROUP'] = df_no_RTs['GROUP'].to_list()\n", "\n", "df7_w_t = df7\n", "df7_w_t['GROUP'] = df_no_RTs['GROUP'].to_list()\n", "\n", "df8_w_t = df8\n", "df8_w_t['GROUP'] = df_no_RTs['GROUP'].to_list()" ] }, { "cell_type": "code", "execution_count": 21, 
"id": "ed95b852-e61e-4a7c-a3f1-474e0168715d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NUM_TWEETSAVG_NUM_CHARS_PER_TWEET__ALL_CHARSAVG_NUM_CHARS_PER_TWEET__NO_MENTION_HASH_LINKAVG_NUM_WORDS_PER_TWEET__NO_MENTION_HASH_LINKAVG_NUM_EMOJIS_PER_TWEETAVG_NUM_LINKS_PER_TWEETAVG_NUM_MENTIONS_PER_TWEETAVG_NUM_HASHTAGS_PER_TWEETNUM_TWEET_BY_DAY_HOUR__0PERC_TWEET_BY_DAY_HOUR__0...PERC_TWEETS_WITH_SENTI_LEX_POLARITY_POSNUM_TWEETS_WITH_SENTI_LEX_POLARITY_NEGPERC_TWEETS_WITH_SENTI_LEX_POLARITY_NEGNUM_TWEETS_WITH_SENTI_LEX_POLARITYPERC_TWEETS_WITH_SENTI_LEX_POLARITYNUM_TWEETS_WITH_MIXED_SENTI_LEX_POLARITYPERC_TWEETS_WITH_MIXED_SENTI_LEX_POLARITYNUM_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITYPERC_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITYGROUP
USER_ID
3635069722145765.64241657.48249812.0274540.1310910.3493480.0940290.094029740.050789...0.1544273370.2312974970.341112650.0446124320.296500DEPRESSIVE
4475963063228952.01747552.71079111.2590650.0104850.0144170.0122320.0122322260.098733...0.1974665000.2184368630.377021890.0388827740.338139DEPRESSIVE
157110349482567.03272764.42303012.9503030.0060610.0787880.1163640.116364110.013333...0.2363642010.2436363510.425455450.0545453060.370909DEPRESSIVE
490073019454470.34558864.15625013.7500000.0992650.2830880.0422790.042279150.027574...0.1838241390.2555152080.382353310.0569851770.325368DEPRESSIVE
727283884786876416221872.94679972.02795315.2276830.1289450.0590620.0392250.0392252520.113616...0.2303885990.2700639860.4445451240.0559068620.388638DEPRESSIVE
..................................................................
3034331163129060.24883745.58062010.2519381.2488370.2806200.7635660.763566120.009302...0.196899690.0534882920.226357310.0240312610.202326CONTROL
306347478221182.49886976.83220317.1194030.8842150.1822700.3093620.3093621590.071913...0.1976483960.1791047380.333786950.0429676430.290819CONTROL
378974198115181.06516161.66550812.5794960.0538660.2658560.9956560.995656720.062554...0.2024331480.1285843510.304952300.0260643210.278888CONTROL
196211756201055.43432850.91243810.8109450.2189050.1029850.2696520.2696521110.055224...0.1661693130.1557215760.286567710.0353235050.251244CONTROL
4059261852295762.68684554.02976012.3777480.5052420.1816030.3834970.383497960.032465...0.1586074230.1430508160.275955760.0257027400.250254CONTROL
\n", "

540 rows × 131 columns

\n", "
" ], "text/plain": [ " NUM_TWEETS AVG_NUM_CHARS_PER_TWEET__ALL_CHARS \\\n", "USER_ID \n", "3635069722 1457 65.642416 \n", "4475963063 2289 52.017475 \n", "1571103494 825 67.032727 \n", "4900730194 544 70.345588 \n", "727283884786876416 2218 72.946799 \n", "... ... ... \n", "3034331163 1290 60.248837 \n", "306347478 2211 82.498869 \n", "378974198 1151 81.065161 \n", "196211756 2010 55.434328 \n", "4059261852 2957 62.686845 \n", "\n", " AVG_NUM_CHARS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "USER_ID \n", "3635069722 57.482498 \n", "4475963063 52.710791 \n", "1571103494 64.423030 \n", "4900730194 64.156250 \n", "727283884786876416 72.027953 \n", "... ... \n", "3034331163 45.580620 \n", "306347478 76.832203 \n", "378974198 61.665508 \n", "196211756 50.912438 \n", "4059261852 54.029760 \n", "\n", " AVG_NUM_WORDS_PER_TWEET__NO_MENTION_HASH_LINK \\\n", "USER_ID \n", "3635069722 12.027454 \n", "4475963063 11.259065 \n", "1571103494 12.950303 \n", "4900730194 13.750000 \n", "727283884786876416 15.227683 \n", "... ... \n", "3034331163 10.251938 \n", "306347478 17.119403 \n", "378974198 12.579496 \n", "196211756 10.810945 \n", "4059261852 12.377748 \n", "\n", " AVG_NUM_EMOJIS_PER_TWEET AVG_NUM_LINKS_PER_TWEET \\\n", "USER_ID \n", "3635069722 0.131091 0.349348 \n", "4475963063 0.010485 0.014417 \n", "1571103494 0.006061 0.078788 \n", "4900730194 0.099265 0.283088 \n", "727283884786876416 0.128945 0.059062 \n", "... ... ... \n", "3034331163 1.248837 0.280620 \n", "306347478 0.884215 0.182270 \n", "378974198 0.053866 0.265856 \n", "196211756 0.218905 0.102985 \n", "4059261852 0.505242 0.181603 \n", "\n", " AVG_NUM_MENTIONS_PER_TWEET AVG_NUM_HASHTAGS_PER_TWEET \\\n", "USER_ID \n", "3635069722 0.094029 0.094029 \n", "4475963063 0.012232 0.012232 \n", "1571103494 0.116364 0.116364 \n", "4900730194 0.042279 0.042279 \n", "727283884786876416 0.039225 0.039225 \n", "... ... ... 
\n", "3034331163 0.763566 0.763566 \n", "306347478 0.309362 0.309362 \n", "378974198 0.995656 0.995656 \n", "196211756 0.269652 0.269652 \n", "4059261852 0.383497 0.383497 \n", "\n", " NUM_TWEET_BY_DAY_HOUR__0 PERC_TWEET_BY_DAY_HOUR__0 ... \\\n", "USER_ID ... \n", "3635069722 74 0.050789 ... \n", "4475963063 226 0.098733 ... \n", "1571103494 11 0.013333 ... \n", "4900730194 15 0.027574 ... \n", "727283884786876416 252 0.113616 ... \n", "... ... ... ... \n", "3034331163 12 0.009302 ... \n", "306347478 159 0.071913 ... \n", "378974198 72 0.062554 ... \n", "196211756 111 0.055224 ... \n", "4059261852 96 0.032465 ... \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_POS \\\n", "USER_ID \n", "3635069722 0.154427 \n", "4475963063 0.197466 \n", "1571103494 0.236364 \n", "4900730194 0.183824 \n", "727283884786876416 0.230388 \n", "... ... \n", "3034331163 0.196899 \n", "306347478 0.197648 \n", "378974198 0.202433 \n", "196211756 0.166169 \n", "4059261852 0.158607 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "USER_ID \n", "3635069722 337 \n", "4475963063 500 \n", "1571103494 201 \n", "4900730194 139 \n", "727283884786876416 599 \n", "... ... \n", "3034331163 69 \n", "306347478 396 \n", "378974198 148 \n", "196211756 313 \n", "4059261852 423 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY_NEG \\\n", "USER_ID \n", "3635069722 0.231297 \n", "4475963063 0.218436 \n", "1571103494 0.243636 \n", "4900730194 0.255515 \n", "727283884786876416 0.270063 \n", "... ... \n", "3034331163 0.053488 \n", "306347478 0.179104 \n", "378974198 0.128584 \n", "196211756 0.155721 \n", "4059261852 0.143050 \n", "\n", " NUM_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 497 \n", "4475963063 863 \n", "1571103494 351 \n", "4900730194 208 \n", "727283884786876416 986 \n", "... ... 
\n", "3034331163 292 \n", "306347478 738 \n", "378974198 351 \n", "196211756 576 \n", "4059261852 816 \n", "\n", " PERC_TWEETS_WITH_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 0.341112 \n", "4475963063 0.377021 \n", "1571103494 0.425455 \n", "4900730194 0.382353 \n", "727283884786876416 0.444545 \n", "... ... \n", "3034331163 0.226357 \n", "306347478 0.333786 \n", "378974198 0.304952 \n", "196211756 0.286567 \n", "4059261852 0.275955 \n", "\n", " NUM_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 65 \n", "4475963063 89 \n", "1571103494 45 \n", "4900730194 31 \n", "727283884786876416 124 \n", "... ... \n", "3034331163 31 \n", "306347478 95 \n", "378974198 30 \n", "196211756 71 \n", "4059261852 76 \n", "\n", " PERC_TWEETS_WITH_MIXED_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 0.044612 \n", "4475963063 0.038882 \n", "1571103494 0.054545 \n", "4900730194 0.056985 \n", "727283884786876416 0.055906 \n", "... ... \n", "3034331163 0.024031 \n", "306347478 0.042967 \n", "378974198 0.026064 \n", "196211756 0.035323 \n", "4059261852 0.025702 \n", "\n", " NUM_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY \\\n", "USER_ID \n", "3635069722 432 \n", "4475963063 774 \n", "1571103494 306 \n", "4900730194 177 \n", "727283884786876416 862 \n", "... ... \n", "3034331163 261 \n", "306347478 643 \n", "378974198 321 \n", "196211756 505 \n", "4059261852 740 \n", "\n", " PERC_TWEETS_WITH_SINGLE_SENTI_LEX_POLARITY GROUP \n", "USER_ID \n", "3635069722 0.296500 DEPRESSIVE \n", "4475963063 0.338139 DEPRESSIVE \n", "1571103494 0.370909 DEPRESSIVE \n", "4900730194 0.325368 DEPRESSIVE \n", "727283884786876416 0.388638 DEPRESSIVE \n", "... ... ... 
\n", "3034331163 0.202326 CONTROL \n", "306347478 0.290819 CONTROL \n", "378974198 0.278888 CONTROL \n", "196211756 0.251244 CONTROL \n", "4059261852 0.250254 CONTROL \n", "\n", "[540 rows x 131 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0_w_t" ] }, { "cell_type": "markdown", "id": "4369affd-ddc7-4ae0-8196-b82b4bd27126", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "8278d84f-63ad-4e2c-a697-cbcc159b881b", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "cb980831-bed6-442e-9f4c-8832fe533849", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "bb3147ab-398a-4289-9545-334c402c8238", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "b23ded21-b784-453b-83a1-4ee787f80674", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "a48edce7-40ea-4997-9698-54e3534e7837", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "f152975d-8402-446d-8863-8e7f852092ef", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "fda6e7fe-d12e-43c2-95e2-cde7ff27f497", "metadata": {}, "outputs": [], "source": [ "df0_w_t.to_csv('datasets/d0.csv', index=False, sep=';')\n", "df1_w_t.to_csv('datasets/d1.csv', index=False, sep=';')\n", "df2_w_t.to_csv('datasets/d2.csv', index=False, sep=';')\n", "df3_w_t.to_csv('datasets/d3.csv', index=False, sep=';')\n", "df4_w_t.to_csv('datasets/d4.csv', index=False, sep=';')\n", "df5_w_t.to_csv('datasets/d5.csv', index=False, sep=';')\n", "df6_w_t.to_csv('datasets/d6.csv', index=False, sep=';')\n", "df7_w_t.to_csv('datasets/d7.csv', index=False, sep=';')\n", "df8_w_t.to_csv('datasets/d8.csv', index=False, sep=';')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", 
"mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }