[Pandas] Handling Time-Series Data (timestamp)

While analyzing game data I ended up working with access logs (time-series objects), so I have written up the basics here.


First, let's create some sample data.

The data are strings in 'year-month-day hour:minute:second' form (the basic timestamp format).

 

In [1]:

import pandas as pd

In [2]:

# sample data
df = pd.DataFrame()
df['timestamp'] = ['2021-2-3 1:30:1.273823', '2021-2-3 3:24:5.382712', '2021-2-3 10:19:13.293104', '2021-2-4 1:50:32.38172', '2021-2-4 13:47:9.600381', '2021-2-4 12:30:1.34521']
df

Out[2]:

  timestamp
0 2021-2-3 1:30:1.273823
1 2021-2-3 3:24:5.382712
2 2021-2-3 10:19:13.293104
3 2021-2-4 1:50:32.38172
4 2021-2-4 13:47:9.600381
5 2021-2-4 12:30:1.34521

In [3]:

type(df['timestamp'][0])

Out[3]:

str

 

Converting date/time strings to time-series (Timestamp) objects

In [4]:

df['timestamp'] = pd.to_datetime(df['timestamp'])

In [5]:

type(df['timestamp'][0])

Out[5]:

pandas._libs.tslibs.timestamps.Timestamp
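
Real log data often contains a few malformed rows. As a minimal sketch (the 'not a date' value below is made up for illustration and is not part of the sample data), passing errors='coerce' to pd.to_datetime should turn unparseable strings into NaT rather than raising:

```python
import pandas as pd

# The second value is a hypothetical malformed log entry.
raw = pd.Series(['2021-02-03 01:30:01.273823', 'not a date'])

# errors='coerce' converts values that cannot be parsed into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed.dtype)         # a datetime64 dtype rather than object
print(parsed.isna().sum())  # number of rows that failed to parse
```

Checking the column dtype this way is also a quicker sanity check than calling type() on a single element.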

 

Computing the difference between consecutive timestamps

In [6]:

df['time_gap'] = df['timestamp'].diff()
df

Out[6]:

  timestamp time_gap
0 2021-02-03 01:30:01.273823 NaT
1 2021-02-03 03:24:05.382712 0 days 01:54:04.108889
2 2021-02-03 10:19:13.293104 0 days 06:55:07.910392
3 2021-02-04 01:50:32.381720 0 days 15:31:19.088616
4 2021-02-04 13:47:09.600381 0 days 11:56:37.218661
5 2021-02-04 12:30:01.345210 -1 days +22:42:51.744829
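
The negative gap in row 5 appears because diff() subtracts the previous row as stored, and row 5 (12:30) is earlier than row 4 (13:47). If the goal is the elapsed time between consecutive events, one option is to sort first. The sketch below rebuilds the same sample values in zero-padded form; the names ordered and gap_seconds are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime([
    '2021-02-03 01:30:01.273823', '2021-02-03 03:24:05.382712',
    '2021-02-03 10:19:13.293104', '2021-02-04 01:50:32.381720',
    '2021-02-04 13:47:09.600381', '2021-02-04 12:30:01.345210',
])})

# Sort chronologically so every gap is measured against the previous event.
ordered = df.sort_values('timestamp').reset_index(drop=True)
ordered['time_gap'] = ordered['timestamp'].diff()

# total_seconds() gives a plain float column that is easy to aggregate.
ordered['gap_seconds'] = ordered['time_gap'].dt.total_seconds()
print(ordered)
```

Whether to sort depends on the analysis: if the stored order is meaningful, the negative gap is itself a useful signal.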

 

Handling missing values

In [7]:

import datetime as dt

# Replace the NaT that diff() leaves in the first row with a zero-length timedelta
df.replace({pd.NaT: dt.timedelta(days = 0, hours = 0, minutes = 0, seconds = 0)}, inplace=True)
df

Out[7]:

  timestamp time_gap
0 2021-02-03 01:30:01.273823 0 days 00:00:00
1 2021-02-03 03:24:05.382712 0 days 01:54:04.108889
2 2021-02-03 10:19:13.293104 0 days 06:55:07.910392
3 2021-02-04 01:50:32.381720 0 days 15:31:19.088616
4 2021-02-04 13:47:09.600381 0 days 11:56:37.218661
5 2021-02-04 12:30:01.345210 -1 days +22:42:51.744829
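
Note that df.replace looks at every column, so a NaT in the timestamp column would be replaced as well. A narrower alternative, sketched here on a shortened copy of the sample data, is to call fillna on the time_gap values only:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2021-02-03 01:30:01.273823',
    '2021-02-03 03:24:05.382712',
    '2021-02-03 10:19:13.293104',
]))

gaps = ts.diff()  # the first element is NaT

# Fill only this Series; other columns are left untouched.
filled = gaps.fillna(pd.Timedelta(0))
print(filled)
```

Treating the first row as a zero gap is a modelling choice; leaving it as NaT is equally valid if later steps can skip missing values.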

 

Instantiating a Timedelta object for comparisons

 

  • Finding rows where time_gap is less than 2 hours

In [8]:

twohour = dt.timedelta(days = 0, hours = 2)

In [9]:

for i in range(len(df)):
    if df['time_gap'][i] < twohour:
        print(df['time_gap'][i])

0 days 00:00:00
0 days 01:54:04.108889
-1 days +22:42:51.744829
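
The same filter can be written without an explicit loop. This is a sketch using a boolean mask on the sample values, with the first gap filled with zero as in the previous step; the name under_two_hours is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime([
    '2021-02-03 01:30:01.273823', '2021-02-03 03:24:05.382712',
    '2021-02-03 10:19:13.293104', '2021-02-04 01:50:32.381720',
    '2021-02-04 13:47:09.600381', '2021-02-04 12:30:01.345210',
])})
df['time_gap'] = df['timestamp'].diff().fillna(pd.Timedelta(0))

# Comparing the whole column with a Timedelta yields a boolean Series,
# which then selects the matching rows in one step.
under_two_hours = df[df['time_gap'] < pd.Timedelta(hours=2)]
print(under_two_hours)
```

The mask keeps the whole row, so the matching timestamps come along with the gaps.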

 

  • Finding rows where time_gap is negative

In [10]:

zerosec = dt.timedelta(seconds = 0)

In [11]:

for i in range(len(df)):
    if df['time_gap'][i] < zerosec:
        print(df['time_gap'][i])

-1 days +22:42:51.744829
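
A negative gap means a row is older than the one before it, so this check doubles as a test for out-of-order log entries. A short sketch on the sample values, using the Series attribute is_monotonic_increasing together with a mask:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2021-02-03 01:30:01.273823', '2021-02-03 03:24:05.382712',
    '2021-02-03 10:19:13.293104', '2021-02-04 01:50:32.381720',
    '2021-02-04 13:47:09.600381', '2021-02-04 12:30:01.345210',
]))

# False when at least one timestamp is earlier than its predecessor.
in_order = ts.is_monotonic_increasing

# Index labels of the rows that break the chronological order.
gaps = ts.diff()
out_of_order = gaps[gaps < pd.Timedelta(0)].index.tolist()
print(in_order, out_of_order)
```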

 

Separating the date and the time

In [12]:

df['date'] = df['timestamp'].dt.date
df['time'] = df['timestamp'].dt.time
df

Out[12]:

  timestamp time_gap date time
0 2021-02-03 01:30:01.273823 0 days 00:00:00 2021-02-03 01:30:01.273823
1 2021-02-03 03:24:05.382712 0 days 01:54:04.108889 2021-02-03 03:24:05.382712
2 2021-02-03 10:19:13.293104 0 days 06:55:07.910392 2021-02-03 10:19:13.293104
3 2021-02-04 01:50:32.381720 0 days 15:31:19.088616 2021-02-04 01:50:32.381720
4 2021-02-04 13:47:09.600381 0 days 11:56:37.218661 2021-02-04 13:47:09.600381
5 2021-02-04 12:30:01.345210 -1 days +22:42:51.744829 2021-02-04 12:30:01.345210

In [13]:

type(df['date'][0])

Out[13]:

datetime.date

In [14]:

type(df['time'][0])

Out[14]:

datetime.time
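
Because .dt.date and .dt.time return plain Python objects, the resulting columns no longer have a datetime64 dtype. If vectorized operations are still needed afterwards, one alternative (a sketch, not the approach used above) is dt.normalize(), which resets the time to midnight but keeps the dtype, together with component accessors such as dt.hour:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2021-02-03 01:30:01.273823', '2021-02-03 03:24:05.382712',
    '2021-02-03 10:19:13.293104', '2021-02-04 01:50:32.381720',
    '2021-02-04 13:47:09.600381', '2021-02-04 12:30:01.345210',
]))

# Midnight of each day; the result is still a datetime64 Series.
day = ts.dt.normalize()

# Integer hour of day, handy for grouping logins by hour.
hour = ts.dt.hour
print(day.dtype, hour.tolist())
```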

 

Note that diff() works on a column of datetime.date values but not on a column of datetime.time values, because datetime.time objects do not support subtraction.

 

In [15]:

date_diff = df['date'].diff()
date_diff

Out[15]:

0      NaT
1   0 days
2   0 days
3   1 days
4   0 days
5   0 days
Name: date, dtype: timedelta64[ns]

In [16]:

date_diff = df['date'][3] - df['date'][2]
date_diff

Out[16]:

datetime.timedelta(days=1)
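
For the time side, subtracting two datetime.time objects raises a TypeError. One possible workaround, sketched below, is to represent the time of day as a Timedelta measured from midnight, which does support diff(). The variable names are illustrative.

```python
import datetime as dt
import pandas as pd

# datetime.time does not define subtraction.
try:
    dt.time(1, 50, 32) - dt.time(10, 19, 13)
    raised = False
except TypeError:
    raised = True

ts = pd.Series(pd.to_datetime(['2021-02-03 10:19:13', '2021-02-04 01:50:32']))

# Time elapsed since midnight of the same day, as a timedelta Series.
time_of_day = ts - ts.dt.normalize()
tod_diff = time_of_day.diff()
print(raised, tod_diff.iloc[1])
```

The difference here ignores the date entirely, so it comes out negative when the later row has an earlier clock time.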