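While analyzing game data, I ended up working with access logs (time series objects), so I put together these notes on the basics.
Let's start by creating some sample data.
The values are strings in 'year-month-day hour:minute:second' form, which is the usual layout of a timestamp.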
In [1]:
import pandas as pd
In [2]:
# sample data
df = pd.DataFrame()
df['timestamp'] = ['2021-2-3 1:30:1.273823', '2021-2-3 3:24:5.382712', '2021-2-3 10:19:13.293104', '2021-2-4 1:50:32.38172', '2021-2-4 13:47:9.600381', '2021-2-4 12:30:1.34521']
df
Out[2]:
 | timestamp |
---|---|
0 | 2021-2-3 1:30:1.273823 |
1 | 2021-2-3 3:24:5.382712 |
2 | 2021-2-3 10:19:13.293104 |
3 | 2021-2-4 1:50:32.38172 |
4 | 2021-2-4 13:47:9.600381 |
5 | 2021-2-4 12:30:1.34521 |
In [3]:
type(df['timestamp'][0])
Out[3]:
str
Converting date/time strings to time series (Timestamp) objects
In [4]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
In [5]:
type(df['timestamp'][0])
Out[5]:
pandas._libs.tslibs.timestamps.Timestamp
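Real access logs often contain a few malformed entries. As a minimal sketch (the `raw` series and its broken second value are made up for illustration), passing `errors='coerce'` to `pd.to_datetime` turns anything unparseable into `NaT` instead of raising an exception:

```python
import pandas as pd

# Hypothetical log values; the second entry is deliberately malformed.
raw = pd.Series(['2021-02-03 01:30:01', 'not a date', '2021-02-04 13:47:09'])

# errors='coerce' converts unparseable strings to NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

# Boolean mask marking the rows that failed to parse.
bad_rows = parsed.isna()
print(parsed)
print(bad_rows.sum(), 'row(s) could not be parsed')
```

The rows flagged in `bad_rows` can then be inspected or dropped before any further time calculations.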
Computing the difference between consecutive timestamps
In [6]:
df['time_gap'] = df['timestamp'].diff()
df
Out[6]:
 | timestamp | time_gap |
---|---|---|
0 | 2021-02-03 01:30:01.273823 | NaT |
1 | 2021-02-03 03:24:05.382712 | 0 days 01:54:04.108889 |
2 | 2021-02-03 10:19:13.293104 | 0 days 06:55:07.910392 |
3 | 2021-02-04 01:50:32.381720 | 0 days 15:31:19.088616 |
4 | 2021-02-04 13:47:09.600381 | 0 days 11:56:37.218661 |
5 | 2021-02-04 12:30:01.345210 | -1 days +22:42:51.744829 |
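The negative gap in row 5 appears because the sample is not in chronological order: row 5 (12:30) is earlier than row 4 (13:47). If the negative value is not wanted, one option is to sort before calling `diff()`. The sketch below uses a small standalone series (`ts` is an illustrative name, not part of the notebook above):

```python
import pandas as pd

# Three timestamps, deliberately out of chronological order.
ts = pd.to_datetime(pd.Series([
    '2021-02-04 13:47:09',
    '2021-02-04 12:30:01',
    '2021-02-03 10:19:13',
]))

# Sort chronologically and rebuild the index so diff() compares neighbours in time.
sorted_ts = ts.sort_values().reset_index(drop=True)
gaps = sorted_ts.diff()
print(gaps)
```

After sorting, every gap is zero or positive. Whether to sort depends on the analysis: an out-of-order row can itself be a useful signal of a logging problem.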
Handling missing values
In [7]:
import datetime as dt
df.replace({pd.NaT: dt.timedelta(days = 0, hours = 0, minutes = 0, seconds = 0)}, inplace=True)
df
Out[7]:
 | timestamp | time_gap |
---|---|---|
0 | 2021-02-03 01:30:01.273823 | 0 days 00:00:00 |
1 | 2021-02-03 03:24:05.382712 | 0 days 01:54:04.108889 |
2 | 2021-02-03 10:19:13.293104 | 0 days 06:55:07.910392 |
3 | 2021-02-04 01:50:32.381720 | 0 days 15:31:19.088616 |
4 | 2021-02-04 13:47:09.600381 | 0 days 11:56:37.218661 |
5 | 2021-02-04 12:30:01.345210 | -1 days +22:42:51.744829 |
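`df.replace` scans every column of the frame, so it would also overwrite any `NaT` in `timestamp`. As an alternative sketch (the `demo` frame is a stand-in for the one above), `fillna` on the single column limits the change to `time_gap`:

```python
import pandas as pd

# Small stand-in frame with the same two columns as the notebook.
demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-02-03 01:30:01', '2021-02-03 03:24:05']),
})

# diff() leaves NaT in the first row; fillna replaces it with a zero-length Timedelta.
demo['time_gap'] = demo['timestamp'].diff().fillna(pd.Timedelta(0))
print(demo)
```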
Instantiating a Timedelta object for comparisons
- Finding rows where time_gap is less than 2 hours
In [8]:
twohour = dt.timedelta(days = 0, hours = 2)
In [9]:
for i in range(len(df)):
    if df['time_gap'][i] < twohour:
        print(df['time_gap'][i])
0 days 00:00:00
0 days 01:54:04.108889
-1 days +22:42:51.744829
- Finding rows where time_gap is negative
In [10]:
zerosec = dt.timedelta(seconds = 0)
In [11]:
for i in range(len(df)):
    if df['time_gap'][i] < zerosec:
        print(df['time_gap'][i])
-1 days +22:42:51.744829
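The two loops above can also be written as boolean masks, which is usually the more idiomatic pandas form. This is a sketch on a standalone series; the values in `gaps` are the notebook's gaps rounded to whole seconds and entered by hand:

```python
import pandas as pd

# Gaps in seconds: 0, 01:54:04, 06:55:07, and a negative gap of 01:17:09.
gaps = pd.Series(pd.to_timedelta([0, 6844, 24907, -4629], unit='s'))

# Comparing a timedelta Series with a Timedelta yields a boolean mask.
under_two_hours = gaps[gaps < pd.Timedelta(hours=2)]
negative = gaps[gaps < pd.Timedelta(0)]

print(under_two_hours)
print(negative)
```

Note that the negative gap also counts as "less than 2 hours", just as in the loop output above. To exclude it, combine both conditions with `&`.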
Splitting the date and the time
In [12]:
df['date'] = df['timestamp'].dt.date
df['time'] = df['timestamp'].dt.time
df
Out[12]:
 | timestamp | time_gap | date | time |
---|---|---|---|---|
0 | 2021-02-03 01:30:01.273823 | 0 days 00:00:00 | 2021-02-03 | 01:30:01.273823 |
1 | 2021-02-03 03:24:05.382712 | 0 days 01:54:04.108889 | 2021-02-03 | 03:24:05.382712 |
2 | 2021-02-03 10:19:13.293104 | 0 days 06:55:07.910392 | 2021-02-03 | 10:19:13.293104 |
3 | 2021-02-04 01:50:32.381720 | 0 days 15:31:19.088616 | 2021-02-04 | 01:50:32.381720 |
4 | 2021-02-04 13:47:09.600381 | 0 days 11:56:37.218661 | 2021-02-04 | 13:47:09.600381 |
5 | 2021-02-04 12:30:01.345210 | -1 days +22:42:51.744829 | 2021-02-04 | 12:30:01.345210 |
In [13]:
type(df['date'][0])
Out[13]:
datetime.date
In [14]:
type(df['time'][0])
Out[14]:
datetime.time
Note that diff() works between datetime.date values, but not between datetime.time values.
In [15]:
date_diff = df['date'].diff()
date_diff
Out[15]:
0 NaT
1 0 days
2 0 days
3 1 days
4 0 days
5 0 days
Name: date, dtype: timedelta64[ns]
In [16]:
date_diff = df['date'][3] - df['date'][2]
date_diff
Out[16]:
datetime.timedelta(days=1)
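`datetime.time` objects do not support subtraction, which is why `diff()` fails on the `time` column. If a difference in time of day is needed, one possible workaround (a sketch, not the only approach) is to express the time of day as a Timedelta since midnight by subtracting the normalized timestamp:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    '2021-02-03 01:30:01',
    '2021-02-03 03:24:05',
    '2021-02-04 01:50:32',
]))

# dt.normalize() resets each timestamp to midnight of the same day,
# so the subtraction gives the time elapsed since midnight as a Timedelta.
time_of_day = ts - ts.dt.normalize()

# Timedelta values support diff(), unlike datetime.time objects.
time_of_day_diff = time_of_day.diff()
print(time_of_day_diff)
```

The result ignores the date entirely, so the third row shows a negative difference even though it falls on a later day.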