[Kaggle Pandas] Python basic code
# Pandas
import pandas as pd
# Creating
pd.DataFrame({'Yes': [50,21], 'No': [131, 2]})
df = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
print(df)
print(pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A'))
# Reading
df = pd.read_csv("asdf", index_col=0)
# Choosing between loc and iloc
When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.
iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.
Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples', 'Potatoet'] (t coming after s in the alphabet).
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] will return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
Otherwise, the semantics of using loc are the same as those for iloc.
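A minimal sketch of the difference, assuming a hypothetical DataFrame named demo with a plain 0-based integer index:
import pandas as pd
demo = pd.DataFrame({'value': range(10)})
print(demo.iloc[0:3]) # end excluded: rows 0, 1, 2 (3 rows)
print(demo.loc[0:3])  # end included: rows 0, 1, 2, 3 (4 rows)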
# Manipulating the index
df.set_index("column name")
# Conditional Select
df.loc[df.column_name == 'column value']
df.loc[(df.column_name1 == 'column value1') & (df.column_name2 >= 'column value2')]
df.loc[df.column_name.isin(['cand1', 'cand2'])]
df.loc[df.column_name.notnull()]
# Assigning data
df['column_name'] = 'column_value'
df['column_name'] = range(len(df), 0, -1)
# Summary functions
df.column_name.describe() # count, mean, std, min/max, quartiles, etc.
df.column_name.value_counts()
# Maps
mean = df.column_name.mean()
df.column_name.map(lambda x: x - mean)
def remain_func(row):
    row.column_name = row.column_name - mean
    return row
df.apply(remain_func, axis='columns')
df.column1 + "-" + df.column2
# Grouping
df.groupby('ref_column').target_column.count() # ref can be same with target
df.groupby('ref').apply(lambda x: x.title.iloc[0])
df.groupby(['ref1','ref2']).apply(lambda x: x.loc[x.target_column.idxmax()])
'''Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:'''
df.groupby(['ref']).target_column.agg([len, min, max])
'''However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:'''
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index()
# Sorting
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_values(by=['country', 'len'])
countries_reviewed.sort_index() # sort by index
# Data Types
reviews.dtypes # show the dtype of each column
reviews.points.astype('float64') #convert datatype
# Missing data
df[pd.isnull(df.column_name)]
df.column_name.fillna('Unknown')
df.column_name.replace('from', 'to')
# Renaming
df.rename(columns={'from':'to'})
df.rename(index={'from':'to'})
'''Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names. For example:'''
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
# Combining
# concat
pd.concat([canadian_youtube, british_youtube]) # default is axis=0, so frames are stacked row-wise (top to bottom)
# pass ignore_index=True to rebuild the index
result3_in = pd.concat([df1, df2], axis=1, join='inner') # column-wise (axis=1); join='inner' keeps the intersection of indexes, 'outer' the union
# merge
merge_outer = pd.merge(df1, df2, how='outer', on='id')
merge_left = pd.merge(df1, df2, how='left', left_on='stock_name', right_on='name')
# join: operates on the index (implemented on top of merge)
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK', how='inner') # or how='outer'